Initial upload of VQSR workflow #470

LindoNkambule · 2022-07-22T19:40:30Z

Hi @jkgoodrich, @konradjk, cc @wlu04. Please have a look at the VQSR workflow and review.

I decided to put the resource file paths and the parameters required by the workflow in vqsr_resources.json. This avoids having too many function/command-line arguments in case people want to change the values or files.

There's also a main() function using argparse in case people want to copy the script and run it from the command line. Not sure if this is necessary though...
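A minimal sketch of the pattern described above: defaults live in a JSON file, and a small argparse entry point can override individual values. Only the file name vqsr_resources.json and the UNPADDED_INTERVALS key come from this thread; the `--unpadded-intervals` flag, the helper names, and the placeholder path are assumptions made for illustration, not the PR's actual interface.

```python
import argparse
import json
import tempfile
from pathlib import Path


def load_resources(path, overrides=None):
    """Read the resources JSON and apply any non-None overrides on top."""
    with open(path) as handle:
        resources = json.load(handle)
    for key, value in (overrides or {}).items():
        if value is not None:
            resources[key] = value
    return resources


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical VQSR wrapper")
    parser.add_argument("--resources", required=True, help="Path to the resources JSON")
    # Hypothetical override flag; the real script may not expose this.
    parser.add_argument("--unpadded-intervals", default=None)
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Demo with a throwaway JSON file so the sketch runs on its own.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "vqsr_resources.json"
        demo.write_text(json.dumps({"UNPADDED_INTERVALS": "placeholder/intervals.list"}))
        args = parse_args(["--resources", str(demo)])
        overrides = {"UNPADDED_INTERVALS": args.unpadded_intervals}
        print(load_resources(args.resources, overrides))
```

Keeping the overrides optional means the JSON stays the single source of defaults, while a one-off run can still swap a file without editing it.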

matren395 · 2023-09-07T13:13:13Z

gnomad/variant_qc/vqsr.py

+            out_bucket=out_bucket,
+        )
+
+    # return recalibrated_gathered_vcf_job


Does this function not return anything?

Oh I see it now, gather_vcfs() and apply_recalibration() would write to the bucket that you want in the end.

danking · 2023-10-10T16:31:15Z

Hey all, this is the sort of thing I'd love to pull into Hail mainline. Perhaps best to discuss after v4 is released, but I'd like to ask: Are y'all interested in also/alternatively PR'ing this to Hail?

I want to strike a balance of:

1. People with relevant knowledge (mostly you all) maintain the code (and perhaps update it as y'all's thinking about VQSR changes).
2. The interface is stable and easy to use for a wide array of Hail users (which I see as either antagonistic or at best extra work for you all; but exactly the work that Hail is in the business of doing).

Additionally, I want to incorporate this into an example VDS combiner pipeline because folks expect a "jointly-called" dataset to include these annotations.

matren395 · 2023-10-10T16:49:50Z

Oh hello, I would both 1) be very, very interested in PRing this into Hail, and 2) politely very much like to hold this until after v4. I believe KC is handling post-v4 priorities at the moment, so I can definitely add this to the list of things to discuss post-v4!


konradjk · 2023-10-13T17:07:35Z

I would probably advocate for keeping these two ideas independent. We can get this into gnomad_methods, test it out, and make sure people can use it. Then, when we have more bandwidth, we can push it into Hail, where we need to do more robustness testing anyway.

danking · 2024-01-29T15:29:55Z

@LindoNkambule Is this still the latest and greatest VQSR batch pipeline?

Hail team just generated wave 2 of BGE and I'd like to deliver it to the analysts with VQSR results rather than without.

And is this how you generate the input VCF? If not, could you share the code you use?

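import hail as hl
import gnomad.utils.sparse_mt

# vds_file (input VDS path) and out (output VCF path) are supplied by the caller.
vds = hl.vds.read_vds(vds_file)
mt = hl.vds.to_dense_mt(vds)
# Compute the site-level info annotations.
t = gnomad.utils.sparse_mt.default_compute_info(mt)
# Drop these info fields and the AS_lowqual annotation before export.
t = t.annotate(info=t.info.drop(
    'AS_SB_TABLE', 'AS_QUALapprox', 'AS_VarDP', 'AS_SOR', 'AC_raw', 'AC', 'AS_SB'
))
t = t.drop('AS_lowqual')
hl.export_vcf(dataset=t, output=out, tabix=True)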

danking · 2024-01-29T15:33:57Z

And what, in your expert opinion, should we use for these parameters? Are the singletons important for a dataset like BGE which, I presume, doesn't have many trios?

    parser.add_argument('--transmitted-singletons', type=str, required=False)
    parser.add_argument('--sibling-singletons', type=str, required=False)
    parser.add_argument('--no-as-annotations', action='store_true')

LindoNkambule · 2024-01-29T17:58:29Z

Hi @danking, yes, it is the latest pipeline. It's not the greatest, though: the intervals file (UNPADDED_INTERVALS in vqsr_resources.json) does not include chrM, and some regions on chrY are not covered, so there might be a small drop in the number of variants post-VQSR if the input has chrM and/or chrY variants.

Yes, that is how I usually generate the input VCF.

If there aren't many trios, then I think --transmitted-singletons isn't necessary. @konradjk can you weigh in on this?

danking · 2024-01-29T18:54:10Z

@LindoNkambule is the UNPADDED_INTERVALS meant to be the calling intervals used to generate the source GVCFs? I'm pretty sure we should be able to get from the PMs the calling intervals for our GVCFs.

LindoNkambule · 2024-01-29T19:08:36Z

Yes, I just used the current one since Laura said the intervals were optimized for runtime (approximately even runtime). Now that I think about it, I guess I could write a function that takes the input and creates the intervals based on the input file...
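One way such a function could look, as a rough sketch only: it takes (contig, position) pairs, assumed here to have been read from the input sites, and emits one spanning interval per contig. The function name and the toy data are hypothetical, and a real version would presumably read loci from the VCF or VDS rather than from a Python list.

```python
def spanning_intervals(variants):
    """Return one (contig, start, end) interval per contig, spanning all variants.

    `variants` is an iterable of (contig, position) pairs with 1-based positions.
    Contigs are reported in order of first appearance.
    """
    bounds = {}
    for contig, pos in variants:
        if contig in bounds:
            lo, hi = bounds[contig]
            bounds[contig] = (min(lo, pos), max(hi, pos))
        else:
            bounds[contig] = (pos, pos)
    return [(contig, lo, hi) for contig, (lo, hi) in bounds.items()]


# Toy input; the positions are made up for illustration.
variants = [("chr1", 10_500), ("chr1", 900), ("chrM", 73), ("chrY", 2_781_500)]
for contig, start, end in spanning_intervals(variants):
    print(f"{contig}:{start}-{end}")
```

Because the intervals are derived from the data itself, contigs such as chrM or chrY are included whenever the input contains them.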

danking · 2024-01-29T19:43:24Z

I just Slacked Laura, and she mentioned:

The handcurated intervals were optimized for GenotypeGVCFs runtime, so probably not comparable for VQSR runtime

I think either generating intervals directly from the VDS or using the original calling intervals should be fine. AFAICT, the only important part is that the intervals in the interval list contain every variant in the VDS/sites-only-VCF.
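The containment requirement can be checked directly. The sketch below is plain Python with hypothetical names and toy coordinates (1-based, inclusive); it reports every variant that falls outside the interval list, so an empty result means the intervals are safe to use. It is an illustration of the check, not code from the PR.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(intervals):
    """Map each contig to (starts, ends) of its merged, sorted intervals."""
    by_contig = defaultdict(list)
    for contig, start, end in intervals:
        by_contig[contig].append((start, end))
    index = {}
    for contig, spans in by_contig.items():
        merged = []
        for start, end in sorted(spans):
            # Merge overlapping or abutting intervals so lookups stay simple.
            if merged and start <= merged[-1][1] + 1:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[contig] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def uncovered_variants(variants, intervals):
    """Return the (contig, position) pairs that fall outside every interval."""
    index = build_index(intervals)
    missing = []
    for contig, pos in variants:
        starts, ends = index.get(contig, ([], []))
        i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
        if i < 0 or pos > ends[i]:
            missing.append((contig, pos))
    return missing


# Toy data: chrM has no interval and chr1:1500 falls in a gap.
intervals = [("chr1", 1, 1000), ("chr1", 2000, 3000), ("chrY", 1, 500)]
variants = [("chr1", 500), ("chr1", 1500), ("chrM", 73), ("chrY", 500)]
print(uncovered_variants(variants, intervals))
```

Running a check like this before VQSR would surface the chrM/chrY drop mentioned earlier in the thread instead of leaving it to show up as a lower variant count afterwards.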

Do you know if SplitIntervals merely partitions the intervals or will it actually cut intervals up? If the latter is true, then we could just send it

chr1:1-NNN
chr2:1-MMM
...

and have it split automatically. If the number of splits is sufficiently small and the number of variants in the dataset sufficiently large, I don't think there will be too much imbalance. Maybe the splits containing the telomeres would be a bit light on actual variant data.
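To make the "cut whole-contig intervals into even pieces" idea concrete, here is a standalone sketch. It does not call or reproduce GATK SplitIntervals, whose exact behavior is the open question above; it only shows what cutting by base-pair length would produce. The function name and the tiny contig lengths are hypothetical.

```python
def split_contigs(contig_lengths, n_chunks):
    """Cut whole-contig intervals into `n_chunks` groups of near-equal total length.

    `contig_lengths` is an ordered list of (contig, length) pairs. Each returned
    chunk is a list of (contig, start, end) pieces with 1-based inclusive
    coordinates; a contig is cut wherever a chunk boundary falls inside it.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    total = sum(length for _, length in contig_lengths)
    # Cumulative base count at which each chunk ends.
    boundaries = [total * (i + 1) // n_chunks for i in range(n_chunks)]
    chunks = [[] for _ in range(n_chunks)]
    chunk = 0
    offset = 0  # bases contributed by earlier contigs
    for contig, length in contig_lengths:
        start = 1
        while start <= length:
            # Advance past chunks that are already full.
            while offset + start - 1 >= boundaries[chunk]:
                chunk += 1
            room = boundaries[chunk] - (offset + start - 1)
            end = min(length, start + room - 1)
            chunks[chunk].append((contig, start, end))
            start = end + 1
        offset += length
    return chunks


# Toy contig lengths, chosen only to keep the output readable.
for i, pieces in enumerate(split_contigs([("chr1", 150), ("chr2", 50)], 2)):
    print(i, pieces)
```

Splitting by length balances base pairs rather than variants, which is why sparse regions such as telomeres could still leave some chunks lighter than others.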

konradjk · 2024-01-29T20:03:38Z

I'm not quite sure what the question is, but yes, if you don't really have many trios or sib pairs, the transmitted and sibling singleton stuff won't do much. Using it should be encouraged, but it certainly shouldn't be required from a code POV.
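"Encouraged but not required" could translate into code roughly as follows. This is a sketch under stated assumptions: the function, the resource names, and the placeholder paths are all hypothetical and are not taken from the PR. Singleton sets are added to the training resources only when a path is supplied.

```python
def training_resources(base_resources, transmitted_singletons=None, sibling_singletons=None):
    """Return the training-resource mapping, adding singleton sets only when given.

    All names here are illustrative; they are not the PR's actual identifiers.
    """
    resources = dict(base_resources)  # copy so the caller's mapping is untouched
    if transmitted_singletons:
        resources["transmitted_singletons"] = transmitted_singletons
    if sibling_singletons:
        resources["sibling_singletons"] = sibling_singletons
    return resources


# Placeholder paths; a dataset without trios simply omits the singleton arguments.
base = {"hapmap": "placeholder/hapmap.vcf.gz", "omni": "placeholder/omni.vcf.gz"}
print(sorted(training_resources(base)))
print(sorted(training_resources(base, transmitted_singletons="placeholder/ts.vcf.gz")))
```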

danking · 2024-01-29T20:09:52Z

Don't we usually run VQSR before we've inferred sample relatedness? I don't recall manifest files indicating trios or sib relationships. Is there a usual way for us to deduce that information in preparation for VQSR?

LindoNkambule added 4 commits July 22, 2022 15:31

Initial upload of VQSR workflow

30e8c4c

make improvements

a50c5eb

use preemptible machines for model creating steps

21c1921

add check for overlap between consecutive chunks

c449f9d

jkgoodrich added the Variant QC label May 16, 2023

matren395 reviewed Sep 7, 2023

jkgoodrich added the v4 label Sep 10, 2023

jkgoodrich assigned jkgoodrich and matren395 Sep 21, 2023

jkgoodrich removed v4 Variant QC labels Oct 13, 2023

jkgoodrich unassigned matren395 Oct 13, 2023

danking mentioned this pull request Oct 23, 2023

Implement support for VQSR in QoB hail-is/hail#12905

Open

use bcftools to concat chunks and also optimized resource requirements

b014205

ch-kr added the Variant QC label Dec 1, 2023

jkgoodrich marked this pull request as draft December 18, 2024 20:40
