Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial upload of VQSR workflow #470

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

LindoNkambule
Copy link

Hi @jkgoodrich , @konradjk, cc @wlu04. Please have a look at the VQSR workflow and review.

I decided to put resources file paths and parameters required for the workflow to vqsr_resources.json to avoid having too many function/CL arguments in case people want to change the values or files.

There's also a main() function using argparse in case people want to copy the script and run it from the command line. Not sure if this is neccessary though...

out_bucket=out_bucket,
)

# return recalibrated_gathered_vcf_job
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this function not return anything?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see it now, gather_vcfs() and apply_recalibration() would write to the bucket that you want in the end.

@danking
Copy link
Contributor

danking commented Oct 10, 2023

Hey all, this is the sort of thing I'd love to pull into Hail mainline. Perhaps best to discuss after v4 is released, but I'd like to ask: Are y'all interested in also/alternatively PR'ing this to Hail?

I want to strike a balance of:

  1. People with relevant knowledge (mostly you all) maintain the code (and perhaps update it as y'all's thinking about VQSR changes).
  2. The interface is stable and easy to use for a wide array of Hail users (which I see as either antagonistic or at best extra work for you all; but exactly the work that Hail is in the business of doing).

Additionally, I want to incorporate this into an example VDS combiner pipeline because folks expect a "jointly-called" dataset to include these annotations.

@matren395
Copy link
Contributor

matren395 commented Oct 10, 2023 via email

@konradjk
Copy link
Collaborator

I would probably advocate for these two ideas to be independent. We can get this into gnomad_methods, test it out, make sure people can use it, and then when we have more bandwidth, push into Hail where we need to do more robustness testing anyway

@danking
Copy link
Contributor

danking commented Jan 29, 2024

@LindoNkambule Is this still the latest and greatest VQSR batch pipeline?

Hail team just generated wave 2 of BGE and I'd like to deliver it to the analysts with VQSR results rather than without.

And is this how you generate the input VCF? If not, could you share the code you use?

vds = hl.vds.read_vds(vds_file)
mt = hl.vds.to_dense_mt(vds)
t = gnomad.utils.sparse_mt.default_compute_info(mt)
t = t.annotate(info = t.info.drop(
    'AS_SB_TABLE', 'AS_QUALapprox', 'AS_VarDP', 'AS_SOR', 'AC_raw', 'AC', 'AS_SB'
))
t = t.drop('AS_lowqual')
hl.methods.export_vcf(dataset = t, output = out, tabix = True)

@danking
Copy link
Contributor

danking commented Jan 29, 2024

And what, in your expert opinion, should we use for these parameters? Are the singletons important for a dataset like BGE which, I presume, doesn't have many trios?

    parser.add_argument('--transmitted-singletons', type=str, required=False)
    parser.add_argument('--sibling-singletons', type=str, required=False)
    parser.add_argument('--no-as-annotations', action='store_true')

@LindoNkambule
Copy link
Author

Hi @danking, Yes, it is the latest pipeline. Not the greatest because the intervals file (UNPADDED_INTERVALS in vqsr_resources.json) does not have chrM and there were also some regions on chrY that are not covered, so there might be a small drop in the number of variants post-VQSR if the initial input has chrM and/or chrY.

Yes, that is how I usually generate the input VCF.

If there aren't many trios, then I think --transmitted-singletons isn't necessary. @konradjk can you weigh in on this?

@danking
Copy link
Contributor

danking commented Jan 29, 2024

@LindoNkambule is the UNPADDED_INTERVALS meant to be the calling intervals used to generate the source GVCFs? I'm pretty sure we should be able to get from the PMs the calling intervals for our GVCFs.

@LindoNkambule
Copy link
Author

Yes, I just used the current one since Laura said the intervals were optimized for runtime (approximately even runtime). Now that I think about it, I guess I could write a function that takes the input and creates the interval based on the input file...

@danking
Copy link
Contributor

danking commented Jan 29, 2024

I just slacked Laura, she mentioned:

The handcurated intervals were optimized for GenotypeGVCFs runtime, so probably not comparable for VQSR runtime

I think either generating intervals directly from the VDS or using the original calling intervals should be fine. AFAICT, the only important part is that the intervals in the interval list contain every variant in the VDS/sites-only-VCF.

Do you know if SplitIntervals merely partitions the intervals or will it actually cut intervals up? If the latter is true, then we could just send it

chr1:1-NNN
chr2:1-MMM
...

and have it split automatically. If the number of splits is sufficiently small and the number of variants in the dataset sufficiently large, I don't think there will be too much imbalance. Maybe near the splits containing the telomeres would be a bit light on actual variant data.

@konradjk
Copy link
Collaborator

I'm not quite sure what the question is, but yes, if you don't really have many trios or sib pairs, the transmitted and sibling singleton stuff won't do much. It is something that should be encouraged to use but certainly shouldn't be required from a code POV

@danking
Copy link
Contributor

danking commented Jan 29, 2024

Don't we usually run VQSR before we've inferred sample relatedness? I don't recall manifest files indicating trios or sib relationships. Is there a usual way for us to deduce that information in preparation for VQSR?

@jkgoodrich jkgoodrich marked this pull request as draft December 18, 2024 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants