-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial upload of VQSR workflow #470
base: main
Are you sure you want to change the base?
Conversation
out_bucket=out_bucket, | ||
) | ||
|
||
# return recalibrated_gathered_vcf_job |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this function not return anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see it now, gather_vcfs() and apply_recalibration() would write to the bucket that you want in the end.
Hey all, this is the sort of thing I'd love to pull into Hail mainline. Perhaps best to discuss after v4 is released, but I'd like to ask: Are y'all interested in also/alternatively PR'ing this to Hail? I want to strike a balance of:
Additionally, I want to incorporate this into an example VDS combiner pipeline because folks expect a "jointly-called" dataset to include these annotations. |
Oh hello,
I would both 1) be very very interested in PRing this into Hail! 2)
politely would very very much like to hold this until after v4. I
believe KC is actually handling post-v4 priorities at the moment, I can
definitely add this to the list of things to be discussed post-v4!
…On Tue, Oct 10, 2023 at 12:31 PM Dan King ***@***.***> wrote:
Hey all, this is the sort of thing I'd love to pull into Hail mainline.
Perhaps best to discuss after v4 is released, but I'd like to ask: Are
y'all interested in also/alternatively PR'ing this to Hail?
I want to strike a balance of:
1. People with relevant knowledge (mostly you all) maintain the code
(and perhaps update it as y'all's thinking about VQSR changes).
2. The interface is stable and easy to use for a wide array of Hail
users (which I see as either antagonistic or at best extra work for you
all; but exactly the work that Hail is in the business of doing).
Additionally, I want to incorporate this into an example VDS combiner
pipeline because folks expect a "jointly-called" dataset to include these
annotations.
—
Reply to this email directly, view it on GitHub
<#470 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ASXZRYQZ7HVQULKURQSDKZDX6VZ57AVCNFSM54MSKCU2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCNZVGU4DAMJTGQ4Q>
.
You are receiving this because you were assigned.Message ID:
***@***.***>
|
I would probably advocate for these two ideas to be independent. We can get this into gnomad_methods, test it out, make sure people can use it, and then when we have more bandwidth, push into Hail where we need to do more robustness testing anyway |
@LindoNkambule Is this still the latest and greatest VQSR batch pipeline? Hail team just generated wave 2 of BGE and I'd like to deliver it to the analysts with VQSR results rather than without. And is this how you generate the input VCF? If not, could you share the code you use?
|
And what, in your expert opinion, should we use for these parameters? Are the singletons important for a dataset like BGE which, I presume, doesn't have many trios?
|
Hi @danking, Yes, it is the latest pipeline. Not the greatest because the intervals file ( Yes, that is how I usually generate the input VCF. If there aren't many trios, then I think |
@LindoNkambule is the |
Yes, I just used the current one since Laura said the intervals were optimized for runtime (approximately even runtime). Now that I think about it, I guess I could write a function that takes the input and creates the interval based on the input file... |
I just slacked Laura, she mentioned:
I think either generating intervals directly from the VDS or using the original calling intervals should be fine. AFAICT, the only important part is that the intervals in the interval list contain every variant in the VDS/sites-only-VCF. Do you know if SplitIntervals merely partitions the intervals or will it actually cut intervals up? If the latter is true, then we could just send it
and have it split automatically. If the number of splits is sufficiently small and the number of variants in the dataset sufficiently large, I don't think there will be too much imbalance. Maybe near the splits containing the telomeres would be a bit light on actual variant data. |
I'm not quite sure what the question is, but yes, if you don't really have many trios or sib pairs, the transmitted and sibling singleton stuff won't do much. It is something that should be encouraged to use but certainly shouldn't be required from a code POV |
Don't we usually run VQSR before we've inferred sample relatedness? I don't recall manifest files indicating trios or sib relationships. Is there a usual way for us to deduce that information in preparation for VQSR? |
Hi @jkgoodrich , @konradjk, cc @wlu04. Please have a look at the VQSR workflow and review.
I decided to put resources file paths and parameters required for the workflow to
vqsr_resources.json
to avoid having too many function/CL arguments in case people want to change the values or files.There's also a
main()
function usingargparse
in case people want to copy the script and run it from the command line. Not sure if this is neccessary though...