-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
grpmax vs European AF split nonsynonymous variant analysis #597
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one debugging suggestion and some minor comments
if version == "v4": | ||
grpmax_ga_expr = t_ht.grpmax.gnomad.gen_anc | ||
else: | ||
grpmax_ga_expr = t_ht.popmax[0].pop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need this twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly yes because the filter above changes the table so the expr assignment in L108 and L110 points to the old table and breaks hail. There may be another way that my Friday brain isnt thinking of though...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah I totally forgot -- good call. I guess the alternative is probably something like
if version == "v4":
t_ht = t_ht.annotate(grpmax_ga=t_ht.grpmax.gnomad.gen_anc)
else:
t_ht = t_ht.annotate(grpmax_ga=t_ht.popmax[0].pop)
t_ht = t_ht.filter(DIVERSE_GRPS.contains(t_ht.grpmax_ga))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes -- this is better
results_by_eur_grping = {} | ||
# Filter to only non-synonymous terms CSQ_CODING_HIGH_IMPACT + | ||
# CSQ_CODING_MEDIUM_IMPACT | ||
ht = ht.filter(NS_CONSEQ_TERMS.contains(ht.vep.most_severe_consequence)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some print statements to check these filters:
v4 HT:
- Total count = 183717261
- Filtering to non-synonymous keeps 23786562 variants (~13% of all variants)
- Filtering to variant QC pass keeps 16319240 variants (~69% of all non-synonymous variants)
v2 HT:
- Total count = 17209972
- Non-syn = 7849006 (~46%)
- QC pass = 6923424 (~88% non-syn)
The proportion of variants retained in v2 is much higher, which makes me wonder: is there something off about the most_severe_consequence
field in the v4 HT? Maybe a helpful debugging step is to rerun after filtering v4 to MANE/canonical transcripts only and v2 to canonical only?
additional thoughts from doing a bit more digging -- we should consider moving
Comparing the above to:
Other high impact csqs from table:
No other medium impact csqs from table I also reran without the variant QC filter: edit to add I just realized I linked the GRCh37 site earlier; here is the updated link https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html |
posted about the consequences in gnomad_qc: https://atgu.slack.com/archives/CRA2TKTV0/p1713817822392309 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more minor comments (none of which will fix the counts)
hl.any( | ||
ht.vep.transcript_consequences.consequence_terms.map( | ||
lambda x: ~hl.literal( | ||
CSQ_NON_CODING + CSQ_CODING_LOW_IMPACT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know I suggested this, but there are fewer csq terms in NS_CONSEQ_TERMS
than in CSQ_NON_CODING + CSQ_CODING_LOW_IMPACT
, so if these return the same results, using NS_CONSEQ_TERMS
to filter is probably cleaner
t_variants = ht.count() | ||
if can_only: | ||
logger.info("Filtering to only MANE Select and canonical transcripts") | ||
ht = ht.explode(ht.vep.transcript_consequences) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should be able to filter to canonical transcripts using https://github.com/broadinstitute/gnomad_methods/blob/main/gnomad/utils/vep.py#L393 (something like filter_vep_to_canonical_transcripts(ht, filter_empty_csq=True)
) without having to explode here (which would also remove the distinct
downstream)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not know we had this, neat!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more minor comments
"keep only non-synonymous variants..." | ||
) | ||
ht = filter_vep_to_canonical_transcripts(ht, filter_empty_csq=True) | ||
# All MANE select transcripts in v4 are also the canonical transcript |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment that this comment makes a little more sense above the call to filter_vep_to_canonical_transcripts
" of total variants)", | ||
grp_id, | ||
threshold, | ||
version, | ||
t_ht.count(), | ||
t_ht.count() / p_ns_variants * 100, | ||
# t_ht.count() / p_ns_variants * 100, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# t_ht.count() / p_ns_variants * 100, |
ht = ht.filter( | ||
hl.any( | ||
ht.vep.transcript_consequences.consequence_terms.map( | ||
lambda x: hl.literal(NS_CONSEQ_TERMS).contains(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-reading this and realizing maybe it's better to use filter_vep_transcript_csqs
to simultaneously filter to canonical transcripts and to non-synonymous variants
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I restructured this and actually tried using this function but hit a bug: broadinstitute/gnomad_methods#707
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooh. I guess this function needs to get run first? https://github.com/broadinstitute/gnomad_methods/blob/019865838f993841a540e0b29d8d2f3b1333b1b8/gnomad/utils/vep.py#L273
…variants by aggregating over all gen anc grp AFs instead of just grpmax
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one minor comment but LGTM
@ch-kr , I updated to use process_consequences as you suggested so just re-requesting review as it chopped a bit of code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor comments but LGTM!
t_ht = t_ht.filter( | ||
hl.literal(DIVERSE_GRPS).contains(t_ht.grpmax_ga) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that this code block could likely get moved above for efficiency since we no longer need to log the counts after every filter:
t_ht = filter_to_threshold(
p_ht, threshold, version=version, eur_filter=eur_filter
)
t_ht = t_ht.filter(
hl.literal(DIVERSE_GRPS).contains(t_ht.grpmax_ga)
)
t_ht = t_ht.checkpoint(
f"gs://gnomad-tmp-4day/grpmax_comps_{version}_{grp_id}_{threshold}.ht",
overwrite=True,
)
) | ||
if args.canonical: | ||
vep_csq_expr = ht.vep.worst_csq_for_variant_canonical |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooh, nice
v2_dict_index = ( | ||
version_dict["v2"][data_subset] | ||
if grpmax_counts and eur_filter | ||
else version_dict["v2"] | ||
) | ||
v4_dict_index = ( | ||
version_dict["v4"][data_subset] | ||
if grpmax_counts and eur_filter | ||
else version_dict["v4"] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not important, but you could do something like
versions = ["v2, "v4"]
for version in versions:
version_dict_idx = (
version_dict[version][data_subset]
if grpmax_counts and eur_filter
else version_dict[version]
)
and then below in the for loop
for threshold in AF_THRESHOLDS:
table = []
for grp in grps:
v2_val, v4_val = version_dict_idx[:]
though this does remove the .get
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I didnt love the construction of this piece but youll hit a key error for mid without get
msg = "" | ||
ht = process_consequences(ht, has_polyphen=False) | ||
|
||
if csq_terms: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't need an if, right? csq_terms
should always be defined based on the logic above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔎
) | ||
else: | ||
create_table( | ||
version_dict, grpmax_counts=grpmax_counts, non_syn_only=non_syn_only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
version_dict, grpmax_counts=grpmax_counts, non_syn_only=non_syn_only | |
version_dict, non_syn_only=non_syn_only |
also very minor -- just seems a little clearer to not specify either grpmax_counts
or eur_filter
based on the if/else logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I kept this in so you can run it without the eur filter and still get overall grpmax counts if you want, or if you dont care to deduplicate, you can still get the AF>threshold counts
This code generates tables by European grouping as well as the AF threshold. I'm not sold on the location of this script so thoughts welcome!