Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpmax vs European AF split nonsynonymous variant analysis #597

Merged
merged 18 commits into from
Jun 20, 2024

Conversation

mike-w-wilson
Copy link
Contributor

This code generates tables by European grouping as well as the AF threshold. I'm not sold on the location of this script so thoughts welcome!

Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one debugging suggestion and some minor comments

Comment on lines 113 to 116
if version == "v4":
grpmax_ga_expr = t_ht.grpmax.gnomad.gen_anc
else:
grpmax_ga_expr = t_ht.popmax[0].pop
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need this twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly yes because the filter above changes the table so the expr assignment in L108 and L110 points to the old table and breaks hail. There may be another way that my Friday brain isnt thinking of though...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah I totally forgot -- good call. I guess the alternative is probably something like

            if version == "v4":
                t_ht = t_ht.annotate(grpmax_ga=t_ht.grpmax.gnomad.gen_anc)
            else:
                t_ht = t_ht.annotate(grpmax_ga=t_ht.popmax[0].pop)

            t_ht = t_ht.filter(DIVERSE_GRPS.contains(t_ht.grpmax_ga))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- this is better

gnomad_qc/v4/analyses/grpmax_comps.py Outdated Show resolved Hide resolved
gnomad_qc/v4/analyses/grpmax_comps.py Outdated Show resolved Hide resolved
results_by_eur_grping = {}
# Filter to only non-synonymous terms CSQ_CODING_HIGH_IMPACT +
# CSQ_CODING_MEDIUM_IMPACT
ht = ht.filter(NS_CONSEQ_TERMS.contains(ht.vep.most_severe_consequence))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some print statements to check these filters:

v4 HT:
- Total count = 183717261
- Filtering to non-synonymous keeps 23786562 variants (~13% of all variants)
- Filtering to variant QC pass keeps 16319240 variants (~69% of all non-synonymous variants)

v2 HT:
- Total count = 17209972
- Non-syn = 7849006 (~46%)
- QC pass = 6923424 (~88% non-syn)

The proportion of variants retained in v2 is much higher, which makes me wonder: is there something off about the most_severe_consequence field in the v4 HT? Maybe a helpful debugging step is to rerun after filtering v4 to MANE/canonical transcripts only and v2 to canonical only?

gnomad_qc/v4/analyses/grpmax_comps.py Outdated Show resolved Hide resolved
@ch-kr
Copy link
Contributor

ch-kr commented Apr 22, 2024

additional thoughts from doing a bit more digging -- we should consider moving start_lost to CSQ_CODING_HIGH_IMPACT and removing splice_region_variant from CSQ_CODING_MEDIUM_IMPACT based on https://grch37.ensembl.org/info/genome/variation/prediction/predicted_data.html.

CSQ_CODING_HIGH_IMPACT = [
    "transcript_ablation",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
]

CSQ_CODING_MEDIUM_IMPACT = [
    "start_lost",  # new in v81
    "initiator_codon_variant",  # deprecated
    "transcript_amplification",
    "inframe_insertion",
    "inframe_deletion",
    "missense_variant",
    "protein_altering_variant",  # new in v79
    "splice_region_variant",
]

Comparing the above to:
https://grch37.ensembl.org/info/genome/variation/prediction/predicted_data.html

  • transcript_ablation: HIGH
  • splice_acceptor_variant: HIGH
  • splice_donor_variant: HIGH
  • stop_gained: HIGH
  • frameshift_variant: HIGH
  • stop_lost: HIGH
  • start_lost: now HIGH (was MEDIUM based on our consequences)
  • transcript_amplification: now HIGH (was MEDIUM)
  • inframe_insertion: MEDIUM
  • inframe_deletion: MEDIUM
  • missense_variant: MEDIUM
  • protein_altering_variant: MEDIUM
  • splice_region_variant: LOW (was MEDIUM)

Other high impact csqs from table:

  • feature_elongation
  • feature_truncation

No other medium impact csqs from table

I also reran without the variant QC filter:
image
image

edit to add I just realized I linked the GRCh37 site earlier; here is the updated link https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html

@ch-kr
Copy link
Contributor

ch-kr commented Apr 22, 2024

posted about the consequences in gnomad_qc: https://atgu.slack.com/archives/CRA2TKTV0/p1713817822392309

@mike-w-wilson mike-w-wilson removed the request for review from KoalaQin May 24, 2024 13:41
@mike-w-wilson mike-w-wilson requested a review from ch-kr May 29, 2024 12:40
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more minor comments (none of which will fix the counts)

gnomad_qc/v4/analyses/grpmax_comps.py Outdated Show resolved Hide resolved
hl.any(
ht.vep.transcript_consequences.consequence_terms.map(
lambda x: ~hl.literal(
CSQ_NON_CODING + CSQ_CODING_LOW_IMPACT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I suggested this, but there are fewer csq terms in NS_CONSEQ_TERMS than in CSQ_NON_CODING + CSQ_CODING_LOW_IMPACT, so if these return the same results, using NS_CONSEQ_TERMS to filter is probably cleaner

t_variants = ht.count()
if can_only:
logger.info("Filtering to only MANE Select and canonical transcripts")
ht = ht.explode(ht.vep.transcript_consequences)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should be able to filter to canonical transcripts using https://github.com/broadinstitute/gnomad_methods/blob/main/gnomad/utils/vep.py#L393 (something like filter_vep_to_canonical_transcripts(ht, filter_empty_csq=True)) without having to explode here (which would also remove the distinct downstream)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not know we had this, neat!

gnomad_qc/v4/analyses/grpmax_comps.py Outdated Show resolved Hide resolved
gnomad_qc/v4/analyses/grpmax_comps.py Outdated Show resolved Hide resolved
gnomad_qc/v4/analyses/grpmax_comps.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more minor comments

"keep only non-synonymous variants..."
)
ht = filter_vep_to_canonical_transcripts(ht, filter_empty_csq=True)
# All MANE select transcripts in v4 are also the canonical transcript
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment that this comment makes a little more sense above the call to filter_vep_to_canonical_transcripts

" of total variants)",
grp_id,
threshold,
version,
t_ht.count(),
t_ht.count() / p_ns_variants * 100,
# t_ht.count() / p_ns_variants * 100,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# t_ht.count() / p_ns_variants * 100,

ht = ht.filter(
hl.any(
ht.vep.transcript_consequences.consequence_terms.map(
lambda x: hl.literal(NS_CONSEQ_TERMS).contains(x)
Copy link
Contributor

@ch-kr ch-kr May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-reading this and realizing maybe it's better to use filter_vep_transcript_csqs to simultaneously filter to canonical transcripts and to non-synonymous variants

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I restructured this and actually tried using this function but hit a bug: broadinstitute/gnomad_methods#707

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mike-w-wilson mike-w-wilson requested a review from ch-kr June 3, 2024 18:02
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one minor comment but LGTM

gnomad_qc/v4/analyses/grpmax_comps.py Outdated Show resolved Hide resolved
@mike-w-wilson mike-w-wilson requested a review from ch-kr June 20, 2024 15:25
@mike-w-wilson
Copy link
Contributor Author

@ch-kr , I updated to use process_consequences as you suggested so just re-requesting review as it chopped a bit of code

Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor comments but LGTM!

Comment on lines 149 to 151
t_ht = t_ht.filter(
hl.literal(DIVERSE_GRPS).contains(t_ht.grpmax_ga)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that this code block could likely get moved above for efficiency since we no longer need to log the counts after every filter:

t_ht = filter_to_threshold(
    p_ht, threshold, version=version, eur_filter=eur_filter
)
t_ht = t_ht.filter(
    hl.literal(DIVERSE_GRPS).contains(t_ht.grpmax_ga)
)
t_ht = t_ht.checkpoint(
    f"gs://gnomad-tmp-4day/grpmax_comps_{version}_{grp_id}_{threshold}.ht",
    overwrite=True,
)

)
if args.canonical:
vep_csq_expr = ht.vep.worst_csq_for_variant_canonical
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooh, nice

Comment on lines +220 to +229
v2_dict_index = (
version_dict["v2"][data_subset]
if grpmax_counts and eur_filter
else version_dict["v2"]
)
v4_dict_index = (
version_dict["v4"][data_subset]
if grpmax_counts and eur_filter
else version_dict["v4"]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not important, but you could do something like

versions = ["v2, "v4"]
for version in versions:
    version_dict_idx = (
        version_dict[version][data_subset]
        if grpmax_counts and eur_filter
        else version_dict[version]
    )

and then below in the for loop

for threshold in AF_THRESHOLDS:
    table = []
    for grp in grps:
        v2_val, v4_val = version_dict_idx[:]

though this does remove the .get

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I didnt love the construction of this piece but youll hit a key error for mid without get

msg = ""
ht = process_consequences(ht, has_polyphen=False)

if csq_terms:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't need an if, right? csq_terms should always be defined based on the logic above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔎

)
else:
create_table(
version_dict, grpmax_counts=grpmax_counts, non_syn_only=non_syn_only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
version_dict, grpmax_counts=grpmax_counts, non_syn_only=non_syn_only
version_dict, non_syn_only=non_syn_only

also very minor -- just seems a little clearer to not specify either grpmax_counts or eur_filter based on the if/else logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I kept this in so you can run it without the eur filter and still get overall grpmax counts if you want, or if you dont care to deduplicate, you can still get the AF>threshold counts

@mike-w-wilson mike-w-wilson merged commit 2445970 into main Jun 20, 2024
4 checks passed
@mike-w-wilson mike-w-wilson deleted the mw/eur_af_grpmax_analysis branch June 20, 2024 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants