grpmax vs European AF split nonsynonymous variant analysis #597

mike-w-wilson · 2024-04-10T18:05:24Z

This code generates tables by European grouping as well as the AF threshold. I'm not sold on the location of this script so thoughts welcome!

ch-kr

one debugging suggestion and some minor comments

ch-kr · 2024-04-19T13:31:09Z

gnomad_qc/v4/analyses/grpmax_comps.py

+            if version == "v4":
+                grpmax_ga_expr = t_ht.grpmax.gnomad.gen_anc
+            else:
+                grpmax_ga_expr = t_ht.popmax[0].pop


do you need this twice?

Sadly yes, because the filter above changes the table, so the expr assignment in L108 and L110 points to the old table and breaks Hail. There may be another way that my Friday brain isn't thinking of, though...

oh yeah I totally forgot -- good call. I guess the alternative is probably something like

if version == "v4":
    t_ht = t_ht.annotate(grpmax_ga=t_ht.grpmax.gnomad.gen_anc)
else:
    t_ht = t_ht.annotate(grpmax_ga=t_ht.popmax[0].pop)
t_ht = t_ht.filter(DIVERSE_GRPS.contains(t_ht.grpmax_ga))

Yes -- this is better

ch-kr · 2024-04-22T17:44:41Z

gnomad_qc/v4/analyses/grpmax_comps.py

+    results_by_eur_grping = {}
+    # Filter to only non-synonymous terms CSQ_CODING_HIGH_IMPACT +
+    # CSQ_CODING_MEDIUM_IMPACT
+    ht = ht.filter(NS_CONSEQ_TERMS.contains(ht.vep.most_severe_consequence))


I added some print statements to check these filters:

v4 HT:
- Total count = 183717261
- Filtering to non-synonymous keeps 23786562 variants (~13% of all variants)
- Filtering to variant QC pass keeps 16319240 variants (~69% of all non-synonymous variants)

v2 HT:
- Total count = 17209972
- Non-syn = 7849006 (~46%)
- QC pass = 6923424 (~88% non-syn)

The proportion of variants retained in v2 is much higher, which makes me wonder: is there something off about the most_severe_consequence field in the v4 HT? Maybe a helpful debugging step is to rerun after filtering v4 to MANE/canonical transcripts only and v2 to canonical only?
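For anyone who wants to re-check the percentages quoted above, here is a small arithmetic sketch in plain Python. The counts are copied from this comment; the `retained` helper and the dictionary layout are made up for illustration and are not part of grpmax_comps.py.

```python
# Counts copied from the comment above; helper name and dict layout are illustrative only.
counts = {
    "v4": {"total": 183717261, "non_syn": 23786562, "qc_pass": 16319240},
    "v2": {"total": 17209972, "non_syn": 7849006, "qc_pass": 6923424},
}


def retained(kept: int, before: int) -> float:
    """Return the percentage of variants kept by one filter step."""
    return kept / before * 100


for version, c in counts.items():
    ns_pct = retained(c["non_syn"], c["total"])
    qc_pct = retained(c["qc_pass"], c["non_syn"])
    print(f"{version}: non-syn {ns_pct:.1f}% of total, QC pass {qc_pct:.1f}% of non-syn")
```

Keeping the denominators explicit (total for the first step, non-synonymous for the second) makes it easier to compare v2 and v4 on the same footing.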

ch-kr · 2024-04-22T18:36:27Z

additional thoughts from doing a bit more digging -- we should consider moving start_lost to CSQ_CODING_HIGH_IMPACT and removing splice_region_variant from CSQ_CODING_MEDIUM_IMPACT based on https://grch37.ensembl.org/info/genome/variation/prediction/predicted_data.html.

CSQ_CODING_HIGH_IMPACT = [
    "transcript_ablation",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
]

CSQ_CODING_MEDIUM_IMPACT = [
    "start_lost",  # new in v81
    "initiator_codon_variant",  # deprecated
    "transcript_amplification",
    "inframe_insertion",
    "inframe_deletion",
    "missense_variant",
    "protein_altering_variant",  # new in v79
    "splice_region_variant",
]

Comparing the above to:
https://grch37.ensembl.org/info/genome/variation/prediction/predicted_data.html

transcript_ablation: HIGH
splice_acceptor_variant: HIGH
splice_donor_variant: HIGH
stop_gained: HIGH
frameshift_variant: HIGH
stop_lost: HIGH
start_lost: now HIGH (was MEDIUM based on our consequences)
transcript_amplification: now HIGH (was MEDIUM)
inframe_insertion: MEDIUM
inframe_deletion: MEDIUM
missense_variant: MEDIUM
protein_altering_variant: MEDIUM
splice_region_variant: LOW (was MEDIUM)

Other high impact csqs from table:

feature_elongation
feature_truncation

No other medium impact csqs from table
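The comparison above can also be done mechanically. The sketch below uses the two lists exactly as quoted in this comment and the impact ratings as transcribed here (only the terms listed above, not the full Ensembl table, and using this comment's MEDIUM label). The `ENSEMBL_IMPACT` dict and the variable names are illustrative assumptions, not anything defined in gnomad_methods.

```python
# Lists as quoted in this comment.
CSQ_CODING_HIGH_IMPACT = [
    "transcript_ablation",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
]
CSQ_CODING_MEDIUM_IMPACT = [
    "start_lost",
    "initiator_codon_variant",
    "transcript_amplification",
    "inframe_insertion",
    "inframe_deletion",
    "missense_variant",
    "protein_altering_variant",
    "splice_region_variant",
]

# Ratings as transcribed in this comment (hypothetical dict, partial table).
ENSEMBL_IMPACT = {
    "transcript_ablation": "HIGH",
    "splice_acceptor_variant": "HIGH",
    "splice_donor_variant": "HIGH",
    "stop_gained": "HIGH",
    "frameshift_variant": "HIGH",
    "stop_lost": "HIGH",
    "start_lost": "HIGH",
    "transcript_amplification": "HIGH",
    "inframe_insertion": "MEDIUM",
    "inframe_deletion": "MEDIUM",
    "missense_variant": "MEDIUM",
    "protein_altering_variant": "MEDIUM",
    "splice_region_variant": "LOW",
}

# Impact tier each term currently has in our lists.
current = {term: "HIGH" for term in CSQ_CODING_HIGH_IMPACT}
current.update({term: "MEDIUM" for term in CSQ_CODING_MEDIUM_IMPACT})

# Terms whose tier differs from the transcribed rating.
mismatches = {
    term: (current[term], ENSEMBL_IMPACT[term])
    for term in current
    if term in ENSEMBL_IMPACT and current[term] != ENSEMBL_IMPACT[term]
}
# Terms in our lists with no rating in the transcribed table.
unrated = [term for term in current if term not in ENSEMBL_IMPACT]

for term, (ours, ensembl) in mismatches.items():
    print(f"{term}: ours={ours}, Ensembl={ensembl}")
print("not rated in the transcribed table:", unrated)
```

This only flags differences for terms already in our lists; terms such as feature_elongation and feature_truncation that are missing from both lists would need a separate check.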

I also reran without the variant QC filter:

Edit to add: I just realized I linked the GRCh37 site earlier; here is the updated link: https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html

ch-kr · 2024-04-22T20:30:45Z

posted about the consequences in gnomad_qc: https://atgu.slack.com/archives/CRA2TKTV0/p1713817822392309

ch-kr

more minor comments (none of which will fix the counts)

ch-kr · 2024-05-29T13:29:56Z

gnomad_qc/v4/analyses/grpmax_comps.py

+            hl.any(
+                ht.vep.transcript_consequences.consequence_terms.map(
+                    lambda x: ~hl.literal(
+                        CSQ_NON_CODING + CSQ_CODING_LOW_IMPACT


I know I suggested this, but there are fewer csq terms in NS_CONSEQ_TERMS than in CSQ_NON_CODING + CSQ_CODING_LOW_IMPACT, so if these return the same results, using NS_CONSEQ_TERMS to filter is probably cleaner
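One caveat on "if these return the same results": keeping anything not on an exclude list and keeping anything on an include list only agree when every consequence term falls into one of the two lists. A plain-Python sketch of that check follows. The term sets are small illustrative subsets rather than the real constants, and the function names are hypothetical.

```python
# Illustrative subsets only; the real constants are longer.
NS_SUBSET = {"stop_gained", "frameshift_variant", "missense_variant"}
EXCLUDED_SUBSET = {"synonymous_variant", "intron_variant", "5_prime_UTR_variant"}


def keep_by_exclusion(terms):
    """Keep a variant if any of its terms is NOT on the exclude list."""
    return any(term not in EXCLUDED_SUBSET for term in terms)


def keep_by_inclusion(terms):
    """Keep a variant if any of its terms IS on the include list."""
    return any(term in NS_SUBSET for term in terms)


# Toy variants mapped to their consequence terms.
variants = {
    "a": ["missense_variant", "intron_variant"],
    "b": ["synonymous_variant"],
    "c": ["regulatory_region_variant"],  # on neither list
}

disagreements = [
    name
    for name, terms in variants.items()
    if keep_by_exclusion(terms) != keep_by_inclusion(terms)
]
print(disagreements)
```

Running the same comparison on the real tables (counting rows where the two filters disagree) would confirm whether switching to NS_CONSEQ_TERMS changes any counts.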

ch-kr · 2024-05-29T14:16:00Z

gnomad_qc/v4/analyses/grpmax_comps.py

+    t_variants = ht.count()
+    if can_only:
+        logger.info("Filtering to only MANE Select and canonical transcripts")
+        ht = ht.explode(ht.vep.transcript_consequences)


you should be able to filter to canonical transcripts using https://github.com/broadinstitute/gnomad_methods/blob/main/gnomad/utils/vep.py#L393 (something like filter_vep_to_canonical_transcripts(ht, filter_empty_csq=True)) without having to explode here (which would also remove the distinct downstream)

Did not know we had this, neat!
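As a rough analogy for why no explode is needed, here is a plain-Python sketch (not Hail, and not the gnomad_methods implementation): the transcript array is filtered in place on each row, so each variant stays on one row and nothing has to be deduplicated afterwards. The row layout and the `filter_to_canonical` name are assumptions for illustration.

```python
# Toy rows: one row per variant, each carrying a list of transcript consequences.
variants = [
    {
        "locus": "1:100",
        "transcript_consequences": [
            {"transcript_id": "T1", "canonical": 1},
            {"transcript_id": "T2", "canonical": 0},
        ],
    },
    {
        "locus": "1:200",
        "transcript_consequences": [
            {"transcript_id": "T3", "canonical": 0},
        ],
    },
]


def filter_to_canonical(rows, filter_empty_csq=True):
    """Keep only canonical transcripts per row; optionally drop rows left empty."""
    out = []
    for row in rows:
        kept = [c for c in row["transcript_consequences"] if c["canonical"] == 1]
        if filter_empty_csq and not kept:
            continue  # no canonical transcript left for this variant
        out.append({**row, "transcript_consequences": kept})
    return out


filtered = filter_to_canonical(variants)
print([row["locus"] for row in filtered])
```

Because the row count never grows, the variant count after filtering can be read directly without a distinct step.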

gnomad_qc/v4/analyses/grpmax_comps.py

Co-authored-by: Katherine Chao <[email protected]>

ch-kr

more minor comments

ch-kr · 2024-05-30T20:00:40Z

gnomad_qc/v4/analyses/grpmax_comps.py

-            "keep only non-synonymous variants..."
-        )
+        ht = filter_vep_to_canonical_transcripts(ht, filter_empty_csq=True)
+        # All MANE select transcripts in v4 are also the canonical transcript


minor comment that this comment makes a little more sense above the call to filter_vep_to_canonical_transcripts

ch-kr · 2024-05-30T20:02:45Z

gnomad_qc/v4/analyses/grpmax_comps.py

                " of total variants)",
                grp_id,
                threshold,
                version,
                t_ht.count(),
-                t_ht.count() / p_ns_variants * 100,
+                # t_ht.count() / p_ns_variants * 100,


Suggested change

# t_ht.count() / p_ns_variants * 100,

ch-kr · 2024-05-30T20:05:59Z

gnomad_qc/v4/analyses/grpmax_comps.py

+        ht = ht.filter(
+            hl.any(
+                ht.vep.transcript_consequences.consequence_terms.map(
+                    lambda x: hl.literal(NS_CONSEQ_TERMS).contains(x)


re-reading this and realizing maybe it's better to use filter_vep_transcript_csqs to simultaneously filter to canonical transcripts and to non-synonymous variants

I restructured this and actually tried using this function but hit a bug: broadinstitute/gnomad_methods#707

ooh. I guess this function needs to get run first? https://github.com/broadinstitute/gnomad_methods/blob/019865838f993841a540e0b29d8d2f3b1333b1b8/gnomad/utils/vep.py#L273

ch-kr

one minor comment but LGTM

gnomad_qc/v4/analyses/grpmax_comps.py

Co-authored-by: Katherine Chao <[email protected]>

mike-w-wilson · 2024-06-20T15:25:47Z

@ch-kr, I updated to use process_consequences as you suggested, so I'm re-requesting review since it chopped a bit of code

ch-kr

some minor comments but LGTM!

ch-kr · 2024-06-20T15:39:21Z

gnomad_qc/v4/analyses/grpmax_comps.py

+                    t_ht = t_ht.filter(
+                        hl.literal(DIVERSE_GRPS).contains(t_ht.grpmax_ga)
+                    )


I just realized that this code block could likely get moved above for efficiency since we no longer need to log the counts after every filter:

t_ht = filter_to_threshold(
    p_ht, threshold, version=version, eur_filter=eur_filter
)
t_ht = t_ht.filter(
    hl.literal(DIVERSE_GRPS).contains(t_ht.grpmax_ga)
)
t_ht = t_ht.checkpoint(
    f"gs://gnomad-tmp-4day/grpmax_comps_{version}_{grp_id}_{threshold}.ht",
    overwrite=True,
)

ch-kr · 2024-06-20T15:45:15Z

gnomad_qc/v4/analyses/grpmax_comps.py

            )
+        if args.canonical:
+            vep_csq_expr = ht.vep.worst_csq_for_variant_canonical


ch-kr · 2024-06-20T16:03:19Z

gnomad_qc/v4/analyses/grpmax_comps.py

+    v2_dict_index = (
+        version_dict["v2"][data_subset]
+        if grpmax_counts and eur_filter
+        else version_dict["v2"]
+    )
+    v4_dict_index = (
+        version_dict["v4"][data_subset]
+        if grpmax_counts and eur_filter
+        else version_dict["v4"]
+    )


not important, but you could do something like

versions = ["v2", "v4"]
for version in versions:
    version_dict_idx = (
        version_dict[version][data_subset]
        if grpmax_counts and eur_filter
        else version_dict[version]
    )

and then below in the for loop

for threshold in AF_THRESHOLDS:
    table = []
    for grp in grps:
        v2_val, v4_val = version_dict_idx[:]

though this does remove the .get

Yeah, I didn't love the construction of this piece, but you'll hit a KeyError for mid without .get
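A minimal sketch of the KeyError concern, assuming (as the reply implies) that one version's counts have no entry for mid. The nested dict and the numbers are invented for illustration and do not come from the analysis.

```python
# Hypothetical per-group counts; "mid" is absent from v2 in this toy example.
version_dict = {
    "v2": {"afr": 10, "amr": 7},
    "v4": {"afr": 25, "amr": 12, "mid": 3},
}

grps = ["afr", "amr", "mid"]
rows = []
for grp in grps:
    # .get returns None for a missing group instead of raising.
    v2_val = version_dict["v2"].get(grp)
    v4_val = version_dict["v4"].get(grp)
    rows.append((grp, v2_val, v4_val))

# Plain indexing on the same data raises for the missing group.
try:
    version_dict["v2"]["mid"]
    raised = False
except KeyError:
    raised = True

print(rows, raised)
```

So a loop over versions works as long as the lookup inside it still goes through .get, or supplies an explicit default.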

ch-kr · 2024-06-20T16:06:54Z

gnomad_qc/v4/analyses/grpmax_comps.py

+        msg = ""
+        ht = process_consequences(ht, has_polyphen=False)
+
+        if csq_terms:


this doesn't need an if, right? csq_terms should always be defined based on the logic above

ch-kr · 2024-06-20T16:08:35Z

gnomad_qc/v4/analyses/grpmax_comps.py

+            )
+    else:
+        create_table(
+            version_dict, grpmax_counts=grpmax_counts, non_syn_only=non_syn_only


Suggested change

version_dict, grpmax_counts=grpmax_counts, non_syn_only=non_syn_only

version_dict, non_syn_only=non_syn_only

also very minor -- just seems a little clearer to not specify either grpmax_counts or eur_filter based on the if/else logic

So I kept this in so you can run it without the eur filter and still get overall grpmax counts if you want, or, if you don't care to deduplicate, you can still get the AF>threshold counts

mike-w-wilson added 3 commits April 9, 2024 15:30

Add script for eur AF grpmax analyses

ee3f983

Add table print out

2d1494c

Adjust output table formatting

766d541

mike-w-wilson added the v4.1 label Apr 10, 2024

mike-w-wilson requested a review from ch-kr April 10, 2024 18:05

mike-w-wilson assigned mike-w-wilson, ch-kr and KoalaQin Apr 10, 2024

mike-w-wilson requested a review from KoalaQin April 12, 2024 13:26

ch-kr reviewed Apr 22, 2024

mike-w-wilson unassigned KoalaQin May 21, 2024

mike-w-wilson removed the request for review from KoalaQin May 24, 2024 13:41

mike-w-wilson added 3 commits May 24, 2024 10:47

Add --canonical-only and drop nfe grp AF defined filter

398f282

Add more loggers and update AF logic

cef0eba

Update filters filtering with optional hard filters filter...filter

622ddc4

mike-w-wilson requested a review from ch-kr May 29, 2024 12:40

ch-kr reviewed May 29, 2024

mike-w-wilson and others added 4 commits May 29, 2024 11:24

Apply suggestions from code review

2389ca9

Co-authored-by: Katherine Chao <[email protected]>

Addressing PR feedback on filtering to canonical and ga_grp ann

d4ccfa5

Add remaining to agg groups

63dae9d

Add all variant option

f572821

ch-kr reviewed May 30, 2024

mike-w-wilson added 2 commits June 3, 2024 11:44

Reorder code to be more readable and allow for duplicate counting of …

403df5f

…variants by aggregating over all gen anc grp AFs instead of just grpmax

Fix table logger by casting to str

093f774

mike-w-wilson requested a review from ch-kr June 3, 2024 18:02

ch-kr approved these changes Jun 3, 2024

Add csq term arg and make eur filter optional for grpmax

b3f4b28

mike-w-wilson and others added 4 commits June 11, 2024 11:56

Add v2 liftover option for comp

60c0510

Update gnomad_qc/v4/analyses/grpmax_comps.py

2fc6627

Co-authored-by: Katherine Chao <[email protected]>

Update to process conseuqences for canonical filtering

466c25a

Update to use process_consequence output fields for canonical logic

8ff42ee

mike-w-wilson requested a review from ch-kr June 20, 2024 15:25

ch-kr approved these changes Jun 20, 2024

Move grp filtering before counts

c831c42

mike-w-wilson merged commit 2445970 into main Jun 20, 2024
4 checks passed

mike-w-wilson deleted the mw/eur_af_grpmax_analysis branch June 20, 2024 17:53
