Improve intronic variant description handling with differing Alts #657
As a starting point, we currently get:
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NW_011332691.1(NM_012234.6):c.335+1C>G'
genome_build = 'GRCh38'
select_transcripts = 'all'
validate = vval.validate(variant, genome_build, select_transcripts)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
"flag": "warning",
"metadata": {
"variantvalidator_hgvs_version": "2.2.0",
"variantvalidator_version": "2.2.1.dev709+g6340024",
"vvdb_version": "vvdb_2024_8",
"vvseqrepo_db": "VV_SR_2024_09/master",
"vvta_version": "vvta_2024_09"
},
"validation_warning_1": {
"alt_genomic_loci": [],
"annotations": {},
"gene_ids": {},
"gene_symbol": "",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "",
"tlr": ""
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "",
"primary_assembly_loci": {},
"reference_sequence_records": "",
"refseqgene_context_intronic_sequence": "",
"rna_variant_descriptions": null,
"selected_assembly": "GRCh38",
"submitted_variant": "NW_011332691.1(NM_012234.6):c.335+1C>G",
"transcript_description": "",
"validation_warnings": [
"NW_011332691.1:g.53828G>C: Variant reference (G) does not agree with reference sequence (C)"
],
"variant_exonic_positions": null
}
}
So there is code that will handle this already; at least it is a place to get started. I have a feeling the +1 is not being handled as an intron offset at the moment, but as a +1 position. I am also unsure of the exon boundary.
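To make the distinction concrete, here is a minimal sketch of splitting such a description into its genomic reference, transcript, nearest exonic base, and intron offset. The regular expression and the function name `split_intronic_description` are illustrative assumptions, not VariantValidator's own parser.

```python
import re

# Hypothetical pattern for descriptions such as
# "NW_011332691.1(NM_012234.6):c.335+1C>G". This is an illustration only,
# not the parser VariantValidator uses internally.
INTRONIC_RE = re.compile(
    r"^(?P<genomic>[A-Z]{2}_\d+\.\d+)"        # genomic reference, e.g. NW_011332691.1
    r"\((?P<transcript>[A-Z]{2}_\d+\.\d+)\)"  # transcript in brackets
    r":c\.(?P<base>[-*]?\d+)"                 # nearest exonic base, e.g. 335
    r"(?P<offset>[+-]\d+)"                    # intron offset, e.g. +1
    r"(?P<edit>.+)$"                          # whatever follows, e.g. C>G
)


def split_intronic_description(description):
    """Return the parts of an intronic description, or None if it does not match."""
    match = INTRONIC_RE.match(description)
    if match is None:
        return None
    parts = match.groupdict()
    parts["offset"] = int(parts["offset"])  # "+1" becomes the integer 1
    return parts


parts = split_intronic_description("NW_011332691.1(NM_012234.6):c.335+1C>G")
print(parts)
```

With the offset kept separate from the base position, `c.335+1` reads as "one base into the intron after c.335" rather than as position 336.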
OK, managed to get one through:
{
"NM_012234.6:c.335+1G>C": {
"alt_genomic_loci": [
{
"grch38": {
"hgvs_genomic_description": "NW_011332691.1:g.53828C>G",
"vcf": {
"alt": "G",
"chr": "HG126_PATCH",
"pos": "53828",
"ref": "C"
}
}
},
{
"hg38": {
"hgvs_genomic_description": "NW_011332691.1:g.53828C>G",
"vcf": {
"alt": "G",
"chr": "chr3_KN538364v1_fix",
"pos": "53828",
"ref": "C"
}
}
}
],
"annotations": {
"chromosome": "3",
"db_xref": {
"CCDS": null,
"ensemblgene": null,
"hgnc": "HGNC:10480",
"ncbigene": "23429",
"select": "RefSeq"
},
"ensembl_select": false,
"mane_plus_clinical": false,
"mane_select": false,
"map": "3p13",
"note": "RING1 and YY1 binding protein",
"refseq_select": true,
"variant": "0"
},
"gene_ids": {
"ccds_ids": [],
"ensembl_gene_id": "ENSG00000163602",
"entrez_gene_id": "23429",
"hgnc_id": "HGNC:10480",
"omim_id": [
"607535"
],
"ucsc_id": "uc003dpe.4"
},
"gene_symbol": "RYBP",
"genome_context_intronic_sequence": "NW_011332691.1(NM_012234.6):c.335+1G>C",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "NP_036366.3:p.?",
"tlr": "NP_036366.3:p.?"
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "NM_012234.6:c.335+1G>C",
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000003.11:g.72427674=",
"vcf": {
"alt": "G",
"chr": "3",
"pos": "72427674",
"ref": "G"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000003.12:g.72378523=",
"vcf": {
"alt": "G",
"chr": "3",
"pos": "72378523",
"ref": "G"
}
},
"hg19": {
"hgvs_genomic_description": "NC_000003.11:g.72427674=",
"vcf": {
"alt": "G",
"chr": "chr3",
"pos": "72427674",
"ref": "G"
}
},
"hg38": {
"hgvs_genomic_description": "NC_000003.12:g.72378523=",
"vcf": {
"alt": "G",
"chr": "chr3",
"pos": "72378523",
"ref": "G"
}
}
},
"reference_sequence_records": {
"protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_036366.3",
"transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_012234.6"
},
"refseqgene_context_intronic_sequence": "",
"rna_variant_descriptions": null,
"selected_assembly": "GRCh38",
"submitted_variant": "NW_011332691.1(NM_012234.6):c.335+1G>C",
"transcript_description": "Homo sapiens RING1 and YY1 binding protein (RYBP), mRNA",
"validation_warnings": [
"TranscriptVersionWarning: A more recent version of the selected reference sequence NM_012234.6 is available for genome build GRCh38 (NM_012234.7)"
],
"variant_exonic_positions": {
"NC_000003.12": {
"end_exon": "cannot be calculated",
"start_exon": "cannot be calculated"
}
}
},
"flag": "gene_variant",
"metadata": {
"variantvalidator_hgvs_version": "2.2.0",
"variantvalidator_version": "2.2.1.dev709+g6340024",
"vvdb_version": "vvdb_2024_8",
"vvseqrepo_db": "VV_SR_2024_09/master",
"vvta_version": "vvta_2024_09"
}
}
Will see what it does to the tests, then add this test and try more variants. Need to look at the variant exonic positions too.
@John-F-Wagstaff The tests now all pass, and the output above is what we get for this variant. We do not currently process the exonic positions for ALTs; we may add this at a later date. I'll check a few more variant types. Looking at the code, this will only be applicable to intronic variants. This is because, for VV, the NC_(NM_) is totally unnecessary for exonic variants, since we map to all available chrs and alts and, if possible, genes.
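For checking output like this by eye, a small helper that gathers every genomic description per build can be handy. This is only a sketch: it assumes the dictionary layout shown in the output above (`primary_assembly_loci` as a dict keyed by build, `alt_genomic_loci` as a list of single-build dicts), and `genomic_descriptions` is a hypothetical name, not part of VariantValidator.

```python
def genomic_descriptions(record):
    """Map each build to the HGVS genomic descriptions reported for it."""
    found = {}
    # Primary assembly loci: {build: {"hgvs_genomic_description": ..., "vcf": ...}}
    for build, locus in record.get("primary_assembly_loci", {}).items():
        found.setdefault(build, []).append(locus["hgvs_genomic_description"])
    # Alt loci: a list of {build: {...}} entries
    for entry in record.get("alt_genomic_loci", []):
        for build, locus in entry.items():
            found.setdefault(build, []).append(locus["hgvs_genomic_description"])
    return found


# Trimmed copy of the record above, for illustration
record = {
    "alt_genomic_loci": [
        {"grch38": {"hgvs_genomic_description": "NW_011332691.1:g.53828C>G"}},
    ],
    "primary_assembly_loci": {
        "grch38": {"hgvs_genomic_description": "NC_000003.12:g.72378523="},
    },
}
print(genomic_descriptions(record))
```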
Deletions now working:
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NW_011332691.1(NM_012234.6):c.335+1del'
genome_build = 'GRCh38'
select_transcripts = 'all'
validate = vval.validate(variant, genome_build, select_transcripts)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
"NM_012234.6:c.335+1del": {
"alt_genomic_loci": [
{
"grch38": {
"hgvs_genomic_description": "NW_011332691.1:g.53828del",
"vcf": {
"alt": "A",
"chr": "HG126_PATCH",
"pos": "53827",
"ref": "AC"
}
}
},
{
"hg38": {
"hgvs_genomic_description": "NW_011332691.1:g.53828del",
"vcf": {
"alt": "A",
"chr": "chr3_KN538364v1_fix",
"pos": "53827",
"ref": "AC"
}
}
}
],
"annotations": {
"chromosome": "3",
"db_xref": {
"CCDS": null,
"ensemblgene": null,
"hgnc": "HGNC:10480",
"ncbigene": "23429",
"select": "RefSeq"
},
"ensembl_select": false,
"mane_plus_clinical": false,
"mane_select": false,
"map": "3p13",
"note": "RING1 and YY1 binding protein",
"refseq_select": true,
"variant": "0"
},
"gene_ids": {
"ccds_ids": [],
"ensembl_gene_id": "ENSG00000163602",
"entrez_gene_id": "23429",
"hgnc_id": "HGNC:10480",
"omim_id": [
"607535"
],
"ucsc_id": "uc003dpe.4"
},
"gene_symbol": "RYBP",
"genome_context_intronic_sequence": "NW_011332691.1(NM_012234.6):c.335+1del",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "NP_036366.3:p.?",
"tlr": "NP_036366.3:p.?"
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "NM_012234.6:c.335+1del",
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000003.11:g.72427675del",
"vcf": {
"alt": "A",
"chr": "3",
"pos": "72427673",
"ref": "AG"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000003.12:g.72378524del",
"vcf": {
"alt": "A",
"chr": "3",
"pos": "72378522",
"ref": "AG"
}
},
"hg19": {
"hgvs_genomic_description": "NC_000003.11:g.72427675del",
"vcf": {
"alt": "A",
"chr": "chr3",
"pos": "72427673",
"ref": "AG"
}
},
"hg38": {
"hgvs_genomic_description": "NC_000003.12:g.72378524del",
"vcf": {
"alt": "A",
"chr": "chr3",
"pos": "72378522",
"ref": "AG"
}
}
},
"reference_sequence_records": {
"protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_036366.3",
"transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_012234.6"
},
"refseqgene_context_intronic_sequence": "",
"rna_variant_descriptions": null,
"selected_assembly": "GRCh38",
"submitted_variant": "NW_011332691.1(NM_012234.6):c.335+1del",
"transcript_description": "Homo sapiens RING1 and YY1 binding protein (RYBP), mRNA",
"validation_warnings": [
"TranscriptVersionWarning: A more recent version of the selected reference sequence NM_012234.6 is available for genome build GRCh38 (NM_012234.7)"
],
"variant_exonic_positions": {
"NC_000003.12": {
"end_exon": "cannot be calculated",
"start_exon": "cannot be calculated"
}
}
},
"flag": "gene_variant",
"metadata": {
"variantvalidator_hgvs_version": "2.2.0",
"variantvalidator_version": "2.2.1.dev709+g6340024",
"vvdb_version": "vvdb_2024_8",
"vvseqrepo_db": "VV_SR_2024_09/master",
"vvta_version": "vvta_2024_09"
}
}
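The VCF records for the patch above place the deletion one base to the left of the HGVS position and carry an anchor base (`pos` 53827, `ref` AC, `alt` A for `g.53828del`). A minimal sketch of that anchoring step follows, using a made-up toy sequence. It does no shifting or normalisation, and `deletion_to_vcf` is a hypothetical helper rather than VariantValidator code.

```python
def deletion_to_vcf(sequence, start, end):
    """Return (pos, ref, alt) for deleting the 1-based inclusive range start..end.

    The VCF record is anchored on the base immediately before the deletion.
    No left- or right-alignment is attempted.
    """
    if start < 2 or end < start or end > len(sequence):
        raise ValueError("deletion must lie inside the sequence and leave an anchor base")
    anchor = sequence[start - 2]        # base just before the deleted range
    deleted = sequence[start - 1:end]   # the deleted bases themselves
    return start - 1, anchor + deleted, anchor


# Toy sequence (an assumption for illustration): position 3 is A, position 4 is C
print(deletion_to_vcf("TTAC", 4, 4))
```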
delins are working:
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NW_011332691.1(NM_012234.6):c.335+1delinsAT'
genome_build = 'GRCh38'
select_transcripts = 'all'
validate = vval.validate(variant, genome_build, select_transcripts)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
"NM_012234.6:c.335+1delinsAT": {
"alt_genomic_loci": [
{
"grch38": {
"hgvs_genomic_description": "NW_011332691.1:g.53828delinsAT",
"vcf": {
"alt": "AT",
"chr": "HG126_PATCH",
"pos": "53828",
"ref": "C"
}
}
},
{
"hg38": {
"hgvs_genomic_description": "NW_011332691.1:g.53828delinsAT",
"vcf": {
"alt": "AT",
"chr": "chr3_KN538364v1_fix",
"pos": "53828",
"ref": "C"
}
}
}
],
"annotations": {
"chromosome": "3",
"db_xref": {
"CCDS": null,
"ensemblgene": null,
"hgnc": "HGNC:10480",
"ncbigene": "23429",
"select": "RefSeq"
},
"ensembl_select": false,
"mane_plus_clinical": false,
"mane_select": false,
"map": "3p13",
"note": "RING1 and YY1 binding protein",
"refseq_select": true,
"variant": "0"
},
"gene_ids": {
"ccds_ids": [],
"ensembl_gene_id": "ENSG00000163602",
"entrez_gene_id": "23429",
"hgnc_id": "HGNC:10480",
"omim_id": [
"607535"
],
"ucsc_id": "uc003dpe.4"
},
"gene_symbol": "RYBP",
"genome_context_intronic_sequence": "NW_011332691.1(NM_012234.6):c.335+1delinsAT",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "NP_036366.3:p.?",
"tlr": "NP_036366.3:p.?"
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "NM_012234.6:c.335+1delinsAT",
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000003.11:g.72427674delinsAT",
"vcf": {
"alt": "AT",
"chr": "3",
"pos": "72427674",
"ref": "G"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000003.12:g.72378523delinsAT",
"vcf": {
"alt": "AT",
"chr": "3",
"pos": "72378523",
"ref": "G"
}
},
"hg19": {
"hgvs_genomic_description": "NC_000003.11:g.72427674delinsAT",
"vcf": {
"alt": "AT",
"chr": "chr3",
"pos": "72427674",
"ref": "G"
}
},
"hg38": {
"hgvs_genomic_description": "NC_000003.12:g.72378523delinsAT",
"vcf": {
"alt": "AT",
"chr": "chr3",
"pos": "72378523",
"ref": "G"
}
}
},
"reference_sequence_records": {
"protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_036366.3",
"transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_012234.6"
},
"refseqgene_context_intronic_sequence": "",
"rna_variant_descriptions": null,
"selected_assembly": "GRCh38",
"submitted_variant": "NW_011332691.1(NM_012234.6):c.335+1delinsAT",
"transcript_description": "Homo sapiens RING1 and YY1 binding protein (RYBP), mRNA",
"validation_warnings": [
"TranscriptVersionWarning: A more recent version of the selected reference sequence NM_012234.6 is available for genome build GRCh38 (NM_012234.7)"
],
"variant_exonic_positions": {
"NC_000003.12": {
"end_exon": "cannot be calculated",
"start_exon": "cannot be calculated"
}
}
},
"flag": "gene_variant",
"metadata": {
"variantvalidator_hgvs_version": "2.2.0",
"variantvalidator_version": "2.2.1.dev709+g6340024",
"vvdb_version": "vvdb_2024_8",
"vvseqrepo_db": "VV_SR_2024_09/master",
"vvta_version": "vvta_2024_09"
}
}
inv is working:
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NW_011332691.1(NM_012234.6):c.335+1_335+6inv'
genome_build = 'GRCh38'
select_transcripts = 'all'
validate = vval.validate(variant, genome_build, select_transcripts)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
"NM_012234.6:c.335+1_335+6inv": {
"alt_genomic_loci": [
{
"grch38": {
"hgvs_genomic_description": "NW_011332691.1:g.53823_53828inv",
"vcf": {
"alt": "GTAAGT",
"chr": "HG126_PATCH",
"pos": "53823",
"ref": "ACTTAC"
}
}
},
{
"hg38": {
"hgvs_genomic_description": "NW_011332691.1:g.53823_53828inv",
"vcf": {
"alt": "GTAAGT",
"chr": "chr3_KN538364v1_fix",
"pos": "53823",
"ref": "ACTTAC"
}
}
}
],
"annotations": {
"chromosome": "3",
"db_xref": {
"CCDS": null,
"ensemblgene": null,
"hgnc": "HGNC:10480",
"ncbigene": "23429",
"select": "RefSeq"
},
"ensembl_select": false,
"mane_plus_clinical": false,
"mane_select": false,
"map": "3p13",
"note": "RING1 and YY1 binding protein",
"refseq_select": true,
"variant": "0"
},
"gene_ids": {
"ccds_ids": [],
"ensembl_gene_id": "ENSG00000163602",
"entrez_gene_id": "23429",
"hgnc_id": "HGNC:10480",
"omim_id": [
"607535"
],
"ucsc_id": "uc003dpe.4"
},
"gene_symbol": "RYBP",
"genome_context_intronic_sequence": "NW_011332691.1(NM_012234.6):c.335+1_335+6inv",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "NP_036366.3:p.?",
"tlr": "NP_036366.3:p.?"
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "NM_012234.6:c.335+1_335+6inv",
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000003.11:g.72427669_72427674inv",
"vcf": {
"alt": "CTCATC",
"chr": "3",
"pos": "72427669",
"ref": "GATGAG"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000003.12:g.72378518_72378523inv",
"vcf": {
"alt": "CTCATC",
"chr": "3",
"pos": "72378518",
"ref": "GATGAG"
}
},
"hg19": {
"hgvs_genomic_description": "NC_000003.11:g.72427669_72427674inv",
"vcf": {
"alt": "CTCATC",
"chr": "chr3",
"pos": "72427669",
"ref": "GATGAG"
}
},
"hg38": {
"hgvs_genomic_description": "NC_000003.12:g.72378518_72378523inv",
"vcf": {
"alt": "CTCATC",
"chr": "chr3",
"pos": "72378518",
"ref": "GATGAG"
}
}
},
"reference_sequence_records": {
"protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_036366.3",
"transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_012234.6"
},
"refseqgene_context_intronic_sequence": "",
"rna_variant_descriptions": null,
"selected_assembly": "GRCh38",
"submitted_variant": "NW_011332691.1(NM_012234.6):c.335+1_335+6inv",
"transcript_description": "Homo sapiens RING1 and YY1 binding protein (RYBP), mRNA",
"validation_warnings": [
"TranscriptVersionWarning: A more recent version of the selected reference sequence NM_012234.6 is available for genome build GRCh38 (NM_012234.7)"
],
"variant_exonic_positions": {
"NC_000003.12": {
"end_exon": "cannot be calculated",
"start_exon": "cannot be calculated"
}
}
},
"flag": "gene_variant",
"metadata": {
"variantvalidator_hgvs_version": "2.2.0",
"variantvalidator_version": "2.2.1.dev709+g6340024",
"vvdb_version": "vvdb_2024_8",
"vvseqrepo_db": "VV_SR_2024_09/master",
"vvta_version": "vvta_2024_09"
}
}
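The `ref` and `alt` alleles in the inversion records above are reverse complements of each other, and the patch and primary assembly carry different sequence over the same six intronic bases. A short sketch to check this, where `reverse_complement` is a generic helper written for illustration and not a VariantValidator function:

```python
# Translation table mapping each base to its complement
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of an uppercase DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


# ref alleles taken from the VCF records above
print(reverse_complement("ACTTAC"))  # patch NW_011332691.1
print(reverse_complement("GATGAG"))  # primary assembly NC_000003.12
```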
…nate alignments in patches vs the primary assembly. Issue #657
Is your feature request related to a problem? Please describe.
We end up prioritising the main genomic chromosome for annotation purposes, even for RNA/cDNA sequences that have been paired with a mapping. This will cause problems when exon positions or intronic sequence differ between the main genome's chromosomes and alt sequences, whether due to large-scale (not patch-compatible) fixes or haplotypes. This has impacted annotation of RYBP, even with the latest mappings, but likely affects other genes too.
Describe the solution you'd like
When a specific target genomic reference sequence is included in the description of an RNA/cDNA transcript variant, we need to be able to direct any annotation checks to the specified reference, not the main chromosomes for the chosen genome. We also want to be able to reduce the output data to match the input specifier.
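As a rough illustration of reducing the output to match the input specifier, the sketch below keeps only the loci reported on the genomic accession the user supplied. It assumes the output layout VariantValidator currently returns (`alt_genomic_loci` as a list of single-build dicts, `primary_assembly_loci` as a dict keyed by build); `loci_for_reference` is a hypothetical name and not an existing API.

```python
def loci_for_reference(record, accession):
    """Keep only the loci whose genomic description is on `accession`."""
    prefix = accession + ":"
    kept = {}
    for entry in record.get("alt_genomic_loci", []):
        for build, locus in entry.items():
            if locus["hgvs_genomic_description"].startswith(prefix):
                kept[build] = locus
    for build, locus in record.get("primary_assembly_loci", {}).items():
        if locus["hgvs_genomic_description"].startswith(prefix):
            kept[build] = locus
    return kept


# Trimmed example record, for illustration only
record = {
    "alt_genomic_loci": [
        {"grch38": {"hgvs_genomic_description": "NW_011332691.1:g.53828C>G"}},
        {"hg38": {"hgvs_genomic_description": "NW_011332691.1:g.53828C>G"}},
    ],
    "primary_assembly_loci": {
        "grch38": {"hgvs_genomic_description": "NC_000003.12:g.72378523="},
    },
}
print(loci_for_reference(record, "NW_011332691.1"))
```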
Describe alternatives you've considered
Reducing the output data to match the input specifier is probably a lower priority than not warning users when given valid input, and might be left to a later, lower-priority issue.
Additional context
We may also want to handle differing intron-exon boundaries depending on the target ref, even when the target ref is unspecified. Currently we fail if the main chromosome set is present and does not match, and we don't warn if there are differences either. For example, return a warning like 'the intron-exon boundary you have specified for this transcript <tx_id> differs between genomic mappings and the given matches , but not , you will want to describe this variant with a genomic reference specifier like "<tx_id>():<pos_edit>" instead of "" to avoid ambiguity'. The most main-chromosome-like matching reference version of the correction might also be put into the output as a correction?
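One simple way to spot such differences, sketched under the assumption that the VCF `ref` alleles in the current output can be compared directly across genomic references. The function name and the warning wording are hypothetical, and a real check would also need to account for strand and for shifted boundaries.

```python
def reference_allele_warning(record):
    """Return a warning string if the reported ref alleles differ, else None."""
    loci = list(record.get("primary_assembly_loci", {}).values())
    for entry in record.get("alt_genomic_loci", []):
        loci.extend(entry.values())
    refs = {}
    for locus in loci:
        # The accession is the part of the description before the first colon
        accession = locus["hgvs_genomic_description"].split(":", 1)[0]
        refs[accession] = locus["vcf"]["ref"]
    if len(set(refs.values())) <= 1:
        return None
    detail = ", ".join(f"{acc} has {ref}" for acc, ref in sorted(refs.items()))
    return "Reference alleles differ between genomic mappings: " + detail


# Trimmed example record, for illustration only
record = {
    "alt_genomic_loci": [
        {"grch38": {"hgvs_genomic_description": "NW_011332691.1:g.53828C>G",
                    "vcf": {"ref": "C"}}},
    ],
    "primary_assembly_loci": {
        "grch38": {"hgvs_genomic_description": "NC_000003.12:g.72378523=",
                   "vcf": {"ref": "G"}},
    },
}
print(reference_allele_warning(record))
```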