LOVD endpoint: Variants crossing gene boundaries generate "porcessing_error". #173
Comments
Hi @ifokkema, I have corrected the typo. Will push up shortly. I need to try and figure out why VV identifies the transcript. I suspect that, normalised in one direction, the deletion can be pushed entirely into the transcript reference sequence. Want to understand what's going on first. As for the coordinates outside of transcript reference sequences, the Working Group cannot reach a conclusion as to whether to allow it, or whether the nomenclature standard should support it. Perhaps a position relative to c.1 could be provided (although I'm not sure whether this is possible), which you can then choose to format into whatever you like? :)
Thank you!
Sounds good! I don't mind breaking HGVS rules by myself, as long as the working group cannot come up with a solution. It seems obvious to me that in some way we should be able to indicate the variant affects the gene, and I don't mind constructing the non-HGVS format that Mutalyzer currently uses to create some support. I could parse the positions out of an error message if I have to, otherwise, separate fields would be awesome. |
I'd go for additional JSON fields. Adding fields generally doesn't break stuff for others and is a way of creating non-standard data in a meaningful way. Gonna prioritise a few other issues first though.
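To illustrate why additive fields are the safe route, here is a minimal sketch of a client that reads the fields it already relies on and picks up extra ones only when they are present. The two extra field names and their values are purely hypothetical; nothing like them exists in the current response.

```python
def read_transcript_entry(entry):
    """Read the fields a client already depends on, plus optional extras."""
    result = {
        "t_hgvs": entry.get("t_hgvs"),
        "error": entry.get("transcript_variant_error"),
    }
    # HYPOTHETICAL additive fields, invented for this sketch only.
    for key in ("c_start_relative_to_c1", "c_end_relative_to_c1"):
        if key in entry:
            result[key] = entry[key]
    return result


# An entry shaped like the current output (no extra fields).
old_style = {
    "t_hgvs": None,
    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
}
# The same entry with the hypothetical fields added (made-up values).
new_style = dict(old_style, c_start_relative_to_c1=-100, c_end_relative_to_c1=250)

print(read_transcript_entry(old_style))
print(read_transcript_entry(new_style))
```

A client written before the fields were added keeps working unchanged, because it never looks at keys it does not know about.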
Great, thank you! And no worries, there's plenty left for me to work on 😬 |
@leicray and I also discussed this today over Skype;
These are the downsides that I see of simply not supporting variants partially falling outside of a gene:
I argue the downsides of having no support outweigh not having an answer to the question "how far from the gene should we support NC-based positions?". Therefore, after discussing this with @leicray, I suggest using the uncertain positions notation for this; for instance, as seen in Pros:
Perhaps I should shoot this into the HGVS committee, but I'm not part of those email conversations. |
…to a transcript. - If variants partially map outside of a transcript, VV currently does not return a transcript mapping, but also no liftover. As such, we can do nothing. - See openvar/variantValidator#173. - Silently skipping these variants for now.
Although the original example no longer throws an error, |
That's odd. Wonder if I forgot to upload a version/merge some changes. I'll take a look ASAP. Sorry for the delays @ifokkema. Having to do some of the boring stuff currently!
Thanks and, no worries! I'm running my VV verification script now on the GV shared LOVD, so I might run into more weird things... |
Another example, but strangely enough, Mutalyzer indicates |
I suspect that this might be a gene containing an alignment gap, because Alamut Visual does not show any alignments against RefSeq transcripts. What's more, the genome and transcript position numbers do not align. Perhaps easier to SHOW you this than to try and explain. How about a Skype chat tomorrow?
I have just also noticed that the error message reads:
instead of
|
Normally, VV would throw a warning for this. Now, the Perhaps this transcript has a very specific problem, but I think in general this is caused by the basic mapping mechanism (hgvs python module?) allowing for some distance between the variant and the transcript (like Mutalyzer allows 5000 bp upstream and 2000 bp downstream), but then VV not allowing any position outside of the transcript.
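A minimal sketch of the mismatch I mean, assuming a selection step that tolerates flanking sequence and a mapping step that does not. The function, the coordinates and the plus-strand transcript are all made up for illustration; only the 5000/2000 bp flanks come from Mutalyzer's behaviour as mentioned above. I don't know whether VV or the hgvs module actually work this way.

```python
def overlaps(tx_start, tx_end, var_start, var_end, upstream=0, downstream=0, strand=1):
    """True if the variant overlaps the transcript span extended by the flanks.

    Coordinates are inclusive genomic positions. On the minus strand,
    'upstream' extends towards higher genomic coordinates.
    """
    if strand == 1:
        lo, hi = tx_start - upstream, tx_end + downstream
    else:
        lo, hi = tx_start - downstream, tx_end + upstream
    return var_start <= hi and var_end >= lo


# Hypothetical transcript at 10000-20000 (plus strand), variant 1000 bp upstream.
tx_start, tx_end = 10000, 20000
variant = (9000, 9000)

selected = overlaps(tx_start, tx_end, *variant, upstream=5000, downstream=2000)
mappable = overlaps(tx_start, tx_end, *variant)  # strict: no flanks allowed

print(selected, mappable)
```

If the first check decides which transcripts to report and the second decides whether a description can be built, a variant in the flank ends up with a transcript but no mapping.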
Hmm... yeah, this one I don't understand. I'm free tomorrow at any time, from 8:30 UK time. Just let me know ~30 minutes in advance, if possible.
Yes, indeed; Pete already fixed it, but somehow this change didn't find its way onto the server.
Only if the position queried hits the gap.
Yep, pretty sure it's fixed. Can't replicate locally :) |
HGVS does not allow for any positions outside of the transcript. Mutalyzer is, as usual, non-compliant with this rule. That being said, it is possible our alignment for NM_017940.4 might be inaccurate. The updated UTA will fix this. I'll check.
For {
"NC_000001.10:g.1573181C>G": {
"errors": [],
**"flag": "processing_error",**
"NC_000001.10:g.1573181C>G": {
"p_vcf": "1:1573181:C:G",
"g_hgvs": "NC_000001.10:g.1573181C>G",
"selected_build": "GRCh37",
"genomic_variant_error": null,
"hgvs_t_and_p": {
"NM_033488.1": {
"t_hgvs": null,
"p_hgvs_tlc": null,
"p_hgvs_slc": null,
"gapped_alignment_warning": null,
"gap_statement": null,
"transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
"primary_assembly_loci": {
"grch37": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
"vcf": {
"chr": "1",
"pos": "1573181",
"ref": "C",
"alt": "G"
}
}
},
"hg19": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
"vcf": {
"chr": "chr1",
"pos": "1573181",
"ref": "C",
"alt": "G"
}
}
},
"grch38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
"vcf": {
"chr": "1",
"pos": "1637819",
"ref": "C",
"alt": "G"
}
}
},
"hg38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
"vcf": {
"chr": "chr1",
"pos": "1637819",
"ref": "C",
"alt": "G"
}
}
}
},
"alt_genomic_loci": {
"grch37": {},
"hg19": {},
"grch38": {},
"hg38": {}
}
},
"NM_033489.1": {
"t_hgvs": null,
"p_hgvs_tlc": null,
"p_hgvs_slc": null,
"gapped_alignment_warning": null,
"gap_statement": null,
"transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
"primary_assembly_loci": {
"grch37": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
"vcf": {
"chr": "1",
"pos": "1573181",
"ref": "C",
"alt": "G"
}
}
},
"hg19": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
"vcf": {
"chr": "chr1",
"pos": "1573181",
"ref": "C",
"alt": "G"
}
}
},
"grch38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
"vcf": {
"chr": "1",
"pos": "1637819",
"ref": "C",
"alt": "G"
}
}
},
"hg38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
"vcf": {
"chr": "chr1",
"pos": "1637819",
"ref": "C",
"alt": "G"
}
}
}
},
"alt_genomic_loci": {
"grch37": {},
"hg19": {},
"grch38": {},
"hg38": {}
}
},
"NM_033486.1": {
"t_hgvs": null,
"p_hgvs_tlc": null,
"p_hgvs_slc": null,
"gapped_alignment_warning": null,
"gap_statement": null,
"transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
"primary_assembly_loci": {
"grch37": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
"vcf": {
"chr": "1",
"pos": "1573181",
"ref": "C",
"alt": "G"
}
}
},
"hg19": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
"vcf": {
"chr": "chr1",
"pos": "1573181",
"ref": "C",
"alt": "G"
}
}
},
"grch38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
"vcf": {
"chr": "1",
"pos": "1637819",
"ref": "C",
"alt": "G"
}
}
},
"hg38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
"vcf": {
"chr": "chr1",
"pos": "1637819",
"ref": "C",
"alt": "G"
}
}
}
},
"alt_genomic_loci": {
"grch37": {},
"hg19": {},
"grch38": {},
"hg38": {}
}
},
"NM_033493.1": {
"t_hgvs": null,
"p_hgvs_tlc": null,
"p_hgvs_slc": null,
"gapped_alignment_warning": null,
"gap_statement": null,
"transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
"primary_assembly_loci": {
"grch37": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
"vcf": {
"chr": "1",
"pos": "1573181",
"ref": "C",
"alt": "G"
}
}
},
"hg19": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
"vcf": {
"chr": "chr1",
"pos": "1573181",
"ref": "C",
"alt": "G"
}
}
},
"grch38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
"vcf": {
"chr": "1",
"pos": "1637819",
"ref": "C",
"alt": "G"
}
}
},
"hg38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
"vcf": {
"chr": "chr1",
"pos": "1637819",
"ref": "C",
"alt": "G"
}
}
}
},
"alt_genomic_loci": {
"grch37": {},
"hg19": {},
"grch38": {},
"hg38": {}
}
},
"NM_033492.1": {
"t_hgvs": null,
"p_hgvs_tlc": null,
"p_hgvs_slc": null,
"gapped_alignment_warning": null,
"gap_statement": null,
"transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
"primary_assembly_loci": {
"grch37": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
"vcf": {
"chr": "1",
"pos": "1573181",
"ref": "C",
"alt": "G"
}
}
},
"hg19": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
"vcf": {
"chr": "chr1",
"pos": "1573181",
"ref": "C",
"alt": "G"
}
}
},
"grch38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
"vcf": {
"chr": "1",
"pos": "1637819",
"ref": "C",
"alt": "G"
}
}
},
"hg38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
"vcf": {
"chr": "chr1",
"pos": "1637819",
"ref": "C",
"alt": "G"
}
}
}
},
"alt_genomic_loci": {
"grch37": {},
"hg19": {},
"grch38": {},
"hg38": {}
}
},
"NM_033487.1": {
"t_hgvs": null,
"p_hgvs_tlc": null,
"p_hgvs_slc": null,
"gapped_alignment_warning": null,
"gap_statement": null,
"transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
"primary_assembly_loci": {
"grch37": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
"vcf": {
"chr": "1",
"pos": "1573181",
"ref": "C",
"alt": "G"
}
}
},
"hg19": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
"vcf": {
"chr": "chr1",
"pos": "1573181",
"ref": "C",
"alt": "G"
}
}
},
"grch38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
"vcf": {
"chr": "1",
"pos": "1637819",
"ref": "C",
"alt": "G"
}
}
},
"hg38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
"vcf": {
"chr": "chr1",
"pos": "1637819",
"ref": "C",
"alt": "G"
}
}
}
},
"alt_genomic_loci": {
"grch37": {},
"hg19": {},
"grch38": {},
"hg38": {}
}
}
}
}
},
"metadata": {
"variantvalidator_version": "1.0.4.dev11+g97aec97",
"variantvalidator_hgvs_version": "1.2.5.vv1",
"uta_schema": "uta_20180821",
"seqrepo_db": "/Users/Shared/seqrepo_dumps/2018-08-21",
"variantformatter_version": "1.0.2.dev7+gbdcbbe9.d20200316"
}
}
CDK11B is indeed listed as a gapped alignment. The alignment is a total mess. According to UCSC
Therefore the alignment provided by Mutalyzer will be absolute pants!!! :) That being said, our alignment needs looking at, but only once we have re-created UTA.
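For anyone who wants to pick these cases out of the LOVD endpoint output automatically, here is a rough sketch that walks the response structure shown above and lists the transcripts that were selected for a variant but came back with a transcript_variant_error and no t_hgvs. The helper name is my own; the sample dictionary is a trimmed copy of the response above, keeping only the fields the function reads.

```python
def unmapped_transcripts(response):
    """Return {variant: [transcripts]} for transcripts that were selected
    for a variant but carry a transcript_variant_error and no t_hgvs."""
    failed = {}
    for variant, outer in response.items():
        if variant == "metadata":
            continue
        # The endpoint nests the per-variant details under the same key again.
        details = outer.get(variant, {})
        for tx, tx_data in (details.get("hgvs_t_and_p") or {}).items():
            if tx_data.get("t_hgvs") is None and tx_data.get("transcript_variant_error"):
                failed.setdefault(variant, []).append(tx)
    return failed


ERROR = "start or end or both are beyond the bounds of transcript record"

# Trimmed copy of the response above (two of the six transcripts).
response = {
    "NC_000001.10:g.1573181C>G": {
        "errors": [],
        "flag": "processing_error",
        "NC_000001.10:g.1573181C>G": {
            "hgvs_t_and_p": {
                "NM_033488.1": {"t_hgvs": None, "transcript_variant_error": ERROR},
                "NM_033489.1": {"t_hgvs": None, "transcript_variant_error": ERROR},
            }
        },
    },
    "metadata": {"uta_schema": "uta_20180821"},
}

print(unmapped_transcripts(response))
```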
Hmm... So perhaps then the method that checks which transcripts should be aligned to, thinks the variant aligns, but then the actual alignment algorithm finds they don't align... Something like that?
Uff... haha, good to know 😉 |
For VV gives {
"flag": "intergenic",
"intergenic_variant_1": {
"alt_genomic_loci": [],
"gene_ids": {},
"gene_symbol": "",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"slr": "",
"tlr": ""
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "",
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000001.10:g.16891340T>A",
"vcf": {
"alt": "A",
"chr": "1",
"pos": "16891340",
"ref": "T"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000001.11:g.16564845T>A",
"vcf": {
"alt": "A",
"chr": "1",
"pos": "16564845",
"ref": "T"
}
},
"hg19": {
"hgvs_genomic_description": "NC_000001.10:g.16891340T>A",
"vcf": {
"alt": "A",
"chr": "chr1",
"pos": "16891340",
"ref": "T"
}
},
"hg38": {
"hgvs_genomic_description": "NC_000001.11:g.16564845T>A",
"vcf": {
"alt": "A",
"chr": "chr1",
"pos": "16564845",
"ref": "T"
}
}
},
"reference_sequence_records": "",
"refseqgene_context_intronic_sequence": "",
"selected_assembly": "GRCh37",
"submitted_variant": "NC_000001.10:g.16891340T>A",
"transcript_description": "",
"validation_warnings": [
"No transcripts found that fully overlap the described variation in the genomic sequence"
]
},
"metadata": {
"seqrepo_db": "2018-08-21",
"uta_schema": "uta_20180821",
"variantvalidator_hgvs_version": "1.2.5.vv1",
"variantvalidator_version": "1.0.4.dev42+gbb2b0b7"
}
}
Again, there seems to be something odd with the alignment of NBPF1. We do not have it down as a gap gene, but the UCSC status is
LOVD EP produces
{
"NC_000001.10:g.16891340T>A": {
"errors": [],
"flag": "processing_error",
"NC_000001.10:g.16891340T>A": {
"p_vcf": "1:16891340:T:A",
"g_hgvs": "NC_000001.10:g.16891340T>A",
"selected_build": "GRCh37",
"genomic_variant_error": null,
"hgvs_t_and_p": {
"NM_017940.4": {
"t_hgvs": null,
"p_hgvs_tlc": null,
"p_hgvs_slc": null,
"gapped_alignment_warning": null,
"gap_statement": null,
"transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
"primary_assembly_loci": {
"grch37": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.16891340T>A",
"vcf": {
"chr": "1",
"pos": "16891340",
"ref": "T",
"alt": "A"
}
}
},
"hg19": {
"NC_000001.10": {
"hgvs_genomic_description": "NC_000001.10:g.16891340T>A",
"vcf": {
"chr": "chr1",
"pos": "16891340",
"ref": "T",
"alt": "A"
}
}
},
"grch38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.16564845T>A",
"vcf": {
"chr": "1",
"pos": "16564845",
"ref": "T",
"alt": "A"
}
}
},
"hg38": {
"NC_000001.11": {
"hgvs_genomic_description": "NC_000001.11:g.16564845T>A",
"vcf": {
"chr": "chr1",
"pos": "16564845",
"ref": "T",
"alt": "A"
}
}
}
},
"alt_genomic_loci": {
"grch37": {},
"hg19": {},
"grch38": {},
"hg38": {}
}
}
}
}
},
"metadata": {
"variantvalidator_version": "1.0.4.dev11+g97aec97",
"variantvalidator_hgvs_version": "1.2.5.vv1",
"uta_schema": "uta_20180821",
"seqrepo_db": "/Users/Shared/seqrepo_dumps/2018-08-21",
"variantformatter_version": "1.0.2.dev7+gbdcbbe9.d20200316"
}
}
So again the transcript is identified. I'll do some additional digging. I am still thinking our alignment may need an update though.
OK, so here is our alignment for [['NM_017940.4', 'NC_000001.10', -1, 'splign', 16888921, 16940100]]. It's a total mess of a gene. I would not trust UCSC or Mutalyzer. I suspect our alignment may also need an update, but it goes to show how bad this gene is. I'll look at the actual positions later to see why the genomic position entered is currently being returned as out-of-bounds, then check the RefSeq alignment.
OK, so this is odd, right? 16891340 doesn't overlap with any exon there, but it does with your genomic range above. That range is bigger by 4753 bases downstream of the gene (5' in the chromosome). I don't know what data comes from where, but shouldn't the numbers you mention there simply be based on the smallest and largest positions in the list of exon alignments? |
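To make the question concrete, a small sketch of the check I have in mind: derive the transcript span from the exon alignments and compare it with the stored range. The range is the one quoted above for NM_017940.4; the exon coordinates are HYPOTHETICAL, invented only so the example runs, and are not the real alignment. The sketch also ignores the CIGAR and treats coordinates as inclusive.

```python
def exon_span(exons):
    """Smallest start and largest end over a list of (start, end) exon pairs."""
    return min(start for start, _ in exons), max(end for _, end in exons)


def within(pos, start, end):
    """True if pos lies in the inclusive interval [start, end]."""
    return start <= pos <= end


# Transcript range as quoted in this thread.
region_start, region_end = 16888921, 16940100

# HYPOTHETICAL exon alignments (not the real NM_017940.4 exons).
exons = [(16893700, 16893900), (16910000, 16910250), (16939900, 16940100)]

pos = 16891340
span_start, span_end = exon_span(exons)

print("inside stored range:", within(pos, region_start, region_end))
print("inside exon-derived span:", within(pos, span_start, span_end))
print("stored range matches exon span:", (region_start, region_end) == (span_start, span_end))
```

When the stored range and the exon-derived span disagree, a position can pass the first test and fail the second, which is the behaviour we are seeing.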
No, that's genome annotation. We are providing alignments, hence the CIGAR MUST be considered, which tools like UCSC, Mutalyzer and others fail to do. I need to look more closely, but the alignment is a mess. Will look ASAP.
Of course, you should consider the CIGAR, but that's not my point - you have the strange situation where VV reports a transcript to overlap a position, and then it says it doesn't. Well, your transcript lengths are conflicting right there, so doesn't that explain the strange behavior in VV?
No stress, take your time on this. I'm running VV on our entire LOVD instance, I'll be busy for a long time with this... |
Absolutely, agree with "Well, your transcript lengths are conflicting right there, so doesn't that explain the strange behavior in VV?" Need a puzzle to solve :)
Sorry, I see what you mean now. This is really unusual. I'm going to look into how the hgvs library fetches that range. |
So the method used is in here:
def get_tx_for_region(self, alt_ac, alt_aln_method, start_i, end_i):
    """
    return transcripts that overlap given region
    :param str alt_ac: reference sequence (e.g., NC_000007.13)
    :param str alt_aln_method: alignment method (e.g., splign)
    :param int start_i: 5' bound of region
    :param int end_i: 3' bound of region
    """
    return self._fetchall(self._queries['tx_for_region'], [alt_ac, alt_aln_method, start_i, end_i])
So the range is, as you suspected, the genomic coordinate bounds for the transcript. So the question is: why is there no exon either side of this position? The query position is
vfo.hdp.get_tx_for_region(hgvs_genomic.ac, 'splign', hgvs_genomic.posedit.pos.start.base,
                          hgvs_genomic.posedit.pos.end.base - 1)
where hgvs_genomic is NC_000001.10:g.16891340T>A
So I suspect the errors are with UTA. Will email this link to Jon to look at in the new build. Jon, note we are looking at variants NC_000001.10:g.16891340T>A and NC_000001.10:g.1573181C>G |
Alignments for second variant
And
Again, it looks like the range provided by hdp.get_tx_for_region doesn't match the range provided in the alignments. We also need to cross-reference against RefSeq.
Here, the difference is even a staggering 15719 basepairs. I'm really curious where this comes from... |
We think it is a balls up when the transcript record was updated. We think the alignment data was updated but the range data returned by vfo.hdp.get_tx_for_region were not. This will be fixed with our UTA build. John is already looking into it. FYI, I think NM_017940.4 has been deprecated. |
Just to link; related to #152. |
Just to update; I have just sent out a survey to all our curators in the GV shared LOVD; they're asked to vote on the proposed description on cDNA level for variants (partially) outside of the transcript's reference sequence. I'll keep you updated on the results 😉 |
Another variant that voluntarily maps to transcripts (10, in this case), and then throws a |
That's very strange. If I validate NC_000007.13:g.50468071G>A using the interactive validator, it returns the warning: |
That is what the API says, I always use the APIs now. See the link that I posted (the variant holds the link). I bet the interface discards all ten transcripts before reporting that there is no overlap. |
The LOVD endpoint and the VV endpoint have slightly different processing. I still believe this is an alignment issue which John's work will correct.
OK, cool! Would you like more examples when I run into them? Or perhaps that's just overkill... I really think that once John is done with rebuilding the database, I should just run all verifications again 😅 It's quite slow at the moment though (meaning, we have a lot of data); about 2% of the database contents is checked per day. |
In case you're interested, here's another one, but special because it also shows that UTA misses a new transcript version there. The version 2 of the transcript does in principle work, because |
If you can be bothered to post them then yes please because I add them to the PyTests |
Update: These are the results of the survey: A majority of curators wants to see a variant description that acknowledges the size of the variant. I was reminded by one of the curators of the previously suggested nomenclature: "c.3887_*1017+d31817del". The HGVS earlier rejected this proposal, but well, having a description at all for this variant has been dismissed by the HGVS. This option, together with option B above, would also allow us to map back to the genome, as long as we store which reference sequence was used to describe the variant in the first place (like "NC_000016.9(NM_000296.3):c.3887_*32834del"). In that sense, it'll be just like an intronic variant. |
Although this issue started with reporting the typo in the error, we also discussed handling variants that (partially or wholly) fall outside of genes. What do you think we should do with that? Or do you want me to submit a new report? E.g., Currently, VV returns no mapping at all, even though it partially overlaps a gene (see above for some possible descriptions that we can use). |
I think open a new report @ifokkema. |
P.S. Does c.3887_*32834del go beyond the length of the transcript? If so, I will not accept this, as it is an illegal description: you cannot just infer bases in a reference sequence that do not exist.
Sure, I'll open a new request! |
Open the new issue and I will comment there |
Excellent! Opened #333. |
On the LOVD endpoint, variants crossing gene boundaries, such as the BRCA1 promoter deletion variant NC_000017.10:g.41271863_41308933del, generate the flag porcessing_error (also note the typo) and a transcript_variant_error value of "start or end or both are beyond the bounds of transcript record". The mapping on NM_007294.3 was provided by VV itself, so I'd argue it should then be able to handle the mapping. Mutalyzer claims this variant should map to NC_000017.10(NM_007294.3):c.-31665_81-4067del, although I understand that these mapped positions cannot be represented in this NM record.
I'm not sure if you'd want to support this notation; the HGVS nomenclature website clearly states:
Not sure if there was an outcome to this, but obviously, LOVD is massively breaking these regulations because some kind of mapping needs to be given to indicate this genomic variant has an effect on the expression of the gene. Any thoughts?