-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combining allelic variants prior to generating the protein-level description #511
Comments
@leicray The Variant Nomenclature Committee now has a GitHub page. https://github.com/HGVSnomenclature/hgvs-nomenclature/issues?q= Think this should map to there now. This is an open topic I believe |
I'm curious how this was resolved? I mean, is this meant as a "won't fix" then? I think it's a very interesting use case to have one single protein description from a combined allele. |
Not currently resolved @ifokkema. Closed because as things stand the HGVS rule is being applied. Variants that can be brought proximal to each other are merged. Those then cannot are not, so nothing (currently) to do. HGVS rules currently state that these are separate variants. I agree with Raymond that for the case of perdicting the p. outcome, they should be allowed to be merged, but was overruled in the Variant Nomenclature Committee. |
Actually, I disagree with that. 😉 These variants overlap on protein level, and therefore, since they're given as one allele, I believe they should be returned as one allele and not as two separate predictions. Each variant should be considered separately on each level. Meaning, two variants far enough apart on DNA level to be considered two separate variants, on protein level may be directly adjacent. When that happens, HGVS nomenclature states they should be merged on protein level (assuming we're talking about an allele here). They shouldn't be merged on DNA level, as each level is considered separately. Since That said, separately from the discussion of whether or not the protein variants overlap, I also believe that since the user provided one variant description (an allele), they should not get two responses. Even if the two DNA variants in the allele do not overlap on protein level, I believe an allele syntax should be returned. Now, VV behaves as if the allele syntax is a way to submit a batch query. |
The fact remains that the consequence of the two individual frameshifts is that there is no frameshift at all. The Whatever we decided about how to describe the allelic variants at the transcipt level, the protein-level description ought to be |
Don't get me wrong, i disagree too, and brought this up un the HGVS committee and was over-ruled. However, this might be a good case to bring this up again and try and tweak the rules. VV currently only merges c. descriptions if they normalize together. This is not incorrect! However, a dimension that has not been considered is whether when 2 separate variants are cponsidered at the c. level, whether the p. needs to be merged. In this case it may be. That does sound like it should apply. I can prototype it. |
I ran into this when reviewing the HGVS nomenclature pages:
(source) It seems the nomenclature actually understands that contradicting protein descriptions in one allele doesn't make sense. They even went as far as merging variants on the DNA level when they affect one codon. It seems to me the nomenclature supports this cause. |
Think we need to re-open this then. I have been working on the allele syntax so now is a good time.
Interesting. This is one that can certainly be coded up as it is a fixed and simple rule with an example
For variants like this, My thought is, could we use the r. syntax as an intermediate. Then we can maintain the "correct" nomenclature for c. i.e. not merge, but also generate a delins for the r. and additionally give the protein consequence based on the r. nomenclature |
The solution that you propose for dup and del variant makes perfect sense. |
Another interesting find: https://groups.google.com/g/hgvs-nomenclature/c/xBJg2sNIM_s Johan clearly combines the two DNA variants into one single protein variant, so I'd say we're good to go. We wouldn't even need the RNA for this (but it could work). |
Is that the agreed opinion of the HGVS nomenclature committee? If so, it contradicts what's said on the web site regarding the maximum separation of the two allelic variants. |
The RNA would not be needed for c.[145C>T;147C>G] However, I think it is more appropriate to use for NM_000059.4:c.[1916dup;1929del] where existing HGVS rules would prevent merging. |
I believe so, and no, it doesn't. Digging even further, I actually found this on the protein frameshift page:
This seems clear to me. Only one allele should be described, and it should describe the combined effect of the two variants. |
So far I have "submitted_variant": "NM_004006.2:c.[145C>T;147C>G]",
"transcript_description": "",
"validation_warnings": [
"AlleleSyntaxError: Variants [145C>T;147C>G] should be merged into NM_004006.2:c.145_147delinsTGG"] Which covers the c.[145C>T;147C>G] case. However, it also expands the logic so that for any variants where the last base of the 5 prime variant and the first base of the 3 prime variant are in the same amino acid, then we merge However, is this against the current HGVS rules. In my opinion it is unless both variants are types that cannot be normalized e.g. substituttions. |
This may be clear, but it is not particularly practical to code up. I'm also looking at the practicality of what is being stated. In my opinion, we merge if 2 frame shifts restore the original reading frame. That makes sense. The rest of the scenarios listed in these rules would only really lead to predictions as to whether NMD would occur, but we are still gonna get LOF. We can revisit this if we think that it is worth committing time to, but my opinion is that we at this stage only merge variants if the original frame is restored. |
This then raises another question. What if we have a non frameshifting variant between the 2 frame-shifts, these likely need to become part of the merge. Its begnning to look to me as if we need to always merge if we see 2 frameshifts in a list of alleles but this would not really comply with strict HGVS rules. Not entirely sure where to go with this and what is feasible |
I agree entirely that we ought to combine two frameshifting variants where their combination results in restoration of the reading frame beyond the second variant. If possible, we ought to also accommodate any non-frameshifting variants that lie between the two frameshifts. I say this knowing that this might be contrary to the HGVS nomenclature guidelines, but sometimes the law is an ass. |
I believe that this is part of the new proposal, which hasn't been finished yet. Do you know the status of this?
I can't comment on how hard it is to program in VV since I don't know its code; but I would try to prevent generating incorrect HGVS. A simple rule would be that an allele syntax on protein level is incorrect if the non-last variant is a frameshift variant. Because that means later variants are given relative to a sequence that no longer exists.
Anything following a frameshift variant should become part of the merge.
It fully complies; the frameshift causes all following variants to overlap with the frameshift variant.
This would be fully HGVS-compliant, but it should include any variant that follows a frameshift variant. |
I'm beginning to get confused here.
Does this mean every frameshift? What if another allelic variant lies 100 nucleotides downstream of the second frameshift in the scenario that we are discussing. I suspect that this might need a TC rather than an exchange of comments. |
I have managed to write code that looks for the following pattern
However,
|
When there's a second frameshift that restores the reading frame, there's essentially no frameshift anymore, and you get a deletion-insertion. So I meant frameshift, as a frameshift whose frame stays shifted 😅
That's also an option.
Basically, variants overwriting each other or touching each other on the protein level, should be merged. E.g.,
|
These 2 I can do for now. |
Describe the bug
A user has queried why the variant description
NM_000059.4:c.[1916dup;1929del]
is handled as two separate variants for the purposes of validation:NM_000059.4:c.1916dup
predicting protein-level descriptionNP_000050.3:p.(Leu639PhefsTer13)
, andNM_000059.4:c.1929del
predicting protein-level descriptionNP_000050.3:p.(Arg645GlufsTer15)
.Treated separately, each predicts a frameshift, but if the pair of variants are considered together as a delins variant, notwithstanding their separation, the predicting protein-level description becomes
NP_000050.3:p.(Leu639_Val643delinsPheAlaPhePheCys)
.Examples such as this need to be handled more intelligently. This probably requires that allelic descriptions be combined before generating the protein-level description.
The text was updated successfully, but these errors were encountered: