
Add mapping support for variants only partially intergenic #333

Open
ifokkema opened this issue Jan 20, 2022 · 8 comments

@ifokkema
Collaborator

Is your feature request related to a problem? Please describe.
For LOVD to store a gene-specific effect of a variant, LOVD must store the mapped gene-level representation of that variant. While it is understandable that intergenic variants cannot be mapped to genes, variants that do overlap genes should always have a gene-level representation, even if they are also partially intergenic. However, variants that delete an entire gene, with both endpoints of the deletion outside the gene's bounds, currently do not report any mapping (tested on the LOVD endpoint and the VV endpoint). Variants that delete part of a gene, with one endpoint outside the gene's bounds, do not report any mapping either. E.g., NC_000016.9:g.2106894_2161281del.
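
To make the two failing cases concrete, here is a minimal sketch that classifies how a genomic deletion overlaps a gene. The function name, the category labels, and the gene bounds in the usage lines are illustrative assumptions; none of them come from LOVD or VariantValidator.

```python
def classify_overlap(var_start: int, var_end: int,
                     gene_start: int, gene_end: int) -> str:
    """Classify a deletion against a gene, using inclusive genomic coordinates."""
    # No shared base at all: nothing to map on the gene level.
    if var_end < gene_start or var_start > gene_end:
        return "intergenic"
    # Fully inside the gene: the case that already maps today.
    if var_start >= gene_start and var_end <= gene_end:
        return "within_gene"
    # Overlaps the gene, but both endpoints lie outside its bounds.
    if var_start < gene_start and var_end > gene_end:
        return "both_endpoints_outside"
    # Overlaps the gene with exactly one endpoint outside its bounds.
    return "one_endpoint_outside"


# Hypothetical gene spanning positions 1000..2000.
print(classify_overlap(500, 2500, 1000, 2000))
print(classify_overlap(1500, 2500, 1000, 2000))
```

The last two categories are the ones for which this issue requests a gene-level mapping.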

Describe the solution you'd like
For LOVD to "discover" an effect on the gene, VV should return a mapping. The problem, however, is that the HGVS nomenclature currently has no valid rules for describing such a variant.
We conducted a poll among 400 LOVD curators, asking them what description would be best to be used on the gene level. It was highlighted to them that none of the possibilities were HGVS compliant, so they were purely asked about their preference. In total, 54 curators replied.

Note that for all given descriptions, the intended reference sequence is NC_000016.9(NM_001009944.2), the same construct that is used for intronic variants.
The given options were:

A. "-" (an empty description)
This gives no details on what region of the coding DNA reference sequence is affected.

B. "c.3887_*32834del"
This is the current output of the Mutalyzer tool, which is linked to the LOVD database to check and create variant descriptions. Mutalyzer maps the variant's endpoint assuming c.* numbering continues forever. It gives a clear indication of the full deletion's size but is not supported by HGVS.

C. "c.3887_*1017del"
This does give details on what region of the coding DNA reference sequence is affected (c.*1017 is the last base of this reference sequence), but it suggests the deletion has been sequenced as c.3887_*1017del. Since, in fact, the deletion extends beyond c.*1017, this description is not correct.

D. "c.3887_(*1017_?)del"
This does give details on what region of the coding DNA reference sequence is affected and shows more sequence has been deleted, although it suggests the endpoint of the deletion is not known while it is.

E. "c.3887_*1017[0]"
This new format does give details on what region of the coding DNA reference sequence is affected, and the [0] suggests it is present in 0 copies, so deleted. However, the format may be confused with the HGVS allele format, which also uses []. NOTE: For a duplication, we would use [2].

F. "c.3887_*1017{0}"
This new format does give details on what region of the coding DNA reference sequence is affected, and the {0} suggests it is present in 0 copies. Since HGVS does not use {}, there cannot be any confusion. NOTE: For a duplication, we would use {2}.

Note that, in response to the survey, Peter Taschner suggested another option:
G. "c.3887_*1017+d31817del"
This has been proposed before but was rejected by the HGVS. It indicates clearly the extent of the deletion, including the extent of the reference sequence, and more closely resembles the intronic variant notation.
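
Options B and G encode the same endpoint: the position in B equals the last base of the reference sequence (c.*1017) plus the downstream offset in G, since 1017 + 31817 = 32834. A minimal sketch of that conversion follows; the function names are made up for illustration and are not part of Mutalyzer or VariantValidator.

```python
def to_offset_form(star_pos: int, utr3_len: int) -> str:
    """Rewrite an option-B style position (e.g. *32834) in option-G style."""
    if star_pos <= utr3_len:
        # Still inside the reference sequence: no downstream offset needed.
        return f"*{star_pos}"
    return f"*{utr3_len}+d{star_pos - utr3_len}"


def to_extended_form(utr3_len: int, downstream_offset: int) -> str:
    """Rewrite an option-G style position in option-B style (c.* continues)."""
    return f"*{utr3_len + downstream_offset}"


print(to_offset_form(32834, 1017))    # *1017+d31817
print(to_extended_form(1017, 31817))  # *32834
```

Because the conversion is lossless in both directions, the choice between B and G is one of notation only. Options C to F drop the offset altogether.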

The results were as follows:
[Image: survey_results_2020-07-06]

I am also wary of generating any description that cannot be mapped back to the genome. Options A, C, D, E, and F cannot be mapped back to the genome if their source was the transcript, so information is lost. Personally, I feel that the objection "we cannot describe positions not mentioned in the reference sequence" is solved by using the NC(NM) construct, just like intronic variants are handled now. I haven't heard any argument against this approach that would not apply equally to how we describe intronic variants.
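
The information loss can be sketched as follows. Several values here are assumptions, not facts from this thread: the transcript is taken to lie on the minus strand, and the genomic position of its last base (2138711) is back-calculated by assuming that the example deletion g.2106894_2161281del is the variant behind option B. The function name is hypothetical, and the linear mapping only holds at or beyond the last base of the transcript.

```python
def downstream_to_genomic(star_pos: int, utr3_len: int,
                          tx_end_genomic: int, strand: str) -> int:
    """Map c.*<star_pos> (at or beyond the transcript end) to a genomic position."""
    if star_pos < utr3_len:
        raise ValueError("sketch only handles positions at or beyond the transcript end")
    offset = star_pos - utr3_len  # bases past the last base of the transcript
    # Downstream on the minus strand means decreasing genomic coordinates.
    return tx_end_genomic - offset if strand == "-" else tx_end_genomic + offset


TX_END = 2138711  # ASSUMED genomic position of c.*1017, see the note above

# Option B keeps the full endpoint, so it maps back to the original coordinate.
print(downstream_to_genomic(32834, 1017, TX_END, "-"))  # 2106894
# Options C to F only carry c.*1017, so they map back to the transcript end.
print(downstream_to_genomic(1017, 1017, TX_END, "-"))   # 2138711
```

Under these assumptions, only descriptions that retain the downstream offset (options B and G) recover g.2106894.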

Describe alternatives you've considered
Note that Mutalyzer currently uses option B and that descriptions like these are currently widespread in LOVD.

Additional context
Note that Johan decided to ignore the wishes of the curators and implement option F in the GV shared LOVD. For many "new" submissions (up to one and a half years old or so), option F is used and not B.

@Peter-J-Freeman
Collaborator

My initial thought on this, @ifokkema and @leicray, is to set up a video call. I have huge concerns about reopening the can of worms of allowing variant descriptions in the context of transcript reference sequences to describe variation beyond the boundaries of the reference sequence. It is not a good idea, so I think we need to have a very good think about this and make recommendations for the HGVS SVD group.

@ifokkema
Collaborator Author

Sounds good! Another thing that popped into my head is that this is also related to fusion transcripts. Deletions like these can cause fusion transcripts, and those do have a transcript-based description. So we might also go in that direction, even though that only covers deletions that remove part of a gene and doesn't solve whole-gene deletions yet.
On a related note: did "recruiting" for the SVD group already start? I'm interested in joining. Same for the VIJ group. Even though I'm already incredibly busy, it's important for me to be involved in these.

@Peter-J-Freeman
Collaborator

Fusions are on the agenda for description formats that we need to crack. @leicray has certainly been working in this area. We should definitely talk about those too. Let's sort out some dates via email.

Not sure about the SVD recruitment. Another thing to chat about.

@leicray
Contributor

leicray commented Jan 25, 2022

Please include me in any proposed chat session.

@Peter-J-Freeman
Collaborator

You are needed

@ifokkema
Collaborator Author

ifokkema commented Oct 7, 2024

As a heads-up: the HVNC just decided that in a genomic context, the transcript coordinate system (NM:c) can be used to indicate upstream and downstream positions, just as intronic positions are currently described.

@Peter-J-Freeman
Collaborator

Peter-J-Freeman commented Oct 8, 2024

@ifokkema Thanks for the heads up. I'll add this to the to-do list. I object by the way, but rules is rules :P

Can I ask you to open a specific feature request? Are there any links to this on the HGVS sites yet? Not a problem if not, we will just track it once there are.

@ifokkema
Collaborator Author

ifokkema commented Oct 8, 2024

> @ifokkema Thanks for the heads up. I'll add this to the to-do list. I object by the way, but rules is rules :P

Haha! Since it solves a major issue in LOVD, as well as this issue, I'm all for it 😅 Also, it re-aligns VV and Mutalyzer.

> Can I ask you to open a specific feature request? Are there any links to this on the HGVS sites yet? Not a problem if not, we will just track it once there are.

Yep, I just created #652. So far, we only have the issue in the HGVS Nomenclature repo; we'll work on updating the website soon. As you can imagine, quite a few pages should be adjusted.
