Update if possible the alignments for RYBP #656
Comments
Hello Pete. This is concerning: NM_012234.7 looked fine to me for GRCh38 and should be up to date. Which specific other sources? You say that you have a table that you have gone through; do you have the differences between us and the other sources, along with the source that showed each difference? As mentioned before, there is, or was, a UTA issue with the .6 version of this transcript, so which tool is giving different results is important. It is also worth noting that there are internal length differences between .7 and .6 for this transcript, so this might also be an issue if comparing from data that uses .6 while using a tool that improperly handles old data. We do have a set of newer GRCh37 data, published just last month, that should help when I get time to load it, but only if the NM_012234.7 data is fine. If NM_012234.7 is broken too, then I need to know how, and possibly reload the data for the last few releases too!
Additional context
All new RefSeq alignments since VVTA branched from the UTA have been loaded from RefSeq with the alignment left unchanged from the official release. We did have to convert alignments to the extended CIGAR format, though this has now changed, but the extra data in that format is redundant, hence the arguments in the community over whether to use it. As such, this does not change the meaning of the alignment and can be considered a loading step, not an edit. We have a number of quality filters (no transcripts without an HGNC ID, for example), but no other transformations are made to the alignments.

Despite adding code to handle the reported differences between published Ensembl transcripts and the genome for some older data, none of the datasets we load have this issue. I suspect that these user reports were caused by the "Same ID different sequence for GRCh38 vs GRCh37" issue, which we handle by suffixing the sequences. As such, we do not change any of these alignments on load either.

We add data incrementally, hence the "each new release adds" wording, which means that unless a transcript alignment gets superseded by new alignments for the same transcript version, it won't get removed or changed. Even the old UTA data may be preserved, which has been a problem due to the issues with the "every alignment has the same exon position set" assertion built into the UTA, which we fixed. I attempted to purge every alignment broken by this issue, but without dumping all the UTA data, 100% safety from this issue is impossible, hence the issues with NM_012234.6. The UTA also rebuilds its alignments from the spans, so any data loaded before the UTA->VVTA branch point is also subject to this.
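To illustrate why the extended CIGAR conversion can be treated as a loading step rather than an edit, here is a minimal sketch. It is not the VVTA loader code: the function names are hypothetical, and it assumes SAM-style semantics (`M`, `=`, `X` consume both sequences; `D` and `N` consume only the reference; `I` and `S` consume only the query), which may not match the orientation used in every alignment source.

```python
import re

# One CIGAR element: a length followed by a single operation character.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def _push(ops, length, op):
    """Append an operation, merging it into the previous run if the op is the same."""
    if ops and ops[-1][1] == op:
        ops[-1][0] += length
    else:
        ops.append([length, op])


def to_extended_cigar(cigar, ref, query):
    """Split each M run into '=' (match) and 'X' (mismatch) runs.

    Hypothetical helper; `ref` and `query` are the aligned reference and
    transcript sequences, both starting at the first aligned base.
    """
    ops = []
    r = q = 0  # current offsets into ref and query
    for length, op in _CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "M":
            for i in range(length):
                _push(ops, 1, "=" if ref[r + i] == query[q + i] else "X")
            r += length
            q += length
        elif op in "=X":
            _push(ops, length, op)
            r += length
            q += length
        elif op in "DN":  # consumes reference only
            _push(ops, length, op)
            r += length
        elif op in "IS":  # consumes query only
            _push(ops, length, op)
            q += length
        else:  # H and P consume neither sequence
            _push(ops, length, op)
    return "".join(f"{n}{op}" for n, op in ops)


def to_standard_cigar(cigar):
    """Collapse '=' and 'X' back into M; needs no sequence data at all."""
    ops = []
    for length, op in _CIGAR_OP.findall(cigar):
        _push(ops, int(length), "M" if op in "=X" else op)
    return "".join(f"{n}{op}" for n, op in ops)
```

Because `to_standard_cigar(to_extended_cigar(c, ref, query))` returns the original string for any CIGAR that uses only merged `M` runs, the extended form adds no positional information: the match/mismatch detail can always be recomputed from the two sequences.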
AFAIK, Ensembl transcript ENST00000477973.2 maps onto GRCh37 and GRCh38 without gaps. However, the current version is .4 rather than .2. In addition, ENST00000477973.2 comprises just 4 exons, whereas ENST00000477973.4 comprises 5 exons.
@leicray please try to remain seated, but for this gene I think having at least an Ensembl transcript for each genome build is the best option. One problem: the only protein-coding transcript in Ensembl is CDS-incomplete. Ideally, we need to have an Ensembl and a RefSeq transcript for both genomes. I just hope we can do something, but the issue seems to be that the CDS cannot fully align to the genome in the places most tools expect the transcripts to be placed.
That's a bit odd. The HGNC page for RYBP provides a link to ENST00000477973.4 but not to ENST00000477973.5. I wonder why that might be. As an aside, how does one navigate on the Ensembl website from one transcript version to another version, or search the site for a given transcript version? By comparison, the NCBI site is very easy.
The only way to look at versions through time in Ensembl is to build your own Ensembl and load the different releases. They only keep live the latest GRCh37 and latest GRCh38 release.
That's not very helpful, is it? Sorry, I forgot that we were discussing Ensembl...
Wait a minute. There are 2 entries for the RYBP gene https://www.ensembl.org/Homo_sapiens/Transcript/Idhistory?db=core;g=ENSG00000281766;r=HG126_PATCH:49171-122044;t=ENST00000643872. This is the gene on a scaffold (Scaffold HG126_PATCH: 49,171-122,044, reverse strand) and has the MANE Select https://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000163602;r=3:72371825-72446621;t=ENST00000477973. This is the GRCh38 chromosomal position (Chromosome 3: 72,371,825-72,446,621), which confirms that GRCh38 is totally stuffed for this gene.
Absolutely bang on the mark! For gene version ENSG00000281766 the most up to date transcript is ENST00000643872.4. However, for gene version ENSG00000163602 the most up to date transcript is ENST00000477973.5. To sort this out, it might also be necessary to look at Ensembl's internal IDs for the individual transcripts.
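When comparing sources that may each hold a different version of the same transcript, a small bookkeeping helper can make the "most up to date version per accession" check explicit. The sketch below is illustrative only: the function names are hypothetical, and it assumes IDs follow the `ACCESSION.VERSION` pattern used in this thread.

```python
def split_versioned_id(versioned_id):
    """Split 'ENST00000477973.5' into ('ENST00000477973', 5)."""
    accession, _, version = versioned_id.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not a versioned transcript ID: {versioned_id!r}")
    return accession, int(version)


def latest_versions(versioned_ids):
    """Return {accession: highest versioned ID seen} for a list of IDs."""
    latest = {}
    for versioned_id in versioned_ids:
        accession, version = split_versioned_id(versioned_id)
        if version > latest.get(accession, -1):
            latest[accession] = version
    return {acc: f"{acc}.{ver}" for acc, ver in latest.items()}


# Transcript versions mentioned in this thread.
seen = [
    "ENST00000477973.2",
    "ENST00000477973.4",
    "ENST00000477973.5",
    "ENST00000643872.4",
]
print(latest_versions(seen))
```

This only compares the version suffixes it is given; it says nothing about which genome build or gene entry a version belongs to, which still has to be checked against the source database.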
We have a workaround that is taking shape. See #657. We could do with more thought on this gene, though.
Describe the bug
At the moment we do not have any good GRCh37 transcript alignments, RefSeq or Ensembl.
For GRCh38, we have a RefSeq transcript and an Ensembl transcript, but the alignment is very different from other sources.
To Reproduce
I have a table I can walk through
Expected behavior
Unknown
Additional context
We need a clear statement in the docs of where our alignments come from. I believe they come directly from RefSeq, but we do a little processing. We probably need to spell this out.