Update if possible the alignments for RYBP #656

Open
Peter-J-Freeman opened this issue Nov 20, 2024 · 9 comments
@Peter-J-Freeman
Collaborator

Describe the bug
At the moment we do not have any good GRCh37 transcript alignments, either RefSeq or Ensembl.

For GRCh38, we have a RefSeq transcript and an Ensembl transcript, but the alignment is very different from other sources.

To Reproduce
I have a table I can walk through

Expected behavior
Unknown

Additional context
We need a clear statement in the docs of where our alignments come from. I believe they are direct from RefSeq, but we do a little processing. We probably need to spell this out.

@John-F-Wagstaff
Collaborator

Hello Pete. This is concerning: NM_012234.7 looked fine to me for GRCh38, and should be up to date. Which specific other sources? You say that you have a table that you have gone through. Do you have the differences between us and the other sources, along with the source that showed each difference? As mentioned before, there is, or was, a UTA issue with the .6 version of this transcript, so which tool is giving different results is important. It is also worth noting that there are internal length differences between .7 and .6 for this transcript, so this might also be an issue if comparing from data using .6 while using a tool that improperly handles old data.

We do have a set of newer GRCh37 data, published just last month, that should help when I get time to load it, but only if the NM_012234.7 data is fine. If NM_012234.7 is broken too, then I need to know how, and possibly re-load the data for the last few releases too!

Additional context
For the alignment sources, if you want a statement for the docs (that won't need updating for each new release), then I am comfortable with the following statement:

The VVTA is based on the UTA, from Biocommons, with some structural improvements that we hope to (eventually) have the resources to share back. Each new release adds new data to the VVTA, loaded from the published RefSeq and Ensembl alignments, taken from the latest fixed release flat-files at the time of production. Some re-alignment may have been applied to the original pre-VVTA UTA data, due to its internal processes, but all VVTA-loaded alignment data should exactly match the published datasets at the time of release. Gene data is updated with a dataset taken from the latest monthly release of the HGNC gene dataset at the same time.

All new RefSeq alignments since the branch of VVTA from the UTA have been loaded from RefSeq with the alignment left unchanged from the official release. We did have to convert alignments to the extended CIGAR format (though this has now changed), but the data in that format is redundant, hence the arguments in the community over whether to use it. As such, the conversion does not change the meaning of the alignment, and can be considered a loading step, not an edit. We have a number of quality filters (no transcripts without an HGNC ID, for example) but no other transformations are made to the alignments.
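
To illustrate why a conversion to extended CIGAR can be treated as a loading step rather than an edit, here is a minimal Python sketch. It is not the VVTA loader code: the function names, the reference/query orientation, and the example sequences are all assumptions made for illustration. It expands `M` operations into `=`/`X` by comparing the two sequences, then collapses them back, showing that the original alignment is recovered.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def _push(ops, n, op):
    """Append an operation, merging it into the previous run if the code matches."""
    if ops and ops[-1][1] == op:
        ops[-1][0] += n
    else:
        ops.append([n, op])


def expand_to_extended(cigar, ref, query):
    """Rewrite M runs as =/X runs (hypothetical helper, not VVTA code)."""
    ops, r, q = [], 0, 0
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "M":
            # Compare base by base to decide between match (=) and mismatch (X).
            for k in range(n):
                _push(ops, 1, "=" if ref[r + k] == query[q + k] else "X")
            r, q = r + n, q + n
        elif op in "=X":
            _push(ops, n, op)
            r, q = r + n, q + n
        elif op in "IS":   # consumes query only
            _push(ops, n, op)
            q += n
        elif op in "DN":   # consumes reference only
            _push(ops, n, op)
            r += n
        else:              # H and P consume neither sequence
            _push(ops, n, op)
    return "".join(f"{n}{op}" for n, op in ops)


def collapse_extended_cigar(cigar):
    """Fold = and X back into M, merging adjacent runs."""
    ops = []
    for length, op in _CIGAR_OP.findall(cigar):
        _push(ops, int(length), "M" if op in "=X" else op)
    return "".join(f"{n}{op}" for n, op in ops)
```

With made-up sequences `ref = "ACGTACGT"` and `query = "ACGAACT"`, `expand_to_extended("6M1D1M", ref, query)` gives `"3=1X2=1D1="`, and collapsing that string returns `"6M1D1M"`. The `=`/`X` detail can always be rebuilt from the sequences, which is the sense in which it is redundant.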

Despite adding code to handle the reported differences between published Ensembl transcripts and the genome for some older data, none of the datasets we load have this issue. I suspect that these user reports were caused by the "Same ID different sequence for GRCh38 vs GRCh37" issue, which we handle by suffixing the sequences. As such, we do not change any of these alignments on load either.

We add data incrementally, hence the "each new release adds" wording, which means that unless a transcript alignment gets superseded by new alignments for the same transcript version, it won't get removed or changed. Even the old UTA data may be preserved, which has been a problem due to the issues with the "every alignment has the same exon position set" assertion built into the UTA, which we fixed. I attempted to purge every alignment broken by this issue, but without dumping all the UTA data, 100% safety from this issue is impossible, hence the issues with NM_012234.6. The UTA also rebuilds its alignments from the spans, so any data loaded before the UTA->VVTA branch point is also subject to this.
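
The incremental behaviour described above can be sketched roughly as follows. This is an illustration only, not the VVTA schema or loader: the `Alignment` fields, the `load_release` function, and the choice to key alignments by (versioned transcript accession, genomic accession) are assumptions, and the identifiers in the usage note are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    tx_ac: str    # versioned transcript accession (placeholder values below)
    alt_ac: str   # genomic reference accession the transcript is aligned to
    cigar: str    # alignment string as published
    release: str  # release in which this alignment was loaded


def load_release(store, incoming):
    """Add a release to the store without touching unrelated alignments.

    Assumption: an alignment is only superseded when the new release carries
    an alignment for the same transcript version on the same reference.
    Everything else already in the store is left exactly as it was.
    """
    for aln in incoming:
        store[(aln.tx_ac, aln.alt_ac)] = aln
    return store
```

For example, loading `TX_A.1` and `TX_A.2` in release `r1` and then only `TX_A.2` in release `r2` leaves `TX_A.1` with its `r1` alignment while `TX_A.2` takes the `r2` one. Under this model, an old alignment that was already wrong when it was loaded stays in place until something supersedes it or it is purged explicitly.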

@leicray
Contributor

leicray commented Nov 21, 2024

AFAIK, Ensembl transcript ENST00000477973.2 maps onto GRCh37 and GRCh38 without gaps. However, the current version is .4 rather than .2. In addition, ENST00000477973.2 comprises just 4 exons, whereas ENST00000477973.4 comprises 5 exons.

@Peter-J-Freeman
Collaborator Author

@leicray please try to remain seated, but for this gene I think having at least an Ensembl transcript for each genome build is the best option.

One problem: the only protein-coding transcript in Ensembl, ENST00000477973.5, is CDS incomplete. But we may have to live with it unless we find an older complete transcript.

We need to have an Ensembl and a RefSeq transcript for both genomes, ideally. Just hope we can do something, but the issue seems to be that the CDS cannot fully align to the genome in the places most tools expect the transcripts to be placed.

@leicray
Contributor

leicray commented Nov 21, 2024

That's a bit odd. The HGNC page for RYBP provides a link to ENST00000477973.4 but not to ENST00000477973.5. I wonder why that might be.

As an aside, how does one navigate on the Ensembl website from one transcript version to another version, or search the site for a given transcript version? By comparison, the NCBI site is very easy.

@Peter-J-Freeman
Collaborator Author

The only way to look at versions through time in Ensembl is to build your own Ensembl and load the different releases. They only keep live the latest GRCh37 and latest GRCh38 release.

@leicray
Contributor

leicray commented Nov 21, 2024

That's not very helpful, is it? Sorry, I forgot that we were discussing Ensembl...

@Peter-J-Freeman
Collaborator Author

Wait a minute. There are 2 entries for the RYBP gene.

https://www.ensembl.org/Homo_sapiens/Transcript/Idhistory?db=core;g=ENSG00000281766;r=HG126_PATCH:49171-122044;t=ENST00000643872. This is the gene on scaffold HG126_PATCH: 49,171-122,044 (reverse strand), and it has the MANE Select.

https://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000163602;r=3:72371825-72446621;t=ENST00000477973. This is the GRCh38 chromosomal position, Chromosome 3: 72,371,825-72,446,621.

Which confirms that GRCh38 is totally stuffed for this gene.

@leicray
Contributor

leicray commented Nov 21, 2024

Absolutely bang on the mark!

For gene version ENSG00000281766 the most up to date transcript is ENST00000643872.4. However, for gene version ENSG00000163602 the most up to date transcript is ENST00000477973.5.

To sort this out, it might also be necessary to look at Ensembl's internal IDs for the individual transcripts.
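
When comparing versioned Ensembl identifiers like the ones above, a small parser can help keep the stable ID and the version apart. The sketch below is illustrative only: the function name is made up, and it assumes human gene and transcript IDs of the form `ENSG` or `ENST` followed by 11 digits and an optional `.version` suffix.

```python
import re

# Assumed shape of a human Ensembl gene/transcript ID: ENSG or ENST,
# 11 digits, then an optional ".<version>" suffix.
_ENSEMBL_ID = re.compile(r"^(ENS[GT]\d{11})(?:\.(\d+))?$")


def parse_ensembl_id(identifier):
    """Split an identifier into (stable_id, version); version may be None."""
    match = _ENSEMBL_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognised Ensembl gene/transcript ID: {identifier!r}")
    stable_id, version = match.groups()
    return stable_id, (int(version) if version is not None else None)
```

For example, `parse_ensembl_id("ENST00000477973.5")` returns `("ENST00000477973", 5)`, and an identifier with the wrong number of digits raises `ValueError`, which catches copy-and-paste slips before they reach a lookup.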

@Peter-J-Freeman
Collaborator Author

We have a workaround that is taking shape. See #657. We could do with more thought on this gene, though.


3 participants