gene class should have a method to get its sequence #270
Comments
PyEnsembl doesn't currently use a full genomic FASTA. It's possible to point it at one, but the sequences would differ from the transcript sequences (even at the same coordinates), which I think would be pretty confusing.
If there is a plan to extend pyensembl's compatibility to full genomes, I have a few functions in one of my repositories that are based on pyensembl's way of caching files.
I would be happy to contribute them if such an extension is planned. A couple of things could be challenging, though:
I have also been thinking about adding this functionality now that I'm working on making the OpenVax tools work more cleanly with large/structural variants (and thus have to contend with more events outside of annotated exons). I would love a contribution but maybe we can talk a bit more about the design? A few thoughts:
About the last point, I haven't encountered any tool ideally suited to piggyback on for fetching genome sequences. That's why, in my Python package, I ended up coupling (1) pyensembl's way of downloading and caching a genome (or reusing an already-downloaded one) with (2) optional 2bit indexing for large genomes. It works quite well, but it still needs extensive testing, e.g. with different species. There is also a web API that I know of. It is quite fast, but (1) it does not cover all genomes and (2) a limit on the number/rate of queries is to be expected. In my experience, the challenge lies with large genomes, e.g. human. For smaller ones, the indexed genomes can be downloaded from Ensembl, and then basic tools such as pyfaidx can fetch sequences at a decent rate. For large genomes, though, 2bit is the fastest approach to my knowledge. The required tools can be downloaded from conda/mamba using this command:
Overall, I would vote for a 2bit-based approach, which I have already implemented in my repo. One thing to note, however: in my implementation, an intermediate
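For readers unfamiliar with why an indexed FASTA allows fast lookups, here is a minimal pure-Python sketch of faidx-style random access. It is not the code from the repository mentioned above and not pyfaidx's implementation; the function names `build_index` and `fetch` are hypothetical, and the sketch assumes every sequence line in a record (except possibly the last) has the same width.

```python
def build_index(fasta_path):
    """Scan a FASTA once and record, per sequence, the fields a .fai index stores:
    (length, byte offset of first base, bases per line, bytes per line)."""
    index = {}
    with open(fasta_path, "rb") as fh:
        name = None
        length = offset = line_bases = line_width = 0
        pos = 0  # byte position of the start of the current line
        for raw in fh:
            if raw.startswith(b">"):
                if name is not None:
                    index[name] = (length, offset, line_bases, line_width)
                name = raw[1:].split()[0].decode()
                length = line_bases = line_width = 0
                offset = pos + len(raw)  # first base follows the header line
            else:
                stripped = raw.rstrip(b"\r\n")
                if line_bases == 0:
                    line_bases, line_width = len(stripped), len(raw)
                length += len(stripped)
            pos += len(raw)
        if name is not None:
            index[name] = (length, offset, line_bases, line_width)
    return index


def fetch(fasta_path, index, name, start, end):
    """Return bases start..end (1-based, inclusive) by seeking, not by loading
    the whole chromosome into memory."""
    length, offset, line_bases, line_width = index[name]
    if not (1 <= start <= end <= length):
        raise ValueError("coordinates out of range")

    def byte_pos(i):
        # i is a 0-based base index; skip whole lines, then move within the line
        return offset + (i // line_bases) * line_width + i % line_bases

    first, last = byte_pos(start - 1), byte_pos(end - 1)
    with open(fasta_path, "rb") as fh:
        fh.seek(first)
        chunk = fh.read(last - first + 1)
    # The raw byte range still contains the line breaks it spans
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

The index is built once and can be cached next to the downloaded genome, so each later query costs one seek and one short read. A 2bit file goes a step further by packing each base into two bits, which shrinks the file and removes the line-break arithmetic.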
I see that transcripts have a method to get their sequences, while the gene class does not. It would be nice to be able to get the genomic DNA of a gene as well.
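To make the request concrete, here is a rough sketch of what such a lookup could do once a genome sequence is available. It is not part of pyensembl: `gene_genomic_sequence` is a hypothetical helper, and the `genome` argument is assumed to be a plain dict mapping contig names to sequence strings. The coordinates follow the 1-based, inclusive convention of Ensembl annotations, and genes on the minus strand are reverse-complemented.

```python
# Translation table for complementing DNA, covering upper/lower case and N
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def gene_genomic_sequence(genome, contig, start, end, strand):
    """Return the genomic DNA for a gene locus.

    genome: dict of contig name -> sequence string (an assumption of this sketch)
    start, end: 1-based inclusive coordinates on the forward strand
    strand: "+" or "-"
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    contig_seq = genome[contig]
    if not (1 <= start <= end <= len(contig_seq)):
        raise ValueError("coordinates out of range")
    seq = contig_seq[start - 1:end]  # convert to 0-based, end-exclusive slice
    return reverse_complement(seq) if strand == "-" else seq
```

With a pyensembl gene object, the arguments would presumably come from its `contig`, `start`, `end`, and `strand` attributes. Note that the result includes introns, so it will not match the spliced sequence returned for any of the gene's transcripts.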