Download 1,000 Random Human RefSeq Transcripts #57

Open
mapauley opened this issue Jun 12, 2022 · 2 comments

@mapauley

mapauley commented Jun 12, 2022

Hi all,

I want to download a specific number (e.g., 1,000) of random--it's essential that they be random, although "representative" is perhaps a better word--RNA reference sequence transcripts, preferably from a specific reference build (e.g., GRCh38), although this isn't super important. Any thoughts on how to do this?

By random I mean that there would be nothing to distinguish one batch of 1,000 sequences from another, e.g., in the number of curated (NM_, NR_) versus model (XM_, XR_) sequences. As alluded to above, my goal is to have a completely representative subset of transcript sequences.

Mark

@vkkodali

I am assuming that you are interested in fetching sequences in FASTA format.
I suggest you download a file with all transcript sequences in FASTA format for a specific genome assembly's latest annotation using NCBI Datasets and then use a program like seqkit to extract n random sequences from the file.
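For readers who would rather not install another tool, the "extract n random sequences" step can also be sketched in plain Python. This is only an illustration of uniform sampling from a FASTA file, not how seqkit itself works; the function names are made up for this example.

```python
import random
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records: Iterable[Record], n: int,
                     seed: Optional[int] = None) -> List[Record]:
    """Pick n records uniformly at random in one pass (reservoir sampling).

    The whole file never has to fit in memory; only n records are kept.
    If fewer than n records exist, all of them are returned.
    """
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, record in enumerate(records):
        if i < n:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                sample[j] = record
    return sample
```

To use it, open the downloaded transcript FASTA (the file name depends on your download) and call `reservoir_sample(read_fasta(handle), 1000)`. Passing a `seed` makes a batch reproducible; leaving it out gives a different batch each run.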
Entrez Direct (EDirect) is not the most efficient way to download data in bulk. But if you don't have a choice, one way to achieve something like this would be to: (1) get a list of all accessions using a combination of esearch and efetch -format acc, (2) use a Unix tool like sort -R or shuf to extract n random accessions, then (3) use a combination of epost and efetch to download sequences in FASTA format.
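Step (2) can also be done in Python if shuf or sort -R is not available. The sketch below assumes the accession list from step (1) has already been saved with one accession per line; the function names and the idea of splitting the sample into batches before step (3) are illustrative choices, not something EDirect requires.

```python
import random
from typing import Iterable, Iterator, List, Optional


def sample_accessions(accessions: Iterable[str], n: int,
                      seed: Optional[int] = None) -> List[str]:
    """Return n distinct accessions chosen uniformly at random."""
    # Drop blank lines and duplicates; sort so a given seed is reproducible.
    unique = sorted({acc.strip() for acc in accessions if acc.strip()})
    if n > len(unique):
        raise ValueError(f"asked for {n} accessions, only {len(unique)} available")
    return random.Random(seed).sample(unique, n)


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch can then be written to its own file and passed on to epost and efetch. The batch size is arbitrary here; it is not a documented limit.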

@mapauley
Author

Very helpful. Thank you! I downloaded the GRCh38 patch 14 transcripts, extracted just the accession numbers, randomized the result using shuf, and then downloaded the corresponding records via efetch.
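A quick way to check that a batch is "representative" in the sense described at the top of the thread is to compare the mix of accession prefixes (NM_, NR_, XM_, XR_) in the sample against the full set. The following is a minimal sketch; it assumes each FASTA header begins with the accession (for example `>NM_000014.6 ...`), and the function names are invented for this example.

```python
from collections import Counter
from typing import Dict, Iterable


def accession_from_header(header_line: str) -> str:
    """Return the first whitespace-delimited token of a FASTA header line."""
    parts = header_line.lstrip(">").split()
    return parts[0] if parts else ""


def prefix_counts(accessions: Iterable[str]) -> Counter:
    """Count accessions by RefSeq-style prefix such as 'NM_' or 'XR_'."""
    counts: Counter = Counter()
    for acc in accessions:
        prefix = acc.split("_", 1)[0] + "_" if "_" in acc else "other"
        counts[prefix] += 1
    return counts


def prefix_fractions(accessions: Iterable[str]) -> Dict[str, float]:
    """Express the prefix counts as fractions of the total."""
    counts = prefix_counts(accessions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {prefix: count / total for prefix, count in counts.items()}
```

Running `prefix_fractions` on the full accession list and on each random batch should give similar proportions. Large differences would suggest the sampling step is not uniform.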
