Download 1,000 Random Human RefSeq Transcripts #57

mapauley · 2022-06-12T14:24:46Z

Hi all,

I want to download a specific number (e.g., 1,000) of random RNA reference sequence (RefSeq) transcripts. It's essential that they be random, although "representative" is perhaps a better word. Preferably they would come from a specific reference build (e.g., GRCh38), although this isn't super important. Any thoughts on how to do this?

By random I mean that there would be nothing to distinguish one batch of 1,000 sequences from another, e.g., in the number of curated (NM_, NR_) versus model (XM_, XR_) sequences. As alluded to above, my goal is to have a completely representative subset of transcript sequences.
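One way to check whether a batch is "representative" in this sense is to tally accession prefixes in the batch and compare the proportions with those of the full set. The following is only a minimal sketch: the helper names `prefix_counts` and `prefix_fractions` are hypothetical, and it assumes each accession is a plain string such as `NM_...` or `XR_...`.

```python
from collections import Counter


def prefix_counts(accessions):
    """Count accessions by prefix (text up to and including the first underscore)."""
    counts = Counter()
    for acc in accessions:
        prefix, sep, _ = acc.partition("_")
        # Accessions without an underscore are grouped under "other".
        counts[prefix + sep if sep else "other"] += 1
    return counts


def prefix_fractions(accessions):
    """Return the fraction of accessions carrying each prefix."""
    counts = prefix_counts(accessions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {prefix: n / total for prefix, n in counts.items()}
```

Running `prefix_fractions` on both the full accession list and a 1,000-sequence batch gives two dictionaries that can be compared side by side. For a truly random batch, the NM_/NR_/XM_/XR_ fractions should be close to those of the full set, up to sampling noise.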

Mark

vkkodali · 2022-06-12T16:27:11Z

I am assuming that you are interested in fetching sequences in FASTA format.
I suggest you download a file with all transcript sequences in FASTA format for a specific genome assembly's latest annotation using NCBI Datasets and then use a program like seqkit to extract n random sequences from the file.
Entrez Direct is not the most efficient way to download data in bulk. But if you don't have a choice, one way to achieve something like this would be to: (1) get a list of all accessions using a combination of esearch and efetch -format acc, (2) use a Unix tool like sort -R or shuf to extract n random accessions, then (3) use a combination of epost and efetch to download sequences in FASTA format.
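For readers who prefer to do the sampling step in a script rather than with seqkit, the idea of drawing n random records from a large FASTA file can be sketched in Python using reservoir sampling, which reads the file once and never holds more than n records in memory. This is an illustration under stated assumptions, not what seqkit does internally: the function names and the file name in the usage note are hypothetical, and the input is assumed to be ordinary FASTA text.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records, n, seed=None):
    """Return up to n records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            sample.append(record)
        else:
            # Replace a reservoir slot with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                sample[j] = record
    return sample
```

A possible usage, assuming the downloaded transcripts were saved as `rna.fna` (a placeholder name): `with open("rna.fna") as fh: batch = reservoir_sample(read_fasta(fh), 1000)`. Passing a fixed `seed` makes a batch reproducible; leaving it as `None` gives a different batch on each run.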

mapauley · 2022-06-13T12:34:33Z

Very helpful. Thank you! I downloaded the GRCh38 Patch 14 transcripts, extracted just the accession numbers, randomized the result using shuf, and then downloaded the corresponding records via efetch.
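The accession-extraction and shuffling steps of that workflow can be approximated in Python. This is a rough sketch rather than the commands actually used: it assumes the accession is the first whitespace-delimited token on each FASTA header line, and the function names are hypothetical.

```python
import random


def accessions_from_fasta(lines):
    """Yield the first token of each FASTA header line (assumed to be the accession)."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def pick_random(accessions, n, seed=None):
    """Pick up to n distinct accessions uniformly at random, like shuf followed by head."""
    # Deduplicate and sort first so that a given seed always yields the same pick.
    pool = sorted(set(accessions))
    rng = random.Random(seed)
    return rng.sample(pool, min(n, len(pool)))
```

The resulting list can be written one accession per line to a text file, which is the same shape of input the shuf-based approach produces for the download step.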
