I want to download a specific number (e.g., 1,000) of random--it's essential that they be random, although "representative" is perhaps a better word--RNA reference sequence transcripts, preferably from a specific reference build (e.g., GRCh38), although this isn't super important. Any thoughts on how to do this?
By random I mean that there would be nothing to distinguish one batch of 1,000 sequences from another, e.g., in the number of curated (NM_, NR_) versus model (XM_, XR_) sequences. As alluded to above, my goal is to have a completely representative subset of transcript sequences.
Mark
I am assuming that you are interested in fetching sequences in FASTA format.
I suggest you download a file with all transcript sequences in FASTA format for a specific genome assembly's latest annotation using NCBI Datasets and then use a program like seqkit to extract n random sequences from the file.
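If seqkit is not available, the sampling step can also be done with a few lines of Python. This is only a minimal sketch, assuming the transcripts have already been downloaded as a plain-text FASTA file. The function names `read_fasta` and `sample_records` are made up for this illustration and are not part of any NCBI tool. It uses reservoir sampling, so the whole file never has to fit in memory.

```python
import random
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file."""
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def sample_records(
    handle: TextIO, n: int, seed: Optional[int] = None
) -> List[Tuple[str, str]]:
    """Pick n records uniformly at random in a single pass (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir: List[Tuple[str, str]] = []
    for i, record in enumerate(read_fasta(handle)):
        if i < n:
            # Fill the reservoir with the first n records.
            reservoir.append(record)
        else:
            # Keep the new record with probability n / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < n:
                reservoir[j] = record
    return reservoir
```

To use it, open the downloaded FASTA file and call `sample_records(handle, 1000)`. Passing a fixed `seed` makes a batch reproducible. If the file holds fewer than `n` records, all of them are returned.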
Entrez Direct (EDirect) is not the most efficient way to download data in bulk. If you have no other choice, though, one way to do this would be to: (1) get a list of all accessions using a combination of esearch and efetch -format acc, (2) use a Unix tool like sort -R or shuf to extract n random accessions, then (3) use a combination of epost and efetch to download the sequences in FASTA format.
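Step (2) can also be sketched in Python instead of sort -R or shuf. This assumes the accessions from step (1) are already loaded into a list. `sample_accessions` and `prefix_counts` are hypothetical helper names invented for this example. The prefix tally is there only to check how a batch splits between curated (NM_, NR_) and model (XM_, XR_) records.

```python
import random
from collections import Counter
from typing import Iterable, List, Optional


def sample_accessions(
    accessions: Iterable[str], n: int, seed: Optional[int] = None
) -> List[str]:
    """Draw n distinct accessions uniformly at random, without replacement."""
    # Deduplicate and sort so that a given seed always yields the same batch.
    unique = sorted(set(accessions))
    if n > len(unique):
        raise ValueError(f"asked for {n} accessions but only {len(unique)} are available")
    return random.Random(seed).sample(unique, n)


def prefix_counts(accessions: Iterable[str]) -> Counter:
    """Count accessions by RefSeq-style prefix such as NM_, NR_, XM_, XR_."""
    # Anything without an underscore in the third position is counted as "other".
    return Counter(acc[:3] if acc[2:3] == "_" else "other" for acc in accessions)
```

Comparing `prefix_counts(batch)` with `prefix_counts(all_accessions)` shows how close a batch is to the full set. A simple random sample matches the overall proportions only on average, so individual batches of 1,000 will differ slightly from one another.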
Very helpful. Thank you! I downloaded the GRCh38 Patch 14 transcripts, extracted just the accession numbers, randomized the result using shuf, and then downloaded the corresponding records via efetch.
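For anyone repeating this, the accession-extraction step might look roughly like the sketch below. It assumes each FASTA header starts with the accession as its first whitespace-delimited token (for example ">NM_000014.6 Homo sapiens ..."), which should be checked against the actual file. `accessions_from_fasta` is just an illustrative name.

```python
from typing import Iterable, List


def accessions_from_fasta(lines: Iterable[str]) -> List[str]:
    """Collect the first token of every FASTA header line."""
    accessions = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Skip headers that are empty after the ">" marker.
            if fields:
                accessions.append(fields[0])
    return accessions
```

The function accepts any iterable of lines, so an open file handle can be passed in directly.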