Getting data from INSDC databases like NCBI

You have many options for getting data out of the main sequencing databases such as NCBI-GenBank/EBI-ENA/DDBJ (collectively known as the INSDC databases). These options are summarised here among other places.

  1. Manually browsing the relevant sub-database (e.g. gene, protein, assembly, refseq) via the web search interface.

  2. Using the Batch Entrez service to query with a list of accessions.

  3. Using the NCBI command-line tools including:

    • entrez-direct, a general set of composable tools for performing all the Entrez functions (a minimal sketch follows this list)
    • sra-toolkit, a set of tools specifically for downloading raw sequencing data.
  4. Within a script, using the Entrez programming utilities built into Biopython or BioPerl.

  5. Via the new NCBI datasets web interface and tools; this is still being actively developed but is very powerful.
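
For example, entrez-direct lets you compose searches and fetches directly on the command line (a minimal sketch; the query and the accession are only illustrative):

# fetch a single record as FASTA by accession
efetch -db nuccore -id NC_045512.2 -format fasta > NC_045512.2.fasta

# or pipe a search straight into a fetch
esearch -db nuccore -query "Severe acute respiratory syndrome coronavirus 2[Organism]" | \
    efetch -format fasta > sars_cov_2.fasta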

Raw read data

Here are examples of how to download raw read data, first from SRA and then from ENA.

For SRA, use SRA-tools with a text file containing a list of SRA run accessions (acquired from a BioProject/Entrez/paper):
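
The accession list is just one run accession per line, for example (these accessions are purely illustrative):

SRR1234567
SRR1234568
ERR7654321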

# create a folder for your files
mkdir -p sra

# use prefetch in SRA-tools to download all the .sra formatted raw reads
prefetch --progress --verify yes --resume yes --output-directory sra --option-file sra_accessions.txt

# validate that those downloads worked correctly using vdb-validate in SRA-tools 
vdb-validate sra/*

# convert .sra files to .fastq files and compress them to make them smaller
mkdir -p reads
for acc_file in sra/*
    do  
        # make a folder for each accession (optional but nice and tidy!)
        acc=$(basename "$acc_file") 
        mkdir -p reads/"$acc"
        
        # use fasterq-dump from SRA-tools to unpack each .sra file into paired-end .fastq files
        # (pass the prefetch output path rather than the bare accession so the local copy is used)
        fasterq-dump -p -O reads/"$acc" --threads 4 "$acc_file"
        
        # use pigz to compress those .fastq in parallel (pigz = parallel gzip)
        pigz -p 4 reads/"$acc"/*.fastq
done
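
Optionally, a quick sanity check on the converted files (a sketch, assuming fasterq-dump's usual _1/_2 paired-end naming):

# list the compressed fastq files and peek at the first record of one of them
ls -lh reads/*/
zcat "$(find reads -name '*_1.fastq.gz' | head -n 1)" | head -n 4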

Alternatively, for ENA downloads you can just use a simpler wget or curl with the links from the accession's web page, but you then have to check the md5 hashes a bit more manually. I have some accessory scripts here to make this more convenient; they require Aspera (ascp) to be installed to run quickly. (A minimal sketch of the plain curl/wget route, without these scripts, follows the example below.)

# get the md5sum (hashes) for the reads from the ENA website
python get_ena_checksums.py ena_accessions.txt > ena_checksums

mkdir -p reads
cd reads

# download the reads using aspera and this perl script
perl ../sra_download.pl --ascp ena_accessions.txt

# get the md5sum (hashes) for all the downloaded files 
find . -name '*.gz'  -exec md5sum {} \; >> download_checksums
cd ..

# compare the downloaded md5sums (hashes) to the ones on ENA originally to make sure the files were downloaded correctly
python compare_checksums.py ena_checksums download_checksums
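
If you would rather skip these helper scripts, the ENA portal API can report the fastq FTP links and md5 hashes for a run directly, which you can then feed to wget or curl yourself (a minimal sketch; SRR000001 is just an example run accession, and the FTP path layout may vary by accession length):

# ask the ENA portal API for the fastq links and md5s of one run
curl "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRR000001&result=read_run&fields=fastq_ftp,fastq_md5&format=tsv"

# download one of the reported links (prepend the protocol) and check its hash
wget "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR000/SRR000001/SRR000001.fastq.gz"
md5sum SRR000001.fastq.gz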