02. Databases and NGS file formats

Gustavo A. Silva-Arias edited this page Aug 24, 2021 · 3 revisions

Gathering NGS data

Data Repositories

SRA Run Selector

Problem 1

Explain the type for each id, get the information listed in NCBI

  • SRS372877
  • SRS1165894
  • SRX527715
  • SRP039339
  • SRR1262828

Direct downloading data using "wget"

If you have an FTP address, you can download the file directly with wget. For example, if you submit an SRA accession to the ENA database and click on "experiment" or "run", you get several links to files. Right-click a link to copy its FTP address to your clipboard, then pass it to wget in the terminal. Note: the following example is fake, do not try to use it.

wget ftp://the/ftp/address/that/you/found.txt

Downloading data using SRA Toolkit

Some example commands:

fastq-dump -F SRA_accession # Downloads data as single-end reads (in a single fastq file), preserving the original read names
fastq-dump --gzip SRA_accession # Downloads data as a compressed fastq file
fastq-dump -I --split-files SRA_accession # Downloads data as paired-end reads (in 2 files) using standard NCBI-assigned identifiers (read names) and appending read id (.1 or .2)

Note:

  • The original sequence/read names usually are in one of the versions of the Illumina sequence identifier format.
  • The default read name format generated by fastq-dump is the NCBI-assigned identifiers with the format: @accession.spot some description length100. In this format 'spot' is just a number that distinguishes the reads. See also the ENA format description.
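As a minimal sketch of how the NCBI-assigned read name breaks down, the snippet below splits one header line into its accession and spot number. The header line is a made-up illustration and `split_defline` is a hypothetical helper, not part of the SRA Toolkit.

```shell
# Illustrative header line in the NCBI-assigned style (made up for this sketch).
defline='@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36'

# split_defline: hypothetical helper that separates accession and spot.
split_defline() {
    id=${1%% *}    # keep everything before the first space: @SRR001666.1
    id=${id#@}     # drop the leading '@': SRR001666.1
    printf 'accession=%s spot=%s\n' "${id%%.*}" "${id#*.}"
}

split_defline "$defline"   # prints: accession=SRR001666 spot=1
```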

Data File Types

You can find links to several different data types on this page. We will cover most of them in detail, with exercises, in the practicals where we need them during the course. In this practical, we will focus on the FASTA and FASTQ formats.

FASTA

Fasta Format

Problem 2

From your home directory, list the files of the directory FileFormats of the course:

ls -l /home/curso/data/FileFormats/

WITHOUT copying any file, check the different fasta files there. Files with the extension .fa are also fasta files. Read the content of the files (using the less command), and try to understand:

  • What is the content of each file and what are the main differences?
  • How many sequences does each file contain? (Hint: use the commands wc -l and grep -c "^>")
  • Print only the id of each sequence for each file. (Hint: use the command grep "^>")
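The hints above can be tried first on a small toy file. The following is a sketch on made-up sequences written to a temporary file, not on the course data. The `sed` step, which trims the header down to the id, is an addition that the hints do not mention.

```shell
# Build a small toy FASTA file (made-up sequences) in a temporary location.
toy_fasta=$(mktemp)
cat > "$toy_fasta" <<'EOF'
>seq1 toy sequence one
ATGCGTACGTTAGC
GGCTA
>seq2 toy sequence two
TTGACCA
>seq3
ACGT
EOF

# Total number of lines: larger than the number of sequences,
# because a sequence may be wrapped over several lines.
wc -l < "$toy_fasta"

# Number of sequences: count the header lines, which start with '>'.
grep -c "^>" "$toy_fasta"

# Ids only: drop the leading '>' and everything after the first whitespace.
grep "^>" "$toy_fasta" | sed 's/^>//; s/[[:space:]].*$//'
```

On this toy file the line count is 7 while the sequence count is 3, which is why `wc -l` alone is not enough to count sequences.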

FASTQ

Unlike a simple fasta file, a fastq file stores a Phred quality score for each base call alongside the call itself. Phred scores are log-scaled by definition, easy to interpret, and fit naturally into a probabilistic framework.
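The log scale refers to the standard Phred definition Q = -10 · log10(P), where P is the probability that the base call is wrong, so P = 10^(-Q/10). A minimal sketch of that conversion, assuming awk is available, follows. `phred_to_error_prob` is a hypothetical helper name.

```shell
# phred_to_error_prob: hypothetical helper that converts a Phred score Q
# into the probability that the base call is wrong, P = 10^(-Q/10).
phred_to_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_to_error_prob 20   # prints 0.01   (1 wrong call in 100)
phred_to_error_prob 40   # prints 0.0001 (1 wrong call in 10000)
```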

Problem 3

a. You are 100% certain that a base call is wrong. What is the Phred score assigned to that base call?

b. Call_1 has a Phred score twice as high as that of call_2. How many times higher is our "trust" in call_1 compared to call_2 when

  • Q of call_1 is 10?
  • Q of call_1 is 30? (Hint: our "trust" is calculated by the probability of the call to be wrong or correct)

Problem 4

To save space, Phred scores are encoded in fastq files as characters of the American Standard Code for Information Interchange (ASCII).

The ASCII code has 128 basic characters, each assigned a number. For example: A = 65, B = 66, ? = 63, and so on.

  • Have a look at your ASCII chart (only columns "Dec" and "Char") and write the word CAT in ASCII.
  • Write the word cat in ASCII.
  • Create a file called "name.txt" in your home directory and write your name in ASCII in it.
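If you want to check your chart lookups on the terminal, the shell's printf can convert in both directions. This is a sketch using the three examples given above. `char_to_ascii` and `ascii_to_char` are hypothetical helper names.

```shell
# char_to_ascii: print the decimal ASCII code of a single character.
# A leading quote in the argument makes printf report the character's code.
char_to_ascii() {
    printf '%d\n' "'$1"
}

# ascii_to_char: print the character for a decimal ASCII code,
# by going through its octal representation.
ascii_to_char() {
    printf '%b\n' "\\0$(printf '%03o' "$1")"
}

char_to_ascii A     # prints 65
char_to_ascii '?'   # prints 63
ascii_to_char 66    # prints B
```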

Problem 5

Convert the following phred scores to sanger encoding. (Hint: Sanger encoding is Phred +33)

>read1
ATGCTGGT
10 20 30 40 15 24 32 39
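To check a manual conversion afterwards, the Phred +33 rule from the hint can be sketched in the shell: add 33 to each score and print the ASCII character with that code. `phred_to_sanger` is a hypothetical helper name, and the scores in the usage line are arbitrary and differ from those in the exercise.

```shell
# phred_to_sanger: hypothetical helper that encodes a list of Phred scores
# as one Sanger (Phred +33) quality string.
phred_to_sanger() {
    for q in "$@"; do
        # ASCII code = score + 33, printed through its octal representation.
        printf '%b' "\\0$(printf '%03o' $((q + 33)))"
    done
    printf '\n'
}

phred_to_sanger 0 2 41   # prints !#J
```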

Problem 6

In the directory /home/curso/data/FileFormats, you will find two fastq files named sample_1.fq and sample_2.fq (remember you don't need to copy the files).

Can you figure out:

  • How many reads are there in each file?
  • What is the name of the machine?
  • Which file contains the first read in pair and which one the second?
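One way to approach the first question is sketched below on a made-up toy file, not on the course data. It assumes each record takes exactly four lines (header, sequence, '+' separator, qualities), which holds for typical Illumina fastq files but is not guaranteed by the format.

```shell
# Build a toy FASTQ file with two made-up reads in a temporary location.
toy_fastq=$(mktemp)
cat > "$toy_fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIHHHH
@read2
TTGACCAA
+
@IIIHHGG
EOF

# Number of reads, assuming four lines per record.
echo $(( $(wc -l < "$toy_fastq") / 4 ))   # prints 2

# Counting lines that start with '@' overcounts here, because '@' is also
# a valid quality character (see the last quality line of the toy file).
grep -c "^@" "$toy_fastq"                 # prints 3

# Header lines only: every fourth line, starting with the first.
awk 'NR % 4 == 1' "$toy_fastq"
```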

Other file formats: .sam, .bed, .vcf, .gff

Data File Formats

Sequence Alignment/Map Format Specification

SAM format - The official SAM/BAM specification

Decoding SAM flags - Bitwise flags
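The SAM FLAG field is a sum of powers of two, one per property, so a flag can be decoded with a bitwise AND against each bit. The sketch below does this in the shell. The bit values follow the SAM specification, while `decode_sam_flag` and the short labels are informal names chosen for this example.

```shell
# decode_sam_flag: hypothetical helper that lists the properties set in a
# SAM flag. Each entry is "bit value:informal label".
decode_sam_flag() {
    flag=$1
    out=""
    for pair in \
        "1:paired" "2:proper_pair" "4:unmapped" "8:mate_unmapped" \
        "16:reverse" "32:mate_reverse" "64:first_in_pair" "128:second_in_pair" \
        "256:secondary" "512:qc_fail" "1024:duplicate" "2048:supplementary"
    do
        bit=${pair%%:*}
        name=${pair#*:}
        # The bit is set if the bitwise AND is non-zero.
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    printf '%s\n' "$out"
}

decode_sam_flag 99   # prints paired,proper_pair,mate_reverse,first_in_pair
```

Here 99 = 1 + 2 + 32 + 64, which is why exactly those four properties are reported.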

Variant Call Format - VCF

GFF/GTF File Format

BED File Format - Definition and supported options

Problem 7

Work in the directory /home/curso/data/FileFormats, as you did for the exercises above. Check the different files and, guessing from their names, inspect the different formats, keeping in mind the flowchart of the NGS data processing workflow that we discussed during the lecture (page 25, File formats presentation). How many entries/entities does each file contain? What can you say about the content of each file?

(Hint: if you want to look into files with many columns, or simply long lines, the flag -S to the less command will make it a bit easier, as it shows each line of the file on a single screen line instead of wrapping the text.)

less -S
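Several of these text formats (for example VCF and GFF) start with header or comment lines beginning with '#', so a plain line count overestimates the number of entries. The sketch below shows one way to skip them, using a made-up two-variant VCF written to a temporary file. The file content is purely illustrative.

```shell
# Build a toy tab-separated VCF file (made-up variants) in a temporary location.
toy_vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n'
    printf 'chr1\t250\t.\tC\tT\t99\tPASS\t.\n'
} > "$toy_vcf"

# Number of entries: count the lines that do NOT start with '#'.
grep -vc "^#" "$toy_vcf"   # prints 2

# Show only a few columns (chromosome, position, REF, ALT) of each entry.
grep -v "^#" "$toy_vcf" | cut -f 1,2,4,5
```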