02. Databases and NGS file formats
Explain the type of each ID below and look up the information listed for it in NCBI:
- SRS372877
- SRS1165894
- SRX527715
- SRP039339
- SRR1262828
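As a starting point for the exercise, the three-letter prefix of an accession already tells you what kind of record it is. The sketch below assumes the usual INSDC convention (first letter S, E or D for NCBI, ENA or DDBJ; third letter P, S, X or R for study, sample, experiment or run). The function name and the example accession are hypothetical, not part of any SRA tool.

```shell
# Classify an SRA/ENA/DDBJ accession by its three-letter prefix.
# sra_accession_type is a hypothetical helper; the accession is made up.
sra_accession_type() {
    case "$1" in
        [SED]RP*) echo "study" ;;
        [SED]RS*) echo "sample" ;;
        [SED]RX*) echo "experiment" ;;
        [SED]RR*) echo "run" ;;
        *)        echo "unknown" ;;
    esac
}

sra_accession_type ERX0000001   # prints: experiment
```

The details of each record (organism, platform, number of spots, and so on) still have to be looked up in NCBI.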
If you have an FTP address, you can download the file with wget on the terminal. For example, if you submit the SRA accession name to the ENA database and click on "experiment" or "run", you get several links to files; right-click a link to copy its FTP address to your clipboard. Note: the following example is fake, do not try to use it.
wget ftp://the/ftp/address/that/you/found.txt
Some example commands:
fastq-dump -F SRA_accession # Downloads data as single-end reads (in a single fastq file) conserving original read names
fastq-dump --gzip SRA_accession # Downloads data as a compressed fastq file
fastq-dump -I --split-files SRA_accession # Downloads data as paired-end reads (in 2 files) using standard NCBI-assigned identifiers (read names) and appending read id (.1 or .2)
Note:
- The original sequence/read names are usually in one of the versions of the Illumina sequence identifier format.
- The default read name format generated by fastq-dump is the NCBI-assigned identifier, with the format: @accession.spot some description length=100. In this format 'spot' is just a number that distinguishes the reads. See also the ENA format description.
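To make the default read name format concrete, here is a minimal sketch that splits such a name into its accession and spot number using plain shell parameter expansion. The read name used is a made-up example, and the variable names are illustrative only.

```shell
# Split an NCBI-style read name into accession and spot number.
# The defline below is a made-up example of the default format.
defline='@SRR0000001.7 7 length=100'

name=${defline%% *}        # keep the text before the first space: @SRR0000001.7
name=${name#@}             # drop the leading '@': SRR0000001.7
accession=${name%%.*}      # text before the first dot
spot=${name#*.}            # text after the first dot

echo "accession=$accession spot=$spot"   # prints: accession=SRR0000001 spot=7
```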
You can find links to several different data types on this page. We will go into the details of most of them, with exercises, in the practicals where we need them during the course. In this practical, we will focus on the FASTA and FASTQ formats.
From your home directory, list the files of the directory FileFormats of the course:
ls -l /home/curso/data/FileFormats/
WITHOUT copying any file, check the different fasta files there. A file with the extension .fa is also a fasta file. Read the content of the files (using the less command), and try to understand:
- What is the content of each file and what are the main differences?
- How many sequences does each file contain? (Hint: use the commands wc -l and grep -c "^>")
- Print only the id of each sequence for each file. (Hint: use the command grep "^>")
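The difference between the two hinted commands can be seen on a toy file. The sketch below writes a made-up two-record FASTA file to a temporary location (so nothing in the course directory is touched) and compares the line count with the header count. The file content and variable names are assumptions for illustration.

```shell
# Build a tiny two-record FASTA file in a temporary location (made-up data).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 first toy record
ATGCATGC
ATGC
>seq2 second toy record
GGCCTTAA
EOF

n_lines=$(wc -l < "$fasta")          # counts every line in the file: 5
n_seqs=$(grep -c '^>' "$fasta")      # counts header lines only: 2
echo "lines=$n_lines sequences=$n_seqs"

# Print only the identifiers: drop '>' and anything after the first space.
grep '^>' "$fasta" | sed 's/^>//; s/ .*//'
```

Because a sequence may be wrapped over several lines, the line count alone does not tell you the number of sequences; the header count does.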
In contrast to a simple fasta file, a fastq file encodes the Phred score of each base call along with the call itself. Phred quality scores are by definition log-scaled, easy to understand, and fit a probabilistic framework.
a. You are 100% certain that a base call is wrong. What is the Phred score assigned to that base call?
b. Call_1 has a Phred score twice as high as that of call_2. How many times higher is our "trust" in call_1 compared to call_2 when
- Q of call_1 is 10?
- Q of call_1 is 30?
(Hint: our "trust" is calculated from the probability of the call being wrong or correct)
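The questions above rest on the standard Phred definition, Q = -10 * log10(P), where P is the probability that the call is wrong, so P = 10^(-Q/10). A minimal sketch with awk, using scores different from those in the exercise; the function name is hypothetical.

```shell
# Convert a Phred score Q to the error probability P = 10^(-Q/10).
# phred_to_error_prob is a hypothetical helper name.
phred_to_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_to_error_prob 20   # prints: 0.01
phred_to_error_prob 40   # prints: 0.0001
```

The probability that a call is correct is then 1 - P, which is what the "trust" comparison needs.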
For space reasons, Phred scores are encoded in fastq files as characters of the American Standard Code for Information Interchange (ASCII).
The ASCII code has 128 basic characters, each assigned a number. For example: A = 65, B = 66, ? = 63, and so on.
- Have a look at your ASCII chart (only columns "Dec" and "Char") and write the word CAT in ASCII.
- Write the word cat in ASCII.
- Create a file called "name.txt" in your home directory and write your name in ASCII in it.
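If you want to check your answers against the chart, the shell can report the decimal code of a character: when an argument to printf starts with a quote, the %d conversion prints the code of the character that follows. The sketch below wraps this in a hypothetical helper and uses a different word from the exercise.

```shell
# Print the decimal ASCII code of each character in a word.
# A leading quote in the argument makes printf output the character's code.
# ascii_codes is a hypothetical helper name.
ascii_codes() {
    word=$1
    out=""
    while [ -n "$word" ]; do
        rest=${word#?}              # the word without its first character
        char=${word%"$rest"}        # the first character
        out="$out$(printf '%d' "'$char") "
        word=$rest
    done
    echo "${out% }"
}

ascii_codes DOG   # prints: 68 79 71
```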
Convert the following Phred scores to Sanger encoding. (Hint: Sanger encoding is Phred+33)
>read1
ATGCTGGT
10 20 30 40 15 24 32 39
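The conversion can be sketched in the shell: add 33 to each score and print the character with that ASCII code. The example below uses scores that are not in the exercise, and the function name is hypothetical.

```shell
# Encode Phred scores (given as arguments) as Sanger (Phred+33) characters.
# phred_to_sanger is a hypothetical helper name.
phred_to_sanger() {
    for q in "$@"; do
        # Build an octal escape for the character with code q + 33
        # and let printf's %b conversion turn it into that character.
        printf '%b' "\\0$(printf '%03o' $((q + 33)))"
    done
    printf '\n'
}

phred_to_sanger 0 2 41   # prints: !#J
```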
In the directory /home/curso/data/FileFormats, you will find two fastq files named sample_1.fq and sample_2.fq (remember you don't need to copy the files).
Can you figure out:
- How many reads are there in each file?
- What is the name of the machine?
- Which file contains the first read in pair and which one the second?
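One possible approach is sketched below on a made-up two-read file. It assumes four-line FASTQ records and read names in the newer Illumina style (instrument:run:flowcell:lane:tile:x:y, then a space and read:filter:control:index); the course files may use an older style in which the pair member is given as /1 or /2 at the end of the name. The instrument and flowcell names are invented.

```shell
# Build a tiny FASTQ file in a temporary location (made-up reads and names).
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@INSTR01:1:FLOWCELL:1:1101:1000:2000 1:N:0:1
ACGT
+
IIII
@INSTR01:1:FLOWCELL:1:1101:1001:2001 1:N:0:1
TTGA
+
II#I
EOF

# Each FASTQ record spans four lines, so reads = lines / 4.
n_reads=$(( $(wc -l < "$fastq") / 4 ))
echo "reads=$n_reads"                             # prints: reads=2

# Instrument name: first colon-separated field of the first header, minus '@'.
head -n 1 "$fastq" | cut -d: -f1 | sed 's/^@//'   # prints: INSTR01

# Member of the pair: first field after the space (1 or 2).
head -n 1 "$fastq" | cut -d' ' -f2 | cut -d: -f1  # prints: 1
```

Counting lines that start with '@' is less reliable than dividing the line count by four, because '@' is also a valid quality character and may begin a quality line.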
Sequence Alignment/Map Format Specification
- SAM format - The official SAM/BAM specification
- Decoding SAM flags - Bitwise flags
- BED File Format - Definition and supported options
Work in the directory /home/curso/data/FileFormats, as you did for the exercises above.
Check the different files and, guessing from their names, inspect the different formats, keeping in mind the flowchart of the NGS data processing workflow that we discussed during the lecture (page 25, File formats presentation). How many entries/entities does each file contain? What can you say about the content of each file?
(Hint: if you want to look into files with many columns, or simply long lines, the -S flag of the less command makes it a bit easier, as it shows each line of the file on a single screen line, without wrapping the text.)
less -S
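For line-oriented formats, counting entries usually means counting lines while skipping any header. As a sketch for SAM, where header lines start with '@' and each remaining line is one alignment record, the example below builds a made-up file with two header lines and two records. All names and values in it are invented.

```shell
# Build a tiny SAM file in a temporary location (made-up header and reads).
sam=$(mktemp)
{
    printf '@HD\tVN:1.6\tSO:unsorted\n'
    printf '@SQ\tSN:chr1\tLN:1000\n'
    printf 'read1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII\n'
} > "$sam"

n_header=$(grep -c '^@' "$sam")     # header lines: 2
n_records=$(grep -vc '^@' "$sam")   # alignment records: 2
echo "header=$n_header records=$n_records"
```

Other formats need their own rule: a BED file, for example, has one feature per line but may begin with optional track or browser lines.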