-
Notifications
You must be signed in to change notification settings - Fork 2
06. Short Read Alignment a.k.a Mapping
We will use the same samples from Practical 3. For your convenience, I placed the quality-trimmed reads in folder /home/curso/data/quality_trimmed
. The objective of this exercise is to map ONE of these samples to the genome of Fusarium graminearum GenBank accession: GCA_900044135.1
Create a folder for today's exercise and move to the new folder:
mkdir mapping
cd mapping
Now, copy the reference genome file to your current location:
cp /home/curso/data/ref_genome/GCA_900044135.1_GZPH1RResV1_genomic.fna ./Fusarium_graminearum.fasta
cp /home/curso/data/ref_genome/GCA_900044135.1_GZPH1RResV1_genomic.gff Fusarium_graminearum.gff
Take a look at the FASTA file:
less -S Fusarium_graminearum.fasta
First we need to index the genome (check bowtie2
's manual for the options):
bowtie2-build Fusarium_graminearum.fasta Fusarium_graminearum.fasta
Check for newly created files with ls
:
Fusarium_graminearum.fasta.1.bt2
Fusarium_graminearum.fasta.2.bt2
Fusarium_graminearum.fasta.3.bt2
Fusarium_graminearum.fasta.4.bt2
Fusarium_graminearum.fasta.rev.1.bt2
Fusarium_graminearum.fasta.rev.2.bt2
Now, let's map the reads to the reference:
fq_folder=/home/curso/data/quality_trimmed
time bowtie2 -p 4 --local -x Fusarium_graminearum.fasta \
-1 $fq_folder/SRX1298579_R1.fastq.gz -2 $fq_folder/SRX1298579_R2.fastq.gz \
-S SRX1298579_bowtie2_local.sam
The output should look like this (why did we use time
in front of the command?):
SRX1298579_bowtie2_local.sam
1982546 reads; of these:
1982546 (100.00%) were paired; of these:
141501 (7.14%) aligned concordantly 0 times
1790110 (90.29%) aligned concordantly exactly 1 time
50935 (2.57%) aligned concordantly >1 times
----
141501 pairs aligned concordantly 0 times; of these:
56339 (39.82%) aligned discordantly 1 time
----
85162 pairs aligned 0 times concordantly or discordantly; of these:
170324 mates make up the pairs; of these:
152438 (89.50%) aligned 0 times
11581 (6.80%) aligned exactly 1 time
6305 (3.70%) aligned >1 times
96.16% overall alignment rate
real 2m49.814s
user 21m40.782s
sys 0m23.227s
SAM is a human-readable format, you can inspect the file with:
less -S SRX1298579_bowtie2_local.sam
But, for visualization, BAM files are faster and more memory-efficient, we need to convert it with samtools
:
samtools view -@ 4 -bS SRX1298579_bowtie2_local.sam > SRX1298579_bowtie2_local.bam
Now, the BAM needs to be sorted:
samtools sort -@ 4 SRX1298579_bowtie2_local.bam -o SRX1298579_bowtie2_local.sorted.bam
Then (important) we can remove the SAM and non-sorted BAM files:
rm SRX1298579_bowtie2_local.sam SRX1298579_bowtie2_local.bam
Finally, the sorted BAM needs to be indexed:
samtools index SRX1298579_bowtie2_local.sorted.bam
After this, you should have these two files in your folder:
SRX1298579_bowtie2_local.sorted.bam
SRX1298579_bowtie2_local.sorted.bam.bai
bbmap.sh
indexes the genome "on the fly" and it doesn't need a separate step, we can directly map the reads in this case:
fq_folder=/home/curso/data/quality_trimmed
time bbmap.sh -Xmx8g threads=4 sam=1.3 trd ref=Fusarium_graminearum.fasta nodisk \
in=$fq_folder/SRX1298579_R#.fastq.gz \
out=SRX1298579_bbmap_global.sam
The output should look like this:
java ...
Version 38.16
Set threads to 4
Retaining first best site only for ambiguous mappings.
Executing dna.FastaToChromArrays2 [Fusarium_graminearum.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=false, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=true]
Set genScaffoldInfo=true
Set genome to 1
Loaded Reference: 0.005 seconds.
Loading index for chunk 1-1, build 1
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index: 6.949 seconds.
Analyzed Index: 4.317 seconds.
Started output stream: 0.521 seconds.
Cleared Memory: 0.214 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 8 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7
------------------ Results ------------------
Genome: 1
Key Length: 13
Max Indel: 16000
Minimum Score Ratio: 0.56
Mapping Mode: normal
Reads Used: 3965092 (388492256 bases)
Mapping: 135.458 seconds.
Reads/sec: 29271.71
kBases/sec: 2867.99
Pairing data: pct pairs num pairs pct bases num bases
mated pairs: 89.8078% 1780481 89.6562% 348307500
bad pairs: 4.1190% 81661 4.0895% 15887448
insert size avg: 190.59
Read 1 data: pct reads num reads pct bases num bases
mapped: 94.9191% 1881815 94.9309% 184003660
unambiguous: 94.7210% 1877887 94.7333% 183620692
ambiguous: 0.1981% 3928 0.1976% 382968
low-Q discards: 0.0000% 0 0.0000% 0
perfect best site: 69.3471% 1374838 69.5464% 134801108
semiperfect site: 69.3594% 1375083 69.5589% 134825471
rescued: 1.0938% 21685
Match Rate: NA NA 97.3925% 182522726
Error Rate: 26.9244% 506668 2.6046% 4881194
Sub Rate: 26.2610% 494184 0.7030% 1317509
Del Rate: 1.3212% 24863 1.8173% 3405773
Ins Rate: 1.5849% 29824 0.0843% 157912
N Rate: 0.0188% 353 0.0029% 5513
Read 2 data: pct reads num reads pct bases num bases
mapped: 94.7927% 1879308 94.8219% 184583201
unambiguous: 94.5938% 1875365 94.6252% 184200344
ambiguous: 0.1989% 3943 0.1967% 382857
low-Q discards: 0.0000% 0 0.0000% 0
perfect best site: 67.6785% 1341757 67.9388% 132251875
semiperfect site: 67.6925% 1342035 67.9530% 132279405
rescued: 0.9496% 18826
Match Rate: NA NA 96.6448% 182851936
Error Rate: 28.5852% 537205 3.3518% 6341644
Sub Rate: 27.8293% 523000 0.7969% 1507665
Del Rate: 1.6644% 31279 2.4401% 4616740
Ins Rate: 2.0030% 37643 0.1148% 217239
N Rate: 0.0221% 415 0.0034% 6361
Total time: 148.264 seconds.
real 2m28.642s
user 17m50.183s
sys 0m14.003s
Then, we can process the SAM file as we did for bowtie2
, please execute all the following steps (changing filenames according to your own sample):
samtools view -@ 4 -bS SRX1298579_bbmap_global.sam > SRX1298579_bbmap_global.bam
samtools sort -@ 4 SRX1298579_bbmap_global.bam -o SRX1298579_bbmap_global.sorted.bam
rm SRX1298579_bbmap_global.bam
samtools index SRX1298579_bbmap_global.sorted.bam
If successful, you should have these files in your folder:
SRX1298579_bbmap_global.sorted.bam
SRX1298579_bbmap_global.sorted.bam.bai
Type the following command, and explain the options we will use in the analysis:
qualimap bamqc
Now, execute the commands on the sorted BAM files:
qualimap bamqc -nt 4 -c -outformat PDF:HTML -bam SRX1298579_bbmap_global.sorted.bam
qualimap bamqc -nt 4 -c -outformat PDF:HTML -bam SRX1298579_bowtie2_local.sorted.bam
Create the index (.fai) file to the reference
samtools faidx Fusarium_graminearum.fasta
Everything is ready to start visualizing our results!
Using WinSCP
, rsync
or scp
, transfer the following files and folders to your computer:
SRX1298579_bbmap_global.sorted_stats
SRX1298579_bowtie2_local.sorted_stats
Fusarium_graminearum.fai
Fusarium_graminearum.fasta
Fusarium_graminearum.gff
SRX1298579_bbmap_global.sorted.bam
SRX1298579_bbmap_global.sorted.bam.bai
SRX1298579_bowtie2_local.sorted.bam
SRX1298579_bowtie2_local.sorted.bam.bai
The folders ending in .sorted_stats
contain the reports from qualimap
, explore them and compare/explain according to what we saw in the lecture.
You also have the required files to visualize the mapping results in IGV
. Download and install IGV
on your computer from: https://software.broadinstitute.org/software/igv/download
Then take a look at this video: https://www.youtube.com/watch?v=E_G8z_2gTYM for general instructions on IGV
. Instead of using the pre-packaged Human genome, you need to load the reference genome used for mapping. From the Genome
menu select Load Genome from File...
and navigate to where you have the file Fusarium_graminearum.fasta
. Then, from the File
menu select Load from File...
and you can select both .bam
files to be seen side by side. Once you have both .bam
files loaded with their respective genome, explore the options and focus on the differences in mapping between bowtie2
and bbmap
, make some screenshots of such differences and try to explain them based on the features of each program and the parameters with which they were run.
Finally, try loading the annotation track Fusarium_graminearum.gff
, it can be done through File/Load from file...
. Zoom-in, explore the .bam
try to locate mutations (SNPs) that occur within the annotated genes.
Useful IGV videos
Navigation Basics
Sequencing Data Basics
View SNPs
- Flag all but one of a duplicate set as duplicates in a BAM file.
Duplicates have a higher probability of being non-independent measurements from the exact same template DNA. Duplicate inserts are marked by the 0x400 bit (1024 flag) in the second column of a SAM record, for each mate of a pair.
This allows downstream processes to exclude duplicates from analyses (most do this by default). Certain duplicates, i.e. PCR and sequencer duplicates, violate assumptions of variant calling and also potentially amplify errors. Removing these, even at the cost of removing fortuitous biological duplicates, allows us to be conservative in calculating the confidence of variants (from GATK forum ).
Basic command:
sambamba markdup -t 4 SRX1298579_bowtie2_local.sorted.bam SRX1298579_local.markdup.bam
Q1. How many reads are marked as duplicates from the sorted BAM files?
Hint 1: View the standard output
Hint 2: Filter reads by their flag bits using the samtools flagstat
option:
samtools view -c -f 1024 SRX1298579_local.markdup.bam
Tags assigned to mainly differentiate samples, but also various technical features that are associated with potential artifacts. With this information is possible to assess the effects of those artifacts during the duplicate marking and base recalibration steps. Meaning of the read group fields
samtools addreplacerg -@ 4 -r ID:SRX1298579 -r LB:L1 -r SM:SRX1298579 -r PL:illumina -o SRX1298579_local.RG.bam SRX1298579_local.markdup.bam
Check the read groups in your bam
file. What is the -H option in samtools
for?
samtools view -H SRX1298579_local.RG.bam | grep "@RG"
You will get the header line of the bam file describing the Read Groups
@RG ID:SRX1298579 LB:lib1 PL:illumina SM:SRX1298580 PU:lane1
The probability that a read is mapped incorrectly is in the function of:
- uniqueness (i.e. not a multi-mapper)
- number of mismatches
- number of indels
- quality of bases
- Analyze the CIGAR string for INDELS (From https://wikis.utexas.edu/display/CoreNGSTools/Filtering+with+SAMTools)
Suppose you want to know how many alignments included insertions or deletions (INDELS) versus the reference.
Looking at the CIGAR string definition, insertions are denoted with the character I
and deletions with the character D
.
Use this information to count the number of INDELS.
Start by just looking at the first few alignment records of the BAM file:
samtools view SRX1298579_local.RG.bam | head
With all the line wrapping, it looks pretty ugly. You can use instead:
samtools view SRX1298579_local.RG.bam | head -50 | column -t | less -S
Now, you can pick out the CIGAR string (column 6 in the SAM format).
samtools view SRX1298579_local.RG.bam | cut -f 6 | head
Next, make sure you're only looking at alignment records that represent mapped reads using the Flagstats.
The -F 0x4
option says to filter records where the 0x4 flag (read unmapped) is 0, resulting in only mapped reads being output.
samtools view -F 0x4 SRX1298579_local.RG.bam | cut -f 6 | head
Now, use grep with the pattern [ID]
to select lines (which are now just the CIGAR strings of the reads) that have Insert or Deletion operators.
The brackets ( [ ]
) denote a character class pattern, which matches any of the characters inside the brackets.
Be sure to ask for Perl regular expressions (-P option) to be sure that this character class syntax is recognized.
samtools view -F 0x4 SRX1298579_local.RG.bam | cut -f 6 | grep -P '[ID]' | head
Now, you're ready to run this against all the alignment records, and count the results: Count reads that mapped with INDELS
samtools view -F 0x4 SRX1298579_local.RG.bam | cut -f 6 | grep -P '[ID]' | wc -l
How many records did you find?
What is that in terms of the rate of INDELS?
For that, we need to count the total number of mapped reads.
Here you just need to use the -c (count only) option to samtools view
.
samtools view -c -F 0x4 SRX1298579_local.RG.bam
Knowing the total amount of mapped reads and how many of them have INDELS you can just divide the numbers to know the rate of INDELS.
- Plot mapping coverage Get depth table with Samtools and plot it with R
samtools depth SRX1298579_local.RG.bam > SRX1298579_local.coverage.txt
Run in R
SRX1298579_cov <- read.table("SRX1298579_local.coverage.txt", header=FALSE, sep="\t", na.strings="NA")
colnames(SRX1298579_cov) <- c("Chr", "locus", "depth")
pdf("SRX1298579_depthScan.pdf")
plot(SRX1298579_cov$locus, SRX1298579_cov$depth, pch=19, cex=0.3, col="red")
dev.off()