Skip to content

06. Short Read Alignment a.k.a Mapping

Gustavo A. Silva-Arias edited this page Aug 27, 2021 · 6 revisions

We will use the same samples from Practical 3. For your convenience, I placed the quality-trimmed reads in folder /home/curso/data/quality_trimmed. The objective of this exercise is to map ONE of these samples to the genome of Fusarium graminearum GenBank accession: GCA_900044135.1

1. Prepare a folder with all the necessary files:

Create a folder for today's exercise and move to the new folder:

mkdir mapping
cd mapping

Now, copy the reference genome file to your current location:

cp /home/curso/data/ref_genome/GCA_900044135.1_GZPH1RResV1_genomic.fna ./Fusarium_graminearum.fasta
cp /home/curso/data/ref_genome/GCA_900044135.1_GZPH1RResV1_genomic.gff Fusarium_graminearum.gff

Take a look at the FASTA file:

less -S Fusarium_graminearum.fasta

2. Local mapping with bowtie2:

First we need to index the genome (check bowtie2's manual for the options):

bowtie2-build Fusarium_graminearum.fasta Fusarium_graminearum.fasta

Check for newly created files with ls:

Fusarium_graminearum.fasta.1.bt2
Fusarium_graminearum.fasta.2.bt2
Fusarium_graminearum.fasta.3.bt2
Fusarium_graminearum.fasta.4.bt2
Fusarium_graminearum.fasta.rev.1.bt2
Fusarium_graminearum.fasta.rev.2.bt2

Now, let's map the reads to the reference:

fq_folder=/home/curso/data/quality_trimmed

time bowtie2 -p 4 --local -x Fusarium_graminearum.fasta \
-1 $fq_folder/SRX1298579_R1.fastq.gz -2 $fq_folder/SRX1298579_R2.fastq.gz \
-S SRX1298579_bowtie2_local.sam

The output should look like this (why did we use time in front of the command?):

SRX1298579_bowtie2_local.sam
1982546 reads; of these:
  1982546 (100.00%) were paired; of these:
    141501 (7.14%) aligned concordantly 0 times
    1790110 (90.29%) aligned concordantly exactly 1 time
    50935 (2.57%) aligned concordantly >1 times
    ----
    141501 pairs aligned concordantly 0 times; of these:
      56339 (39.82%) aligned discordantly 1 time
    ----
    85162 pairs aligned 0 times concordantly or discordantly; of these:
      170324 mates make up the pairs; of these:
        152438 (89.50%) aligned 0 times
        11581 (6.80%) aligned exactly 1 time
        6305 (3.70%) aligned >1 times
96.16% overall alignment rate

real	2m49.814s
user	21m40.782s
sys	0m23.227s

SAM is a human-readable format, you can inspect the file with:

less -S SRX1298579_bowtie2_local.sam

But, for visualization, BAM files are faster and more memory-efficient, we need to convert it with samtools:

samtools view -@ 4 -bS SRX1298579_bowtie2_local.sam > SRX1298579_bowtie2_local.bam

Now, the BAM needs to be sorted:

samtools sort -@ 4 SRX1298579_bowtie2_local.bam -o SRX1298579_bowtie2_local.sorted.bam

Then (important) we can remove the SAM and non-sorted BAM files:

rm SRX1298579_bowtie2_local.sam SRX1298579_bowtie2_local.bam

Finally, the sorted BAM needs to be indexed:

samtools index SRX1298579_bowtie2_local.sorted.bam

After this, you should have these two files in your folder:

SRX1298579_bowtie2_local.sorted.bam
SRX1298579_bowtie2_local.sorted.bam.bai

3. Global mapping with bbmap.sh:

bbmap.sh indexes the genome "on the fly" and it doesn't need a separate step, we can directly map the reads in this case:

fq_folder=/home/curso/data/quality_trimmed

time bbmap.sh -Xmx8g threads=4 sam=1.3 trd ref=Fusarium_graminearum.fasta nodisk \
in=$fq_folder/SRX1298579_R#.fastq.gz \
out=SRX1298579_bbmap_global.sam

The output should look like this:

java ...
Version 38.16

Set threads to 4
Retaining first best site only for ambiguous mappings.
Executing dna.FastaToChromArrays2 [Fusarium_graminearum.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=false, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=true]

Set genScaffoldInfo=true
Set genome to 1

Loaded Reference:	0.005 seconds.
Loading index for chunk 1-1, build 1
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index:	6.949 seconds.
Analyzed Index:   	4.317 seconds.
Started output stream:	0.521 seconds.
Cleared Memory:    	0.214 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 8 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7

   ------------------   Results   ------------------

Genome:                	1
Key Length:            	13
Max Indel:             	16000
Minimum Score Ratio:  	0.56
Mapping Mode:         	normal
Reads Used:           	3965092	(388492256 bases)

Mapping:          	135.458 seconds.
Reads/sec:       	29271.71
kBases/sec:      	2867.99


Pairing data:   	pct pairs	num pairs 	pct bases	   num bases

mated pairs:     	 89.8078% 	  1780481 	 89.6562% 	   348307500
bad pairs:       	  4.1190% 	    81661 	  4.0895% 	    15887448
insert size avg: 	  190.59


Read 1 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	 94.9191% 	  1881815 	 94.9309% 	   184003660
unambiguous:     	 94.7210% 	  1877887 	 94.7333% 	   183620692
ambiguous:       	  0.1981% 	     3928 	  0.1976% 	      382968
low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0

perfect best site:	 69.3471% 	  1374838 	 69.5464% 	   134801108
semiperfect site:	 69.3594% 	  1375083 	 69.5589% 	   134825471
rescued:         	  1.0938% 	    21685

Match Rate:      	      NA 	       NA 	 97.3925% 	   182522726
Error Rate:      	 26.9244% 	   506668 	  2.6046% 	     4881194
Sub Rate:        	 26.2610% 	   494184 	  0.7030% 	     1317509
Del Rate:        	  1.3212% 	    24863 	  1.8173% 	     3405773
Ins Rate:        	  1.5849% 	    29824 	  0.0843% 	      157912
N Rate:          	  0.0188% 	      353 	  0.0029% 	        5513


Read 2 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	 94.7927% 	  1879308 	 94.8219% 	   184583201
unambiguous:     	 94.5938% 	  1875365 	 94.6252% 	   184200344
ambiguous:       	  0.1989% 	     3943 	  0.1967% 	      382857
low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0

perfect best site:	 67.6785% 	  1341757 	 67.9388% 	   132251875
semiperfect site:	 67.6925% 	  1342035 	 67.9530% 	   132279405
rescued:         	  0.9496% 	    18826

Match Rate:      	      NA 	       NA 	 96.6448% 	   182851936
Error Rate:      	 28.5852% 	   537205 	  3.3518% 	     6341644
Sub Rate:        	 27.8293% 	   523000 	  0.7969% 	     1507665
Del Rate:        	  1.6644% 	    31279 	  2.4401% 	     4616740
Ins Rate:        	  2.0030% 	    37643 	  0.1148% 	      217239
N Rate:          	  0.0221% 	      415 	  0.0034% 	        6361

Total time:     	148.264 seconds.

real	2m28.642s
user	17m50.183s
sys	0m14.003s

Then, we can process the SAM file as we did for bowtie2, please execute all the following steps (changing filenames according to your own sample):

samtools view -@ 4 -bS SRX1298579_bbmap_global.sam > SRX1298579_bbmap_global.bam
samtools sort -@ 4 SRX1298579_bbmap_global.bam -o SRX1298579_bbmap_global.sorted.bam
rm SRX1298579_bbmap_global.bam
samtools index SRX1298579_bbmap_global.sorted.bam

If successful, you should have these files in your folder:

SRX1298579_bbmap_global.sorted.bam
SRX1298579_bbmap_global.sorted.bam.bai

4. Mapping quality evaluation with qualimap:

Type the following command, and explain the options we will use in the analysis:

qualimap bamqc

Now, execute the commands on the sorted BAM files:

qualimap bamqc -nt 4 -c -outformat PDF:HTML -bam SRX1298579_bbmap_global.sorted.bam
qualimap bamqc -nt 4 -c -outformat PDF:HTML -bam SRX1298579_bowtie2_local.sorted.bam

Create the index (.fai) file to the reference

samtools faidx Fusarium_graminearum.fasta

Everything is ready to start visualizing our results!
Using WinSCP, rsync or scp, transfer the following files and folders to your computer:

SRX1298579_bbmap_global.sorted_stats
SRX1298579_bowtie2_local.sorted_stats
Fusarium_graminearum.fai
Fusarium_graminearum.fasta
Fusarium_graminearum.gff
SRX1298579_bbmap_global.sorted.bam
SRX1298579_bbmap_global.sorted.bam.bai
SRX1298579_bowtie2_local.sorted.bam
SRX1298579_bowtie2_local.sorted.bam.bai

The folders ending in .sorted_stats contain the reports from qualimap, explore them and compare/explain according to what we saw in the lecture.
You also have the required files to visualize the mapping results in IGV. Download and install IGV on your computer from: https://software.broadinstitute.org/software/igv/download

Then take a look at this video: https://www.youtube.com/watch?v=E_G8z_2gTYM for general instructions on IGV. Instead of using the pre-packaged Human genome, you need to load the reference genome used for mapping. From the Genome menu select Load Genome from File... and navigate to where you have the file Fusarium_graminearum.fasta. Then, from the File menu select Load from File... and you can select both .bam files to be seen side by side. Once you have both .bam files loaded with their respective genome, explore the options and focus on the differences in mapping between bowtie2 and bbmap, make some screenshots of such differences and try to explain them based on the features of each program and the parameters with which they were run.

Finally, try loading the annotation track Fusarium_graminearum.gff, it can be done through File/Load from file.... Zoom-in, explore the .bam try to locate mutations (SNPs) that occur within the annotated genes.

Useful IGV videos
Navigation Basics
Sequencing Data Basics
View SNPs

5. Post-mapping processing:

5.1 Mark duplicates

  • Flag all but one of a duplicate set as duplicates in a BAM file. Duplicates have a higher probability of being non-independent measurements from the exact same template DNA. Duplicate inserts are marked by the 0x400 bit (1024 flag) in the second column of a SAM record, for each mate of a pair.
    This allows downstream processes to exclude duplicates from analyses (most do this by default). Certain duplicates, i.e. PCR and sequencer duplicates, violate assumptions of variant calling and also potentially amplify errors. Removing these, even at the cost of removing fortuitous biological duplicates, allows us to be conservative in calculating the confidence of variants (from GATK forum ).

Basic command:

sambamba markdup -t 4 SRX1298579_bowtie2_local.sorted.bam SRX1298579_local.markdup.bam

Exercise

Q1. How many reads are marked as duplicates from the sorted BAM files?
Hint 1: View the standard output
Hint 2: Filter reads by their flag bits using the samtools flagstat option:

samtools view -c -f 1024 SRX1298579_local.markdup.bam

5.2 Add / Replace read groups

Tags assigned to mainly differentiate samples, but also various technical features that are associated with potential artifacts. With this information is possible to assess the effects of those artifacts during the duplicate marking and base recalibration steps. Meaning of the read group fields

samtools addreplacerg -@ 4 -r ID:SRX1298579 -r LB:L1 -r SM:SRX1298579 -r PL:illumina -o SRX1298579_local.RG.bam SRX1298579_local.markdup.bam

Check the read groups in your bam file. What is the -H option in samtools for?

samtools view -H SRX1298579_local.RG.bam | grep "@RG"

You will get the header line of the bam file describing the Read Groups

@RG	ID:SRX1298579	LB:lib1	PL:illumina	SM:SRX1298580	PU:lane1

5.3 Post-mapping Quality check:

The probability that a read is mapped incorrectly is in the function of:

  • uniqueness (i.e. not a multi-mapper)
  • number of mismatches
  • number of indels
  • quality of bases

Suppose you want to know how many alignments included insertions or deletions (INDELS) versus the reference. Looking at the CIGAR string definition, insertions are denoted with the character I and deletions with the character D. Use this information to count the number of INDELS.

Start by just looking at the first few alignment records of the BAM file:

samtools view SRX1298579_local.RG.bam | head

With all the line wrapping, it looks pretty ugly. You can use instead:

samtools view SRX1298579_local.RG.bam | head -50 | column -t | less -S

Now, you can pick out the CIGAR string (column 6 in the SAM format).

samtools view SRX1298579_local.RG.bam | cut -f 6 | head

Next, make sure you're only looking at alignment records that represent mapped reads using the Flagstats. The -F 0x4 option says to filter records where the 0x4 flag (read unmapped) is 0, resulting in only mapped reads being output.

samtools view -F 0x4 SRX1298579_local.RG.bam | cut -f 6 | head

Now, use grep with the pattern [ID] to select lines (which are now just the CIGAR strings of the reads) that have Insert or Deletion operators. The brackets ( [ ] ) denote a character class pattern, which matches any of the characters inside the brackets. Be sure to ask for Perl regular expressions (-P option) to be sure that this character class syntax is recognized.

samtools view -F 0x4 SRX1298579_local.RG.bam | cut -f 6 | grep -P '[ID]' | head

Now, you're ready to run this against all the alignment records, and count the results: Count reads that mapped with INDELS

samtools view -F 0x4 SRX1298579_local.RG.bam | cut -f 6 | grep -P '[ID]' | wc -l

How many records did you find?

What is that in terms of the rate of INDELS?

For that, we need to count the total number of mapped reads. Here you just need to use the -c (count only) option to samtools view.

samtools view -c -F 0x4 SRX1298579_local.RG.bam

Knowing the total amount of mapped reads and how many of them have INDELS you can just divide the numbers to know the rate of INDELS.

  • Plot mapping coverage Get depth table with Samtools and plot it with R
samtools depth SRX1298579_local.RG.bam > SRX1298579_local.coverage.txt

Run in R

SRX1298579_cov <- read.table("SRX1298579_local.coverage.txt", header=FALSE, sep="\t", na.strings="NA")

colnames(SRX1298579_cov) <- c("Chr", "locus", "depth")
pdf("SRX1298579_depthScan.pdf")
plot(SRX1298579_cov$locus, SRX1298579_cov$depth, pch=19, cex=0.3, col="red")
dev.off()