Skip to content
samtools edited this page Apr 5, 2012 · 5 revisions

1. Between single- and multi-sample variant calling, which is preferred?

By using multi-sample calling, we gain power on SNPs shared between samples, but lose power on singleton SNPs. Here is a way of thinking of this. Suppose we have 1% false positive rate (FPR) for variant calling from one sample. If we call SNPs from 100 samples separately and then combine the calls, the FPR would be around 10-20% (not 100% because more SNPs are found given 100 samples). To retain an acceptable FPR on singletons, we have to be more stringent on each sample and thus lose power. Combining single-sample calls naively would not increase power on shared SNPs. This is where multi-sample calling does better: by taking the advantage of correlation between samples, we are able to call a SNP if it appears in multiple samples, but too weak to call in each sample individually. Joint calling is particularly preferable if we have multiple low-coverage samples for which single-sample calling does not work well. It is also able to reveal some artifacts only detectable with many samples.

In all, if you have deep coverage and need to study each sample separately, you should use single-sample calling. If you have low-coverage data or only care about variants from multiple samples as a whole, you should use multi-sample calling. Understanding the difference between single- and multi-sample calling also helps experimental design: if you only want to get a set of SNPs from many samples or to do association studies, sequencing to deep coverage is a waste. You pay much more only to get marginal reward.

2. "Samtools mpileup" reports bases with quality lower than the "-Q" value. Is it a bug?

The "-Q" option in Samtools-0.1.18 and prior sets a threshold for SNP calling, but it does not change the "mpileup" output. The version in git has changed this behavior. The latest version only outputs bases with high BAQ-adjusted base quality.

3. In the pileup output, the base quality is occasionally smaller than the read quality in BAM. Bug?

By default, samtools computes Base Alignment Quality (BAQ) for each base. The quality column in pileup gives the minimum between BAQ and the original base quality, which may be lower than the base quality if the base is likely to be affected by indels nearby. It is a feature.

4. When using "samtools mpileup -r" to specify a region, the output sometimes differs from that without "-r". Bug?

When mpileup switches from a reference sequence to the next sequence, there is a small delay in reference loading. The first few alignments are processed without the right reference in place and in this case, BAQ cannot be applied to these sequences. This is not an issue when a region is specified for recent samtools (version 0.1.17 or later). Thus the differences with/without "-r" is caused by BAQ. Only the first few alignments on each reference sequence are affected and the effect is largely negligible.

For samtools 0.1.16 or prior, BAQ is not applied to the first few alignments overlapping the region specified with "-r". It is advised to use a more recent version.