forked from FredHutch/tfcb_2022
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathLecture16_Rsamtools.Rmd
140 lines (111 loc) · 4.91 KB
/
Lecture16_Rsamtools.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
---
title: "MCB536 Lecture 16 (Part 2): Sequence Data Analysis in R"
author: "Gavin Ha"
date: "11/30/2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Loading and querying a BAM file using Rsamtools
The BAM file is the primary input for Rsamtools. There are two initial steps:
a. Define the genomic coordinates and components to query (`ScanBamParam`)
b. Scan the BAM file (`scanBam`)
For this tutorial, we will be using the same data and example from Lecture 15: Slides 16-24.
The BAM file can be downloaded at https://www.dropbox.com/home/File%20requests/TFCB_tutorials
For more information, refer to https://bioconductor.org/packages/release/bioc/vignettes/Rsamtools/inst/doc/Rsamtools-Overview.pdf
## 0 Install and load the `Rsamtools` Bioconductor package
Bioconductor package providing functions to interface with aligned BAM files.
https://bioconductor.org/packages/release/bioc/html/Rsamtools.html
```{r, message=FALSE}
#BiocManager::install("Rsamtools")
library(Rsamtools)
```
## 1. Setup parameters for scanning BAM file
### a. Specify the genomic location of interest to query in the BAM file.
This will make use of two of the packages (`GRanges`) that we will describe in more detail later.
```{r}
whichRanges <- GRanges(seqnames = "17",
IRanges(start = 37844393, end = 37844393))
```
### b. Specify which fields to return in the query.
To find out the default fields to return, use `scanBamWhat()`
```{r}
whatFields <- scanBamWhat()
```
### c. Specify the filters to use to include or exclude reads.
This is an essential concept in analyzing sequence data. First, specify the status of the reads based on the `FLAG` (recall Lecture 15: Slide 22).
For more details, use `?scanBamFlag`
```{r}
flag <- scanBamFlag(isDuplicate = FALSE) # exclude PCR duplicate reads
```
Next, specify additional filters to use including `mapqFilter`, `tagFilter`. These are included in the final `scanBamParam` object instantiation, along with all the previous arguments.
```{r}
param <- ScanBamParam(flag = flag, which = whichRanges, what = whatFields,
mapqFilter = 30, tag = c("RG"))
param
```
## 2. Query the BAM file
Using the params we just defined, we will query the BAM file `BRCA_IDC_cfDNA.bam`.
```{r}
bamFile <- "BRCA_IDC_cfDNA.bam"
bam <- scanBam(bamFile, param = param)
```
This returns a `list` object with each element representing a read. For each element/read, there is another `list` with the fields in the BAM file we requested with `scanBamWhat()`. Here is a breakdown of what is in the first read.
Refer to Lecture 15: Slides 22.
```{r}
bam[[1]]$qname # reqd query name
bam[[1]]$flag # bitwise flag describing the read alignment
bam[[1]]$rname # reference sequence name
bam[[1]]$pos # position of aligned read (leftmost coordinate)
bam[[1]]$mapq # mapping quality of the read alignment
bam[[1]]$cigar # CIGAR string
bam[[1]]$mrnm # mate read's reference sequence name
bam[[1]]$mpos # mate read's aligned position
bam[[1]]$isize # insert size or templent length; aka fragment size
as.character(bam[[1]]$seq) # sequence of mapped reads on forward strand
as.character(bam[[1]]$qual) # base qualities of the sequence aligment
bam[[1]]$tag # value for the tag we specified
```
## Exercise 2: Extract sequence data information
### a. Create a range for `11:69462758-69462758`.
```{r}
# GRanges()
```
### b. Specify the BAM query parameters.
```{r}
# scanBamWhat()
# scanBamFlag()
# ScanBamParam()
```
### c. What is the sequence of the read at `11:69462758-69462758`?
```{r}
# scanBam()
```
# 2. Compute "Pile-Up" Statistics
**For your reference**
The pileup is a term referring to counting the alleles from all the reads at a given genomic locus. It is the data that many variant and mutation calling algorithms use to determine variant status and allelic fractions.
There are 3 steps:
a. Define the genomic coordinates and read components to query (`ScanBamParam`) - same as before
b. Define the pileup-specific parameters, such as filters (`PileupParam`)
c. Run the `pileup` command
### a. Set up the pileup parameters: `PileupParam`.
The `PileupParam()` function will allow for specifying criteria such as minimum read depth,
https://www.rdocumentation.org/packages/Rsamtools/versions/1.24.0/topics/pileup
```{r}
pu.param <- PileupParam() # default settings
```
### b. Set up scanBam parameters: `ScanBamParam`.
Let's try generating the pileup for `17:37883255-37883260`.
```{r}
whichRanges2 <- GRanges(seqnames = "17",
IRanges(start = 37883255, end = 37883260))
param <- ScanBamParam(flag = flag, which = whichRanges2,
what = whatFields, mapqFilter = 30)
```
### c. Generate the pipeline at `17:37883255-37883260`.
The `pileup` command outputs a `data.frame` object containing the counts for each allele at every base specified in `param`.
```{r}
pu <- pileup(file = bamFile, scanBamParam = param, pileupParam = pu.param)
pu
```