forked from hemberg-lab/scRNA.seq.course
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path14-exprs-qc.Rmd
352 lines (262 loc) · 12.3 KB
/
14-exprs-qc.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
---
output: html_document
---
# Cleaning the Expression Matrix
## Expression QC (UMI) {#exprs-qc}
### Introduction
Once gene expression has been quantified it is summarized as an __expression matrix__ where each row corresponds to a gene (or transcript) and each column corresponds to a single cell. This matrix should be examined to remove poor quality cells which were not detected in either read QC or mapping QC steps. Failure to remove low quality cells at this
stage may add technical noise which has the potential to obscure
the biological signals of interest in the downstream analysis.
Since there is currently no standard method for performing scRNASeq the expected values for the various QC measures that will be presented here can vary substantially from experiment to experiment. Thus, to perform QC we will be looking for cells which are outliers with respect to the rest of the dataset rather than comparing to independent quality standards. Consequently, care should be taken when comparing quality metrics across datasets collected using different protocols.
### Tung dataset
To illustrate cell QC, we consider a
[dataset](http://jdblischak.github.io/singleCellSeq/analysis/) of
induced pluripotent stem cells generated from three different individuals [@Tung2017-ba] in [Yoav Gilad](http://giladlab.uchicago.edu/)'s lab at the
University of Chicago. The experiments were carried out on the
Fluidigm C1 platform and to facilitate the quantification both unique
molecular identifiers (UMIs) and ERCC _spike-ins_ were used. The data files are located in the `tung` folder in your working directory. These files are the copies of the original files made on the 15/03/16. We will use these copies for reproducibility purposes.
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(out.width='90%', fig.align = 'center')
```
```{r, message=FALSE, warning=FALSE}
library(SingleCellExperiment)
library(scater)
options(stringsAsFactors = FALSE)
```
Load the data and annotations:
```{r}
molecules <- read.table("tung/molecules.txt", sep = "\t")
anno <- read.table("tung/annotation.txt", sep = "\t", header = TRUE)
```
Inspect a small portion of the expression matrix
```{r}
head(molecules[ , 1:3])
head(anno)
```
The data consists of `r length(unique(anno$individual))` individuals and `r length(unique(anno$replicate))` replicates and therefore has `r length(unique(anno$batch))` batches in total.
We standardize the analysis by using both `SingleCellExperiment` (SCE) and `scater` packages. First, create the SCE object:
```{r}
umi <- SingleCellExperiment(
assays = list(counts = as.matrix(molecules)),
colData = anno
)
```
Remove genes that are not expressed in any cell:
```{r}
keep_feature <- rowSums(counts(umi) > 0) > 0
umi <- umi[keep_feature, ]
```
Define control features (genes) - ERCC spike-ins and mitochondrial genes ([provided](http://jdblischak.github.io/singleCellSeq/analysis/qc-filter-ipsc.html) by the authors):
```{r}
isSpike(umi, "ERCC") <- grepl("^ERCC-", rownames(umi))
isSpike(umi, "MT") <- rownames(umi) %in%
c("ENSG00000198899", "ENSG00000198727", "ENSG00000198888",
"ENSG00000198886", "ENSG00000212907", "ENSG00000198786",
"ENSG00000198695", "ENSG00000198712", "ENSG00000198804",
"ENSG00000198763", "ENSG00000228253", "ENSG00000198938",
"ENSG00000198840")
```
Calculate the quality metrics:
```{r}
umi <- calculateQCMetrics(
umi,
feature_controls = list(
ERCC = isSpike(umi, "ERCC"),
MT = isSpike(umi, "MT")
)
)
```
### Cell QC
#### Library size
Next we consider the total number of RNA molecules detected per
sample (if we were using read counts rather than UMI counts this would
be the total number of reads). Wells with few reads/molecules are likely to have been broken or failed to capture a cell, and should thus be removed.
```{r total-counts-hist, fig.cap = "Histogram of library sizes for all cells"}
hist(
umi$total_counts,
breaks = 100
)
abline(v = 25000, col = "red")
```
__Exercise 1__
1. How many cells does our filter remove?
2. What distribution do you expect that the
total number of molecules for each cell should follow?
__Our answer__
```{r, echo=FALSE}
filter_by_total_counts <- (umi$total_counts > 25000)
table(filter_by_total_counts)
```
#### Detected genes
In addition to ensuring sufficient sequencing depth for each sample, we also want to make sure that the reads are distributed across the transcriptome. Thus, we count the total number of unique genes detected in each sample.
```{r total-features-hist, fig.cap = "Histogram of the number of detected genes in all cells"}
hist(
umi$total_features,
breaks = 100
)
abline(v = 7000, col = "red")
```
From the plot we conclude that most cells have between 7,000-10,000 detected genes,
which is normal for high-depth scRNA-seq. However, this varies by
experimental protocol and sequencing depth. For example, droplet-based methods
or samples with lower sequencing-depth typically detect fewer genes per cell. The most notable feature in the above plot is the __"heavy tail"__ on the left hand side of the
distribution. If detection rates were equal across the cells then the
distribution should be approximately normal. Thus we remove those
cells in the tail of the distribution (fewer than 7,000 detected genes).
__Exercise 2__
How many cells does our filter remove?
__Our answer__
```{r, echo=FALSE}
filter_by_expr_features <- (umi$total_features > 7000)
table(filter_by_expr_features)
```
#### ERCCs and MTs
Another measure of cell quality is the ratio between ERCC _spike-in_
RNAs and endogenous RNAs. This ratio can be used to estimate the total amount
of RNA in the captured cells. Cells with a high level of _spike-in_ RNAs
had low starting amounts of RNA, likely due to the cell being
dead or stressed which may result in the RNA being degraded.
```{r mt-vs-counts, fig.cap = "Percentage of counts in MT genes"}
plotColData(
umi,
x = "total_features",
y = "pct_counts_MT",
colour = "batch"
)
```
```{r ercc-vs-counts, fig.cap = "Percentage of counts in ERCCs"}
plotColData(
umi,
x = "total_features",
y = "pct_counts_ERCC",
colour = "batch"
)
```
The above analysis shows that majority of the cells from NA19098.r2 batch have a very high ERCC/Endo ratio. Indeed, it has been shown by the authors that this batch contains cells of smaller size.
__Exercise 3__
Create filters for removing batch NA19098.r2 and cells with high expression of mitochondrial genes (>10% of total counts in a cell).
__Our answer__
```{r, echo=FALSE}
filter_by_ERCC <- umi$batch != "NA19098.r2"
table(filter_by_ERCC)
filter_by_MT <- umi$pct_counts_MT < 10
table(filter_by_MT)
```
__Exercise 4__
What would you expect to see in the ERCC vs counts plot if you were examining a dataset containing cells of different sizes (eg. normal & senescent cells)?
__Answer__
You would expect to see a group corresponding to the smaller cells (normal) with a higher fraction of ERCC reads than a separate group corresponding to the larger cells (senescent).
### Cell filtering
#### Manual
Now we can define a cell filter based on our previous analysis:
```{r}
umi$use <- (
# sufficient features (genes)
filter_by_expr_features &
# sufficient molecules counted
filter_by_total_counts &
# sufficient endogenous RNA
filter_by_ERCC &
# remove cells with unusual number of reads in MT genes
filter_by_MT
)
```
```{r}
table(umi$use)
```
#### Automatic
Another option available in `scater` is to conduct PCA on a set of QC metrics and then use automatic outlier detection to identify potentially problematic cells.
By default, the following metrics are used for PCA-based outlier detection:
* __pct_counts_top_100_features__
* __total_features__
* __pct_counts_feature_controls__
* __n_detected_feature_controls__
* __log10_counts_endogenous_features__
* __log10_counts_feature_controls__
`scater` first creates a matrix where the rows represent cells and
the columns represent the different QC metrics. Then, outlier cells
can also be identified by using the `mvoutlier` package
on the QC metrics for all cells. This will identify cells that have
substantially different QC metrics from the others, possibly corresponding
to low-quality cells. We can visualize any outliers using a
principal components plot as shown below:
```{r auto-cell-filt, fig.align='center', fig.cap="PCA plot used for automatic detection of cell outliers", message=FALSE, warning=FALSE, out.width='90%'}
umi <- runPCA(umi, use_coldata = TRUE,
detect_outliers = TRUE)
reducedDimNames(umi)
```
Column subsetting can then be performed based on the `$outlier`
slot, which indicates whether or not each cell has been
designated as an outlier. Automatic outlier detection can
be informative, but a close inspection of QC metrics and
tailored filtering for the specifics of the dataset at
hand is strongly recommended.
```{r}
table(umi$outlier)
```
Then, we can use a PCA plot to see a 2D representation
of the cells ordered by their quality metrics.
```{r}
plotReducedDim(umi, use_dimred = "PCA_coldata",
size_by = "total_features", shape_by = "use",
colour_by="outlier")
```
### Compare filterings
__Exercise 5__
Compare the default, automatic and manual cell filters. Plot a Venn diagram of the outlier cells from these filterings.
__Hint__: Use `vennCounts` and `vennDiagram` functions from the [limma](https://bioconductor.org/packages/release/bioc/html/limma.html) package to make a Venn diagram.
__Answer__
```{r cell-filt-comp, fig.cap = "Comparison of the default, automatic and manual cell filters", echo=FALSE}
library(limma)
auto <- colnames(umi)[umi$outlier]
man <- colnames(umi)[!umi$use]
venn.diag <- vennCounts(
cbind(colnames(umi) %in% auto,
colnames(umi) %in% man)
)
vennDiagram(
venn.diag,
names = c("Automatic", "Manual"),
circle.col = c("blue", "green")
)
```
### Gene analysis
#### Gene expression
In addition to removing cells with poor quality, it is usually a good idea to exclude genes where we suspect that technical artefacts may have skewed the results. Moreover, inspection of the gene expression profiles may provide insights about how the experimental procedures could be improved.
It is often instructive to consider the number of reads consumed by the top 50 expressed genes.
```{r top50-gene-expr, fig.cap = "Number of total counts consumed by the top 50 expressed genes", fig.asp = 1}
plotHighestExprs(umi, exprs_values = "counts")
```
The distributions are relatively flat indicating (but not guaranteeing!) good coverage of the full transcriptome of these cells. However, there are several spike-ins in the top 15 genes which suggests a greater dilution of the spike-ins may be preferrable if the experiment is to be repeated.
#### Gene filtering
It is typically a good idea to remove genes whose expression level is considered __"undetectable"__. We define a gene as detectable if at least two cells contain more than 1 transcript from the gene. If we were considering read counts rather than UMI counts a reasonable threshold is to require at least five reads in at least two cells. However, in both cases the threshold strongly depends on the sequencing depth. It is important to keep in mind that genes must be filtered after cell filtering since some genes may only be detected in poor quality cells (__note__ `colData(umi)$use` filter applied to the `umi` dataset).
```{r}
keep_feature <- nexprs(umi[,colData(umi)$use], byrow=TRUE,
detection_limit=1) >= 2
rowData(umi)$use <- keep_feature
```
```{r}
table(keep_feature)
```
Depending on the cell-type, protocol and sequencing depth, other cut-offs may be appropriate.
### Save the data
Dimensions of the QCed dataset (do not forget about the gene filter we defined above):
```{r}
dim(umi[rowData(umi)$use, colData(umi)$use])
```
Let's create an additional slot with log-transformed counts (we will need it in the next chapters) and remove saved PCA results from the `reducedDim` slot:
```{r}
assay(umi, "logcounts_raw") <- log2(counts(umi) + 1)
reducedDim(umi) <- NULL
```
Save the data:
```{r}
saveRDS(umi, file = "tung/umi.rds")
```
### Big Exercise
Perform exactly the same QC analysis with read counts of the same Blischak data. Use `tung/reads.txt` file to load the reads. Once you have finished please compare your results to ours (next chapter).
### sessionInfo()
```{r echo=FALSE}
sessionInfo()
```