forked from hemberg-lab/scRNA.seq.course
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path26-dropouts.Rmd
399 lines (349 loc) · 13.9 KB
/
26-dropouts.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
---
# knit: bookdown::preview_chapter
output: html_document
---
## Feature Selection
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(fig.align = "center")
```
```{r, echo=TRUE, message=FALSE, warning=FALSE}
library(scRNA.seq.funcs)
library(matrixStats)
library(M3Drop)
library(RColorBrewer)
library(SingleCellExperiment)
set.seed(1)
```
Single-cell RNASeq is capable of measuring the expression of many
thousands of genes in every cell. However, in most situations only a
portion of those will show a response to the biological condition of
interest, e.g. differences in cell-type, drivers of differentiation,
respond to an environmental stimulus. Most genes detected in a scRNASeq
experiment will only be detected at different levels due to technical
noise. One consequence of this is that technical noise and batch
effects can obscure the biological signal of interest.
Thus, it is often advantageous to perform feature selection to remove
those genes which only exhibit technical noise from downstream analysis.
Not only does this generally increase the signal:noise ratio in the data;
it also reduces the computational complexity of analyses, by reducing
the total amount of data to be processed.
For scRNASeq data, we will be focusing on unsupervised methods of feature
selection which don't require any a priori information, such as cell-type
labels or biological group, since they are not available, or may be unreliable,
for many experiments. In contrast, differential expression (chapter \@ref(dechapter))
can be considered a form of supervised feature selection since it uses the
known biological label of each sample to identify features (i.e. genes) which
are expressed at different levels across groups.
For this section we will continue working with the Deng data.
```{r}
deng <- readRDS("deng/deng-reads.rds")
cellLabels <- colData(deng)$cell_type2
```
This data can be QCed and normalized for library size using M3Drop,
which removes cells with few detected genes, removes
undetected genes, and converts raw counts to CPM.
```{r}
deng_list <- M3DropCleanData(
counts(deng),
labels = cellLabels,
min_detected_genes = 100,
is.counts = TRUE
)
expr_matrix <- deng_list$data # Normalized & filtered expression matrix
celltype_labs <- factor(deng_list$labels) # filtered cell-type labels
cell_colors <- brewer.pal(max(3,length(unique(celltype_labs))), "Set3")
```
__Exercise 1__: How many cells & genes have been removed by this filtering?
```{r, include=FALSE}
ncol(counts(deng)) - ncol(deng_list$data)
```
### Identifying Genes vs a Null Model
There are two main approaches to unsupervised feature selection. The
first is to identify genes which behave differently from a null model
describing just the technical noise expected in the dataset.
If the dataset contains spike-in RNAs they can be used to directly model
technical noise. However, measurements of spike-ins may not experience
the same technical noise as endogenous transcripts [(Svensson et al., 2017)](https://www.nature.com/nmeth/journal/v14/n4/full/nmeth.4220.html).
In addition, scRNASeq experiments often contain only a small number of
spike-ins which reduces our confidence in fitted model parameters.
#### Highly Variable Genes
The first method proposed to identify features in scRNASeq datasets
was to identify highly variable genes (HVG). HVG assumes that if genes
have large differences in expression across cells some of those differences
are due to biological difference between the cells rather than technical noise.
However, because of the nature of count data, there is a positive relationship
between the mean expression of a gene and the variance in the read counts across
cells. This relationship must be corrected for to properly identify HVGs.
__Exercise 2__
Using the functions rowMeans and rowVars to plot the relationship between mean expression
and variance for all genes in this dataset. (Hint: use log="xy" to plot on a log-scale).
```{r, echo=FALSE, fig.width = 8.5, fig.height = 6}
plot(
rowMeans(expr_matrix),
rowVars(expr_matrix),
log="xy",
pch=16,
xlab="Mean Expression",
ylab="Variance",
main=""
)
```
A popular method to correct for the relationship between variance and mean expression
was proposed by [Brennecke et al.](http://www.nature.com/nmeth/journal/v10/n11/full/nmeth.2645.html).
To use the Brennecke method, we first normalize for library size then calculate
the mean and the square coefficient of variation (variation divided by
the squared mean expression). A quadratic curve is fit to the relationship
between these two variables for the ERCC spike-in, and then a chi-square test is used to find genes
significantly above the curve. This method is included in the M3Drop package as the
Brennecke_getVariableGenes(counts, spikes) function. However, this dataset does not contain spike-ins
so we will use the entire dataset to estimate the technical noise.
In the figure below the red curve
is the fitted technical noise model and the dashed line is the 95%
CI. Pink dots are the genes with significant biological variability
after multiple-testing correction.
```{r, fig.width = 7, fig.height = 6}
Brennecke_HVG <- BrenneckeGetVariableGenes(
expr_matrix,
fdr = 0.01,
minBiolDisp = 0.5
)
HVG_genes <- Brennecke_HVG$Gene
```
#### High Dropout Genes
An alternative to finding HVGs is to identify genes with unexpectedly high numbers of zeros.
The frequency of zeros, know as the "dropout rate", is very closely related to expression level
in scRNASeq data. Zeros are the dominant feature of single-cell RNASeq data, typically accounting
for over half of the entries in the final expression matrix. These zeros predominantly result
from the failure of mRNAs failing to be reversed transcribed [(Andrews and Hemberg, 2016)](http://www.biorxiv.org/content/early/2017/05/25/065094). Reverse transcription
is an enzyme reaction thus can be modelled using the Michaelis-Menten equation:
$$P_{dropout} = 1 - S/(K + S)$$
where $S$ is the mRNA concentration in the cell (we will estimate this as average expression)
and $K$ is the Michaelis-Menten constant.
Because the Michaelis-Menten equation is a convex non-linear function, genes which are
differentially expression across two or more populations of cells in our dataset will
be shifted up/right of the Michaelis-Menten model (see Figure below).
```{r, fig.width = 8.5, fig.height = 6, echo=TRUE}
K <- 49
S_sim <- 10^seq(from = -3, to = 4, by = 0.05) # range of expression values
MM <- 1 - S_sim / (K + S_sim)
plot(
S_sim,
MM,
type = "l",
lwd = 3,
xlab = "Expression",
ylab = "Dropout Rate",
xlim = c(1,1000)
)
S1 <- 10
P1 <- 1 - S1 / (K + S1) # Expression & dropouts for cells in condition 1
S2 <- 750
P2 <- 1 - S2 / (K + S2) # Expression & dropouts for cells in condition 2
points(
c(S1, S2),
c(P1, P2),
pch = 16,
col = "grey85",
cex = 3
)
mix <- 0.5 # proportion of cells in condition 1
points(
S1 * mix + S2 * (1 - mix),
P1 * mix + P2 * (1 - mix),
pch = 16,
col = "grey35",
cex = 3
)
```
__Note__: add `log="x"` to the `plot` call above to see how this looks on the log scale, which is used in M3Drop figures.
__Exercise 3__: Produce the same plot as above with different expression levels (S1 & S2) and/or mixtures (mix).
```{r, include=FALSE}
plot(
S_sim,
MM,
type = "l",
lwd = 3,
xlab = "Expression",
ylab = "Dropout Rate",
xlim = c(1, 1000),
log = "x"
)
S1 <- 100
P1 <- 1 - S1 / (K + S1) # Expression & dropouts for cells in condition 1
S2 <- 1000
P2 <- 1 - S2 / (K + S2) # Expression & dropouts for cells in condition 2
points(
c(S1, S2),
c(P1, P2),
pch = 16,
col = "grey85",
cex = 3
)
mix <- 0.75 # proportion of cells in condition 1
points(
S1 * mix + S2 * (1 - mix),
P1 * mix + P2 * (1 - mix),
pch = 16,
col = "grey35",
cex = 3
)
```
We use M3Drop to identify significant outliers to the right of the MM
curve. We also apply 1% FDR multiple testing correction:
```{r, fig.width = 7, fig.height = 6}
M3Drop_genes <- M3DropFeatureSelection(
expr_matrix,
mt_method = "fdr",
mt_threshold = 0.01
)
M3Drop_genes <- M3Drop_genes$Gene
```
An alternative method is contained in the M3Drop package that is tailored specifically for
UMI-tagged data which generally contains many zeros resulting from low sequencing coverage
in addition to those resulting from insufficient reverse-transcription. This model is the
Depth-Adjusted Negative Binomial (DANB). This method describes each expression observation
as a negative binomial model with a mean related to both the mean expression of the
respective gene and the sequencing depth of the respective cell, and a variance related to
the mean-expression of the gene.
Unlike the Michaelis-Menten and HVG methods there isn't a reliable statistical test for features
selected by this model, so we will consider the top 1500 genes instead.
```{r, fig.width=8, fig.height=5}
deng_int <- NBumiConvertToInteger(counts(deng))
DANB_fit <- NBumiFitModel(deng_int) # DANB is fit to the raw count matrix
# Perform DANB feature selection
DropFS <- NBumiFeatureSelectionCombinedDrop(DANB_fit)
DANB_genes <- names(DropFS[1:1500])
```
### Correlated Expression
A completely different approach to feature selection is to use gene-gene correlations. This method
is based on the idea that multiple genes will be differentially expressed between different cell-types
or cell-states. Genes which are expressed in the same cell-population will be positively correlated
with each other where as genes expressed in different cell-populations will be negatively correated with
each other. Thus important genes can be identified by the magnitude of their correlation
with other genes.
The limitation of this method is that it assumes technical noise is random and independent for each cell,
thus shouldn't produce gene-gene correlations, but this assumption is violated by batch effects which are
generally systematic between different experimental batches and will produce gene-gene correlations. As a
result it is more appropriate to take the top few thousand genes as ranked by gene-gene correlation than
consider the significance of the correlations.
```{r, eval=FALSE}
cor_mat <- cor(t(expr_matrix), method = "spearman") # Gene-gene correlations
diag(cor_mat) <- rep(0, times = nrow(expr_matrix))
score <- apply(cor_mat, 1, function(x) {max(abs(x))}) #Correlation of highest magnitude
names(score) <- rownames(expr_matrix);
score <- score[order(-score)]
Cor_genes <- names(score[1:1500])
```
Lastly, another common method for feature selection in scRNASeq data is to use PCA loadings. Genes with
high PCA loadings are likely to be highly variable and correlated with many other variable genes, thus
may be relevant to the underlying biology. However, as with gene-gene correlations PCA loadings tend to
be susceptible to detecting systematic variation due to batch effects; thus it is recommended to plot the PCA
results to determine those components corresponding to the biological variation rather than batch effects.
```{r, fig.width=7, fig.height=6}
# PCA is typically performed on log-transformed expression data
pca <- prcomp(log(expr_matrix + 1) / log(2))
# plot projection
plot(
pca$rotation[,1],
pca$rotation[,2],
pch = 16,
col = cell_colors[as.factor(celltype_labs)]
)
# calculate loadings for components 1 and 2
score <- rowSums(abs(pca$x[,c(1,2)]))
names(score) <- rownames(expr_matrix)
score <- score[order(-score)]
PCA_genes <- names(score[1:1500])
```
__Exercise 4__
Consider the top 5 principal components. Which appear to be most biologically relevant? How does the top 1,500
features change if you consider the loadings for those components?
```{r, include=FALSE}
plot(
pca$rotation[,2],
pca$rotation[,3],
pch = 16,
col = cell_colors[as.factor(celltype_labs)]
)
plot(
pca$rotation[,3],
pca$rotation[,4],
pch = 16,
col = cell_colors[as.factor(celltype_labs)]
)
# calculate loadings for components 1 and 2
score <- rowSums(abs(pca$x[,c(2, 3, 4)]))
names(score) <- rownames(expr_matrix)
score <- score[order(-score)]
PCA_genes2 = names(score[1:1500])
```
### Comparing Methods
We can check whether the identified features really do represent genes differentially expressed between
cell-types in this dataset.
```{r, fig.width = 7, fig.height = 10}
M3DropExpressionHeatmap(
M3Drop_genes,
expr_matrix,
cell_labels = celltype_labs
)
```
We can also consider how consistent each feature selection method is with the others using the Jaccard Index:
```{r}
J <- sum(M3Drop_genes %in% HVG_genes)/length(unique(c(M3Drop_genes, HVG_genes)))
```
__Exercise 5__
Plot the expression of the features for each of the other methods. Which appear to be differentially expressed? How consistent are the different methods for this dataset?
```{r, eval=FALSE, include=FALSE, fig.width = 7, fig.height = 10}
M3DropExpressionHeatmap(
HVG_genes,
expr_matrix,
cell_labels = celltype_labs
)
```
```{r, eval=FALSE, include=FALSE, fig.width = 7, fig.height = 10}
M3DropExpressionHeatmap(
Cor_genes,
expr_matrix,
cell_labels = celltype_labs
)
```
```{r, eval=FALSE, include=FALSE, fig.width = 7, fig.height = 10}
M3DropExpressionHeatmap(
PCA_genes,
expr_matrix,
cell_labels = celltype_labs
)
```
```{r, eval=FALSE, include=FALSE, fig.width = 7, fig.height = 10}
M3DropExpressionHeatmap(
PCA_genes2,
expr_matrix,
cell_labels = celltype_labs
)
```
```{r, eval=FALSE, include=FALSE}
list_of_features <- list(
M3Drop_genes,
HVG_genes,
Cor_genes,
PCA_genes,
PCA_genes2
)
Out <- matrix(
0,
ncol = length(list_of_features),
nrow = length(list_of_features)
)
for(i in 1:length(list_of_features) ) {
for(j in 1:length(list_of_features) ) {
Out[i,j] <- sum(list_of_features[[i]] %in% list_of_features[[j]])/
length(unique(c(list_of_features[[i]], list_of_features[[j]])))
}
}
colnames(Out) <- rownames(Out) <- c("M3Drop", "HVG", "Cor", "PCA", "PCA2")
```
### sessionInfo()
```{r echo=FALSE}
sessionInfo()
```