---
layout: reveal_markdown
title: "Differential Expression Analysis"
tags: slides
date: 2022-01-22
---
### Transcriptome
<img src="images/diff_exp/transcriptome.png" width="900">
---
### Transcriptome
<img src="images/diff_exp/alternative_splicing.png" width="900">
---
### Gene Expression Analysis
1. Quality Control
2. Estimate Expression Level
3. Normalize Across Samples
4. Perform Differential Expression Analysis
5. Perform Enriched Pathway and Transcription Factor Analysis
---
### Microarray Analysis
1. Quality Control
2. Estimate Expression Level: RMA, GCRMA, dChip
3. Normalize Across Samples: Quantile Normalization, Scaling
4. Perform Differential Expression Analysis: limma
5. Perform Enriched Pathway and Transcription Factor Analysis
---
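### Microarray Analysis
Quantile normalization, listed above as one way to normalize across samples, can be sketched as follows. This is a minimal illustration and not the code of any particular package: the function name `quantile_normalize`, the tie handling (ties broken by position) and the intensity values are assumptions made for the example.

```python
def quantile_normalize(columns):
    """Quantile-normalize samples given as equal-length lists (one list per array)."""
    n = len(columns[0])
    # Sort each sample, then average across samples at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    normalized = []
    for col in columns:
        # Indices ordered by value; ties are broken by position (a simplification).
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities: three arrays, four probes each.
arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(arrays)
for col in normalized:
    print(col)
```

After normalization every array has the same set of values; only the order of the probes within each array differs.
---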
### Microarray Analysis
<img src="images/diff_exp/crop_circles.png" width="800">
---
### Microarray Analysis
<img src="images/diff_exp/ma_pre-norm.png" width="900">
<img src="images/diff_exp/ma_post-norm.png" width="900">
---
### Microarray Analysis
<img src="images/diff_exp/probe_effects.png" width="900">
---
### Microarray Analysis
1. dChip: Analyzes multiple chips simultaneously
2. For array $i$, probe $j$ and number of probe pairs $J$
3. $p_{ij} = PM_{ij} - MM_{ij} = \theta_{i} \phi_{j} + e_{ij}$
4. $\theta_{i}$: relative expression level; $\phi_{j}$: relative affinity
5. $\sum_{j=1}^{J} \phi_{j}^{2} = J$ (constraint)
6. $e_{ij} \sim N(0,\sigma^2)$: additive error model
7. Iteratively fit equations excluding outlier $\theta_{i}$ and $\phi_{j}$
8. Effective expression estimate $\theta_{i} = \frac{1}{J}\sum_{j} p_{ij} \phi_{j}$
---
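### Microarray Analysis
The Li-Wong fit can be sketched as alternating least squares on $p_{ij}$. This is a simplified illustration: it omits the outlier exclusion step, the function name `fit_li_wong` is hypothetical, and the data are noise-free values made up for the example.

```python
import math


def fit_li_wong(p, n_iter=50):
    """Alternating least squares for p[i][j] ~ theta[i] * phi[j] (no outlier exclusion)."""
    n_arrays, n_probes = len(p), len(p[0])
    phi = [1.0] * n_probes  # starting values satisfy sum(phi^2) = J
    for _ in range(n_iter):
        # Given phi, the least squares theta_i; the constraint makes the denominator J.
        theta = [sum(p[i][j] * phi[j] for j in range(n_probes)) / n_probes
                 for i in range(n_arrays)]
        # Given theta, the least squares phi_j.
        denom = sum(t * t for t in theta)
        phi = [sum(p[i][j] * theta[i] for i in range(n_arrays)) / denom
               for j in range(n_probes)]
        # Rescale so that sum(phi^2) = J.
        scale = math.sqrt(n_probes / sum(f * f for f in phi))
        phi = [f * scale for f in phi]
    # Final expression estimate: theta_i = (1/J) * sum_j p_ij * phi_j.
    theta = [sum(p[i][j] * phi[j] for j in range(n_probes)) / n_probes
             for i in range(n_arrays)]
    return theta, phi


# Hypothetical noise-free data: 3 arrays, J = 4 probe pairs.
true_theta = [100.0, 200.0, 400.0]
true_phi = [0.5, 1.0, 1.5, math.sqrt(0.5)]  # sum of squares equals 4
p = [[t * f for f in true_phi] for t in true_theta]
theta, phi = fit_li_wong(p)
print(theta)
print(phi)
```

With noise-free input the fit recovers the values used to build the data; with real data the residuals $e_{ij}$ would be used to flag outliers before refitting.
---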
### Microarray Analysis
1. Problems with dChip (Li-Wong model)
2. $\log(PM)$, $\log(MM)$ tend to be normally distributed
3. $MM$ tends to capture significant amount of intended target: lowers sensitivity
4. $MM$ introduces second "noisy" intensity: increases variance
---
### Microarray Analysis
1. Robust Multiarray Average (RMA) (Li-Wong on log-scale; no MM):
$\log_{2}(PM_{ij}) = \log_{2}(\theta_{i}) + \log_{2}(\phi_{j}) + b + e_{ij}$
2. Estimate relative $\log_{2} \theta_{i}$ robustly: median polish
---
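### Microarray Analysis
Median polish alternately sweeps row and column medians out of a matrix. The sketch below applies it to a made-up $\log_{2}(PM)$ matrix for one probe set (rows: arrays, columns: probes). The function name `median_polish` and the fixed iteration count are assumptions; this is not the RMA implementation itself.

```python
from statistics import median


def median_polish(x, n_iter=10):
    """Tukey median polish of x[i][j] (rows: arrays, columns: probes)."""
    n_rows, n_cols = len(x), len(x[0])
    resid = [row[:] for row in x]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians out of the residuals.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        # Move the median of the column effects into the overall term.
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column medians out of the residuals.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        # Move the median of the row effects into the overall term.
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


# Hypothetical additive log2 intensities: 3 arrays, 5 probes.
row_true = [-1.0, 0.0, 1.0]
col_true = [-0.5, -0.25, 0.0, 0.25, 0.5]
log_pm = [[8.0 + a + b for b in col_true] for a in row_true]
overall, row_eff, col_eff, resid = median_polish(log_pm)
expression = [overall + r for r in row_eff]  # per-array log2 expression estimate
print(expression)
```

Here the overall term plus the row effect plays the role of the relative $\log_{2} \theta_{i}$ for each array, and the column effects play the role of the probe affinities.
---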
### Microarray Analysis
<img src="images/diff_exp/rma_background_model.png" width="1000">
---
### Microarray Analysis
1. GCRMA
2. Background strongly depends on probe sequence, so modeled as function of sequence ($S_{ij}$) of probes
3. RMA with sequence dependent background correction
$\log_{2}(PM_{ij}) = \log_{2}(\theta_{i}) + \log_{2}(\phi_{j}) + b(S_{ij},MM_{ij}) + e_{ij}$
---
### Microarray Analysis
1. Linear Models for Microarray Data (limma)
2. Assume linear model: $E[\mathbf{y}\_{j}] = \mathbf{X} \alpha\_{j}$
3. $\mathbf{y}_{j}$ expression values for gene $j$
4. $\mathbf{X}$ is the design matrix
5. $\alpha_{j}$ is vector of coefficients
6. $\mathbf{y}^{T}_{j}$: jth row of expression matrix (log-intensities)
7. Contrasts: $\beta_{j} = \mathbf{C}^{T} \alpha_{j}$ ($\mathbf{C}$: contrast matrix)
---
### Microarray Analysis
1. Significance analysis: moderated t-statistic
2. Uses borrowed information from ensemble of genes
3. Ordinary t-statistic with
1. standard errors shrunk toward common value
2. increased degrees of freedom (greater reliability associated with smoothed standard errors)
---
### Microarray Analysis
1. Linear model for gene $j$ has residual variance $\sigma_{j}^2$ with sample value $s_{j}^2$ and degrees of freedom $f_{j}$
2. Covariance matrix of estimated $\hat{\beta}\_{j}$ is $\sigma\_{j}^2 \mathbf{C}^T(\mathbf{X}^{T} \mathbf{V}\_{j} \mathbf{X})^{-1} \mathbf{C}$
3. $\mathbf{V}\_{j}$ is a weight matrix: prior weights, covariance terms introduced by correlation structure and iterative weights introduced by robust estimation
4. Unscaled standard deviation ($u_{jk}$): square roots of diagonal elements of $\mathbf{C}^T(\mathbf{X}^{T} \mathbf{V}\_{j} \mathbf{X})^{-1} \mathbf{C}$
---
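### Microarray Analysis
For a single gene and a two-group design, the quantities above reduce to simple sums. The sketch assumes a cell-means design (one coefficient per group), the contrast $\mathbf{C} = (-1, 1)^{T}$ and an identity weight matrix; the function name `fit_two_group` and the log-intensities are made up for illustration.

```python
import math


def fit_two_group(y, groups):
    """Fit E[y] = X alpha for a cell-means design with two groups (labels 0 and 1)."""
    by_group = [[v for v, g in zip(y, groups) if g == k] for k in (0, 1)]
    # Least squares coefficients of a cell-means design are the group means.
    alpha = [sum(vals) / len(vals) for vals in by_group]
    # Residual variance s^2 with f = n - 2 degrees of freedom.
    rss = sum((v - alpha[g]) ** 2 for v, g in zip(y, groups))
    f = len(y) - 2
    s2 = rss / f
    # Contrast C = (-1, 1)^T: beta = alpha_1 - alpha_0.
    beta = alpha[1] - alpha[0]
    # With V the identity, X^T X = diag(n0, n1), so C^T (X^T X)^{-1} C = 1/n0 + 1/n1.
    u = math.sqrt(1 / len(by_group[0]) + 1 / len(by_group[1]))
    return beta, u, s2, f


# Hypothetical log-intensities for one gene: three control and three treated arrays.
y = [7.0, 7.2, 6.8, 8.1, 7.9, 8.0]
groups = [0, 0, 0, 1, 1, 1]
beta, u, s2, f = fit_two_group(y, groups)
print(beta, u, s2, f)
```

The standard error of the estimated contrast is the unscaled standard deviation `u` times the square root of `s2`.
---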
### Microarray Analysis
1. Ordinary t-statistic for kth contrast and gene $j$: $t_{jk} = \hat{\beta}\_{jk}/(u_{jk} s_{j})$
2. Empirical Bayes method assumes inverse Chi-square prior for the $\sigma_{j}^2$ with mean $s_{0}^2$ and degrees of freedom $f_{0}$
3. Posterior values for residual variances given by
$\tilde{s}\_{j}^2 = \frac{f_{0} s_{0}^{2} + f_{j} s_{j}^{2}}{f_{0} + f_{j}}$
---
### Microarray Analysis
1. Moderated t-statistic: $\tilde{t}\_{jk} = \frac{\hat{\beta}\_{jk}}{u_{jk} \tilde{s}\_{j}}$
2. Follows t-distribution with $f_{0} + f_{j}$ degrees of freedom under the null hypothesis $\beta_{jk} = 0$
3. Extra degree of freedom $f_{0}$ represents borrowed information from ensemble of genes for each gene's inference
---
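### Microarray Analysis
The effect of variance shrinkage on the t-statistic can be seen with a small numeric sketch. The prior values $s_{0}^2$ and $f_{0}$ are assumed to be given here (limma estimates them from all genes), and the function name `moderated_t` and the numbers are hypothetical.

```python
import math


def moderated_t(beta_hat, u, s2, f, s0_2, f0):
    """Return ordinary t, moderated t and the moderated degrees of freedom."""
    # Posterior variance: weighted average of prior and sample variances.
    s2_post = (f0 * s0_2 + f * s2) / (f0 + f)
    t_ord = beta_hat / (u * math.sqrt(s2))
    t_mod = beta_hat / (u * math.sqrt(s2_post))
    return t_ord, t_mod, f0 + f


# A gene whose sample variance is small by chance (made-up values).
t_ord, t_mod, df = moderated_t(beta_hat=1.0, u=0.5, s2=0.01, f=4, s0_2=0.04, f0=4)
print(t_ord, t_mod, df)
```

Because the sample variance (0.01) is pulled toward the prior value (0.04), the moderated statistic is smaller than the ordinary one, while the degrees of freedom rise from 4 to 8.
---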
### Microarray vs RNA-SEQ
<img src="images/diff_exp/ma_rna-seq_signal_scatter.png" width="900">
---
### Microarray vs RNA-SEQ
<img src="images/diff_exp/ma_rna-seq_fc_scatter.png" width="900">
---
### Microarray vs RNA-SEQ
<img src="images/diff_exp/ma_rna-seq_venn.png" width="900">
---
### RNA-SEQ Analysis
1. Quality Control: FASTQC
2. Splice-Aware Alignment: HISAT, STAR
3. Estimate Expression Level: featureCounts in Rsubread, StringTie, Salmon, Sailfish, kallisto, RSEM
4. Normalize Across Samples: DESeq2 Normalization, Quantile Normalization, Scaling
5. Perform Differential Expression Analysis: DESeq2, edgeR
6. Perform Enriched Pathway and Transcription Factor Analysis: MSigDB, GSEA, String
---
### RNA-SEQ Analysis
<img src="images/diff_exp/rna-seq_workflow.jpg.webp" width="350">
---
### RNA-SEQ Analysis
<img src="images/diff_exp/hisat.png" width="700">
---
### RNA-SEQ Analysis
<img src="images/diff_exp/transcript_assembly.jpg.webp" width="700">
---
### RNA-SEQ Analysis
<img src="images/diff_exp/merge_transcripts.png" width="900">
---
### Poisson Distribution
The sum of $n$ independent Bernoulli random variables, $X_{i}$, each equal to 1 with probability $p$ and 0 with probability $1-p$,
$Y = \sum\_{i=1}^{n} X\_{i}$
is distributed as a binomial distribution with
$\mu = np$ and $\sigma^2 = np(1-p)$.
For $p \rightarrow 0$ and $n \rightarrow \infty$ such that $np = \lambda$,
$Y$ approaches a Poisson distribution.
---
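### Poisson Distribution
The limit can be checked numerically by holding $np = \lambda$ fixed and increasing $n$. This is a small sketch using only the standard library; the helper name `max_abs_diff` and the choice $\lambda = 2$ are arbitrary.

```python
import math


def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)


def max_abs_diff(n, lam, k_max=10):
    """Largest pointwise gap between Binomial(n, lam/n) and Poisson(lam) for k <= k_max."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(k_max + 1))


for n in (10, 100, 10000):
    print(n, max_abs_diff(n, 2.0))
```

The gap shrinks as $n$ grows, which is why the Poisson distribution is a convenient stand-in when $n$ is large and $p$ is small.
---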
### Poisson Distribution
Assume $Y$ is the number of reads mapping to a window in the genome coming from low coverage sequencing of a genome.
If the empirical probability of a read mapping to a specific location is $p \ll 1$,
and the number of bases in the window $n \sim 1000$,
the Poisson distribution is an excellent approximation of the distribution of $Y$.
---
### Poisson Distribution
<img src="images/diff_exp/genomic_read_count_dist.png" width="500">
---
### Poisson Distribution
<img src="images/diff_exp/poisson_dist.png" width="500">
---
### Negative Binomial Distribution
The negative binomial distribution is a mixture of a Poisson and a gamma distribution where the
Poisson distribution is $p_{P}(k|\lambda) = \frac{\lambda^{k}}{k!} e^{-\lambda}$
and the gamma distribution is $g(\lambda) = \frac{\lambda^{r-1} \beta^{r} e^{-\beta \lambda}}{\Gamma(r)}$
$$\begin{aligned}
p_{NB}(k) & = \int_{0}^{\infty} p_{P}(k|\lambda) g(\lambda) d \lambda \\\\\\
& = \frac{\Gamma(r+k)}{k! \Gamma(r)} p^{k} (1-p)^{r} \\\\\\
\end{aligned}$$
---
### Negative Binomial Distribution
Where we have used $\beta = (1-p)/p$. The mean and variance of a random variable $K \sim NB(\mu, \alpha)$ are
$E(K) = \mu$ and $Var(K) = \mu + \alpha \mu^2$
where the variance has a Poisson/"shot noise" term $\mu$ and an overdispersion/"biological variability" term $\alpha \mu^2$
---
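### Negative Binomial Distribution
The mean and variance can be verified numerically. The sketch assumes the mapping $r = 1/\alpha$ and $p = \mu/(\mu + r)$ between the two parameterizations, which follows from matching the mean $rp/(1-p)$ and variance $rp/(1-p)^2$ of the pmf above; the values of $\mu$ and $\alpha$ are arbitrary.

```python
import math


def nb_pmf(k, mu, alpha):
    """Negative binomial pmf with mean mu and dispersion alpha."""
    r = 1.0 / alpha      # assumed mapping: r = 1 / alpha
    p = mu / (mu + r)    # assumed mapping: p = mu / (mu + r)
    log_pmf = (math.lgamma(r + k) - math.lgamma(k + 1) - math.lgamma(r)
               + k * math.log(p) + r * math.log(1 - p))
    return math.exp(log_pmf)


mu, alpha = 10.0, 0.2
ks = range(2000)  # truncation point; the tail beyond it is negligible here
probs = [nb_pmf(k, mu, alpha) for k in ks]
total = sum(probs)
mean = sum(k * q for k, q in zip(ks, probs))
var = sum((k - mean) ** 2 * q for k, q in zip(ks, probs))
print(total, mean, var)  # var should match mu + alpha * mu**2
```

For these values the variance is three times the mean, whereas a Poisson distribution with the same mean would have variance equal to the mean.
---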
### RNA-SEQ Analysis
1. DESeq2: Assumes read count $K_{ij}$ for gene $i$ in sample $j$ is described by a generalized linear model
2. $K_{ij} \sim NB(\mu_{ij}, \alpha_{i})$ (Negative Binomial with mean $\mu_{ij}$ and dispersion $\alpha_{i}$)
3. $Var(K_{ij}) = \mu_{ij} + \alpha_{i} \mu_{ij}^2$
4. $\mu_{ij} = s_{ij} q_{ij}$ where $q_{ij}$ is proportional to gene $i$'s concentration of cDNA fragments in sample $j$ and $s_{ij}$ is a size or normalization factor
---
### RNA-SEQ Analysis
1. Size factor accounts for differences in sequencing depth in a robust manner
2. Motivation: if gene $i$ is not differentially expressed between samples $j$ and $j^{'}$, then $E(K\_{ij})/E(K\_{ij^{'}}) = s\_{j}/s\_{j^{'}}$
3. Generalize this to multiple samples
4. Define pseudo-reference: $K_{i}^R = (\prod_{j=1}^{m} K_{ij})^{1/m}$
5. $s_{ij} = s_{j} = \textrm{median}\_{i} \frac{K_{ij}}{K_{i}^{R}}$
---
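### RNA-SEQ Analysis
The median-of-ratios size factor can be sketched in a few lines. In this illustration, genes with a zero count in any sample are skipped because their geometric mean is zero, and the median is taken on the log scale; both choices, the function name `size_factors` and the count table are assumptions of the sketch and not a copy of the DESeq2 code.

```python
import math
from statistics import median


def size_factors(counts):
    """counts[i][j]: read count for gene i in sample j. Returns one factor per sample."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # geometric mean would be zero; skip this gene
        # Log of the pseudo-reference (geometric mean across samples).
        log_ref = sum(math.log(k) for k in row) / n_samples
        for j, k in enumerate(row):
            log_ratios[j].append(math.log(k) - log_ref)
    # Median ratio per sample, computed on the log scale.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical counts: sample 2 is sequenced twice as deeply; gene 4 is differentially expressed.
counts = [[10, 20], [100, 200], [50, 100], [30, 600], [0, 5]]
s = size_factors(counts)
print(s, s[1] / s[0])
```

The differentially expressed gene does not affect the result because the median ignores its extreme ratio, which is the robustness the slide refers to.
---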
### RNA-SEQ Analysis
1. Fit generalized linear model to normalized counts
2. $\log_{2} q_{ij} = \sum_{r} x_{jr} \beta_{ir}$ where $x_{jr}$ are elements of the design matrix
3. Estimate $\alpha_{i}$ using shared information across genes assuming that genes with similar average expression have similar dispersion
4. Estimate each gene's dispersion using maximum likelihood
5. Fit a smooth curve of the gene-wise dispersion estimates versus mean normalized count; the curve gives the mean of the prior distribution
---
### RNA-SEQ Analysis
6. Shrink estimates of gene-wise dispersion toward values predicted by the curve using empirical Bayes approach where size of shrinkage depends on
1. estimate of how close true dispersion is to the fit
2. degrees of freedom
7. Final estimate of $\alpha_{i}$ is given by maximum *a posteriori* (MAP) estimate
8. Use gene-wise estimate instead of shrunken estimate if it is more than 2 residual standard deviations above the fitted curve
---
### RNA-SEQ Analysis
<img src="images/diff_exp/DESeq2_Fig1.jpg.webp" width="900">
---
### RNA-SEQ Analysis
9. Address the high variance of log fold change (LFC) estimates for genes with low read counts by shrinking LFC estimates toward zero; shrinkage is stronger when less information is available for a gene (low counts, high dispersion or few degrees of freedom).
---
### RNA-SEQ Analysis
10. Employ empirical Bayes procedure:
1. Perform GLM fits to obtain maximum likelihood estimates (MLEs) for the LFCs
2. Fit a zero-centered normal distribution to the observed distribution of MLEs over all genes
3. This distribution is used as prior on LFCs in second round of GLM fits
11. Final estimate of each LFC is given by MAP estimate
12. A standard error for each LFC estimate is derived from the posterior's curvature at its maximum
---
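### RNA-SEQ Analysis
The direction of the shrinkage can be illustrated with a normal approximation. This is not the second-round GLM fit described above: the sketch assumes the likelihood of each LFC is approximately normal around its MLE with a known standard error, so that the MAP estimate under a zero-centered normal prior has a closed form. The function name `shrink_lfc` and all numbers are hypothetical.

```python
import math


def shrink_lfc(mle, se, prior_sd):
    """MAP estimate and posterior sd for a normal likelihood and N(0, prior_sd^2) prior."""
    # The MAP estimate is the MLE scaled by the share of precision coming from the data.
    weight = prior_sd**2 / (prior_sd**2 + se**2)
    # Posterior precision (curvature at the maximum) is the sum of the two precisions.
    post_sd = math.sqrt(1.0 / (1.0 / prior_sd**2 + 1.0 / se**2))
    return weight * mle, post_sd


# Same MLE, different amounts of information (made-up values).
low_info = shrink_lfc(mle=2.0, se=1.0, prior_sd=1.0)   # noisy, low-count gene
high_info = shrink_lfc(mle=2.0, se=0.1, prior_sd=1.0)  # well-measured gene
print(low_info, high_info)
```

The noisy gene is pulled halfway to zero, while the well-measured gene keeps almost all of its estimated fold change.
---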
### RNA-SEQ Analysis
<img src="images/diff_exp/DESeq2_Fig2.jpg.webp" width="500">
---
### RNA-SEQ Analysis
13. Assess significance using a Wald test:
1. the shrunken estimate of LFC is divided by its standard error, resulting in a z-statistic
2. which is used to calculate p-values from a standard normal distribution
14. Correct p-values for multiple hypothesis testing using Benjamini and Hochberg procedure
---
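### RNA-SEQ Analysis
The last two steps can be sketched with the standard library alone. The function names `wald_p_value` and `benjamini_hochberg` and the input values are made up for illustration; the p-value is two-sided.

```python
import math


def wald_p_value(lfc, se):
    """Two-sided p-value for z = lfc / se under a standard normal distribution."""
    z = lfc / se
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_single = wald_p_value(lfc=1.96, se=1.0)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p_single, adjusted)
```

Genes are then called differentially expressed when their adjusted p-value falls below a chosen false discovery rate threshold.
---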
### RNA-SEQ Analysis
<img src="images/diff_exp/msigdb.png" width="5000">
---
### RNA-SEQ Analysis
<img src="images/diff_exp/string.png" width="900">
---
### RNA-SEQ Analysis
<img src="images/diff_exp/GSEA.webp" width="900">
---
### RNA-SEQ Analysis
<img src="images/diff_exp/kegg.png" width="900">
---
### RNA-SEQ Analysis
<img src="images/diff_exp/novel_transcripts.png" width="900">
---
### RNA-SEQ Analysis
<img src="images/diff_exp/lncrna.png" width="900">
---