---
layout: reveal_markdown
title: "Differential Expression Analysis"
tags: slides 
date: 2022-01-22
---

### Transcriptome

<img src="images/diff_exp/transcriptome.png" width="900">

---

### Transcriptome

<img src="images/diff_exp/alternative_splicing.png" width="900">

---

### Gene Expression Analysis

1. Quality Control
2. Estimate Expression Level
3. Normalize Across Samples
4. Perform Differential Expression Analysis
5. Perform Enriched Pathway and Transcription Factor Analysis

---

### Microarray Analysis

1. Quality Control
2. Estimate Expression Level: RMA, GCRMA, dChip
3. Normalize Across Samples: Quantile Normalization, Scaling
4. Perform Differential Expression Analysis: limma
5. Perform Enriched Pathway and Transcription Factor Analysis
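
---

### Microarray Analysis

A minimal Python sketch of quantile normalization, one of the cross-sample normalization options listed on the previous slide. It uses only the standard library; the function name `quantile_normalize` and the toy intensities are illustrative assumptions, and ties are simply broken by position.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length sample columns."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / len(columns)
                 for k in range(n)]
    normalized = []
    for col in columns:
        # Rank each value within its own sample (ties broken by position).
        order = sorted(range(n), key=lambda idx: col[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data: three samples (columns), four probes each.
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
```

After normalization every sample has the same set of values; only their order across probes differs.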

---

### Microarray Analysis

<img src="images/diff_exp/crop_circles.png" width="800">

---

### Microarray Analysis

<img src="images/diff_exp/ma_pre-norm.png" width="900">

<img src="images/diff_exp/ma_post-norm.png" width="900">

---

### Microarray Analysis

<img src="images/diff_exp/probe_effects.png" width="900">

---

### Microarray Analysis

1. dChip: Analyzes multiple chips simultaneously  
2. For array $i$, probe $j$ and number of probe pairs $J$
3. $p_{ij} = PM_{ij} - MM_{ij} = \theta_{i} \phi_{j} + e_{ij}$
4. $\theta_{i}$: relative expression level; $\phi_{j}$: relative affinity
5. $\sum_{j=1}^{J} \phi_{j}^{2} = J$ (constraint)
6. $e_{ij} \sim N(0,\sigma^2)$: additive error model
7. Iteratively fit the model, excluding outlier arrays ($\theta_{i}$) and probes ($\phi_{j}$)
8. Effective expression estimate $\theta_{i} = \frac{1}{J}\sum_{j} p_{ij} \phi_{j}$ 
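
---

### Microarray Analysis

The Li-Wong fit on the previous slide can be sketched as alternating least squares: update $\theta$ with $\phi$ fixed, update $\phi$ with $\theta$ fixed, then rescale $\phi$ to satisfy the constraint. This is a simplified illustration, not dChip itself: the function name `fit_li_wong` and the toy data are assumptions, and the outlier exclusion step is omitted.

```python
import math


def fit_li_wong(p, n_iter=20):
    """Alternating least squares for p[i][j] ~ theta[i] * phi[j].

    Outlier exclusion, which dChip performs, is omitted in this sketch.
    """
    n_arrays, n_probes = len(p), len(p[0])
    phi = [1.0] * n_probes  # starting values satisfy sum(phi^2) = J
    for _ in range(n_iter):
        # Given phi, least-squares theta_i = (1/J) * sum_j p_ij * phi_j.
        theta = [sum(p[i][j] * phi[j] for j in range(n_probes)) / n_probes
                 for i in range(n_arrays)]
        # Given theta, least-squares phi_j = sum_i p_ij theta_i / sum_i theta_i^2.
        denom = sum(t * t for t in theta)
        phi = [sum(p[i][j] * theta[i] for i in range(n_arrays)) / denom
               for j in range(n_probes)]
        # Rescale so that sum_j phi_j^2 = J.
        scale = math.sqrt(n_probes / sum(f * f for f in phi))
        phi = [f * scale for f in phi]
    # Final expression estimates, consistent with the rescaled affinities.
    theta = [sum(p[i][j] * phi[j] for j in range(n_probes)) / n_probes
             for i in range(n_arrays)]
    return theta, phi


# Noise-free toy data built from known parameters (illustrative values).
true_theta = [100.0, 200.0, 400.0]
raw_phi = [1.0, 2.0, 3.0, 2.0]
norm = math.sqrt(len(raw_phi) / sum(f * f for f in raw_phi))
true_phi = [f * norm for f in raw_phi]
p = [[t * f for f in true_phi] for t in true_theta]

theta_hat, phi_hat = fit_li_wong(p)
```

On noise-free data the fit recovers the generating $\theta_{i}$ and $\phi_{j}$ up to floating-point error.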


---

### Microarray Analysis

1. Problems with dChip (Li-Wong model)
2. $\log(PM)$ and $\log(MM)$ tend to be normally distributed, so an additive error on the raw intensity scale fits poorly
3. $MM$ tends to capture a significant amount of the intended target: subtracting it lowers sensitivity
4. $MM$ introduces a second "noisy" intensity: increases variance

---

### Microarray Analysis

1. Robust Multiarray Average (RMA) (Li-Wong on log-scale; no MM):

$\log_{2}(PM_{ij}) = \log_{2}(\theta_{i}) + \log_{2}(\phi_{j}) + b + e_{ij}$

2. Estimate relative $\log_{2} \theta_{i}$ robustly: median polish
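
---

### Microarray Analysis

A minimal sketch of Tukey's median polish, the robust fitting step named on the previous slide. It decomposes a matrix of $\log_{2}(PM_{ij})$ values into an overall level, an array effect, a probe effect and residuals. The function name `median_polish` and the toy matrix are assumptions; in RMA the input would first be background corrected and quantile normalized.

```python
from statistics import median


def median_polish(x, n_iter=10):
    """Tukey median polish: x[i][j] = overall + row[i] + col[j] + resid[i][j]."""
    n_rows, n_cols = len(x), len(x[0])
    resid = [list(row) for row in x]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians out of the residuals into the row effects.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column medians out of the residuals into the column effects.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


# Rows are arrays i, columns are probes j (toy log2 PM values).
log2_pm = [
    [8.0, 9.0, 7.0, 8.5],
    [9.0, 10.0, 8.0, 9.5],
    [11.0, 12.0, 10.0, 11.5],
]
overall, array_eff, probe_eff, resid = median_polish(log2_pm)
# Log2 expression estimate per array: overall level plus array effect.
expression = [overall + a for a in array_eff]
```

Because medians rather than means are swept out, a single aberrant probe has little influence on the array-level estimate.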

---

### Microarray Analysis

<img src="images/diff_exp/rma_background_model.png" width="1000">

---

### Microarray Analysis

1. GCRMA
2. Background depends strongly on probe sequence, so it is modeled as a function of the probe sequence ($S_{ij}$)
3. RMA with sequence-dependent background correction

$\log_{2}(PM_{ij}) = \log_{2}(\theta_{i}) + \log_{2}(\phi_{j}) + b(S_{ij},MM_{ij}) + e_{ij}$      

---

### Microarray Analysis

1. Linear Models for Microarray Data (limma)
2. Assume linear model: $E[\mathbf{y}\_{j}] = \mathbf{X} \alpha\_{j}$
3. $\mathbf{y}_{j}$ expression values for gene $j$
4. $\mathbf{X}$ is the design matrix
5. $\alpha_{j}$ is vector of coefficients
6. $\mathbf{y}^{T}_{j}$: jth row of expression matrix (log-intensities)
7. Contrasts: $\beta_{j} = \mathbf{C}^{T} \alpha_{j}$ ($\mathbf{C}$: contrast matrix)
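
---

### Microarray Analysis

A small sketch of the linear model on the previous slide for the simplest case: a two-group design in the group-means parameterization, where $\mathbf{X}$ has one indicator column per group and the contrast is $\mathbf{C} = (-1, 1)^{T}$. This is not limma's implementation; the function name `fit_two_group` and the toy log-intensities are assumptions.

```python
import math
from statistics import mean


def fit_two_group(y, groups):
    """Least-squares fit of a group-means design for one gene.

    Returns (alpha_hat, beta_hat, u, s2, df) for the contrast
    "group 1 minus group 0".
    """
    g0 = [v for v, g in zip(y, groups) if g == 0]
    g1 = [v for v, g in zip(y, groups) if g == 1]
    # With indicator columns, the coefficient estimates are the group means.
    alpha_hat = (mean(g0), mean(g1))
    # Contrast C = (-1, 1)^T picks out the difference of group means.
    beta_hat = alpha_hat[1] - alpha_hat[0]
    # Unscaled standard deviation: sqrt of C^T (X^T X)^{-1} C.
    u = math.sqrt(1.0 / len(g0) + 1.0 / len(g1))
    # Residual variance s^2 on (n - 2) degrees of freedom.
    rss = sum((v - alpha_hat[g]) ** 2 for v, g in zip(y, groups))
    df = len(y) - 2
    return alpha_hat, beta_hat, u, rss / df, df


# Toy log-intensities for one gene: three control and three treated arrays.
y = [7.0, 7.2, 6.8, 8.1, 7.9, 8.0]
groups = [0, 0, 0, 1, 1, 1]
alpha_hat, beta_hat, u, s2, df = fit_two_group(y, groups)
```

The returned `beta_hat`, `u`, `s2` and `df` are the per-gene quantities that enter the ordinary and moderated t-statistics.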

---

### Microarray Analysis

1. Significance analysis: moderated t-statistic
2. Borrows information from the ensemble of genes
3. Same form as the ordinary t-statistic, but with
   1. standard errors shrunk toward a common value
   2. increased degrees of freedom (reflecting the greater reliability of the smoothed standard errors)

---

### Microarray Analysis

1. Linear model for gene $j$ has residual variance $\sigma_{j}^2$ with sample value $s_{j}^2$ and degrees of freedom $f_{j}$
2. Covariance matrix of estimated $\hat{\beta}\_{j}$ is $\sigma\_{j}^2 \mathbf{C}^T(\mathbf{X}^{T} \mathbf{V}\_{j} \mathbf{X})^{-1} \mathbf{C}$
3. $\mathbf{V}\_{j}$ is a weight matrix: prior weights, covariance terms introduced by correlation structure, and iterative weights introduced by robust estimation
4. Unscaled standard deviation ($u_{jk}$): square roots of diagonal elements of $\mathbf{C}^T(\mathbf{X}^{T} \mathbf{V}\_{j} \mathbf{X})^{-1} \mathbf{C}$ 

---

### Microarray Analysis  

1. Ordinary t-statistic for kth contrast and gene $j$: $t_{jk} = \hat{\beta}\_{jk}/(u_{jk} s_{j})$
2. Empirical Bayes method assumes a scaled inverse chi-square prior for $\sigma_{j}^2$ with prior estimate $s_{0}^2$ and degrees of freedom $f_{0}$
3. Posterior values for the residual variances are given by

$\tilde{s}\_{j}^2 = \frac{f_{0} s_{0}^{2} + f_{j} s_{j}^{2}}{f_{0} + f_{j}}$

---

### Microarray Analysis 

1. Moderated t-statistic: $\tilde{t}\_{jk} = \frac{\hat{\beta}\_{jk}}{u_{jk} \tilde{s}\_{j}}$
2. Follows a t-distribution with $f_{0} + f_{j}$ degrees of freedom under the null hypothesis $\beta_{jk} = 0$
3. Extra degrees of freedom $f_{0}$ represent information borrowed from the ensemble of genes for each gene's inference
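
---

### Microarray Analysis

A short numeric sketch of the variance shrinkage and moderated t-statistic defined above. limma estimates the prior values $s_{0}^2$ and $f_{0}$ from the data across all genes; here they are placeholder inputs, and the function name `moderated_t` and all numbers are assumptions for illustration.

```python
import math


def moderated_t(beta_hat, u, s2, f, s0_sq, f0):
    """Return (moderated t, posterior variance, total degrees of freedom)."""
    # Posterior variance: degrees-of-freedom weighted average of prior and sample.
    s2_tilde = (f0 * s0_sq + f * s2) / (f0 + f)
    t_mod = beta_hat / (u * math.sqrt(s2_tilde))
    return t_mod, s2_tilde, f0 + f


# One gene: estimated contrast, unscaled sd, sample variance, residual df.
beta_hat, u, s2, f = 1.0, math.sqrt(2.0 / 3.0), 0.025, 4
# Prior variance and prior degrees of freedom (assumed, not estimated here).
s0_sq, f0 = 0.05, 4.0

t_ord = beta_hat / (u * math.sqrt(s2))
t_mod, s2_tilde, df_total = moderated_t(beta_hat, u, s2, f, s0_sq, f0)
```

In this example the sample variance is smaller than the prior value, so shrinkage raises the variance and the moderated statistic is less extreme than the ordinary one, while the reference distribution gains $f_{0}$ degrees of freedom.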

---

### Microarray vs RNA-SEQ

<img src="images/diff_exp/ma_rna-seq_signal_scatter.png" width="900">

---

### Microarray vs RNA-SEQ

<img src="images/diff_exp/ma_rna-seq_fc_scatter.png" width="900">

---

### Microarray vs RNA-SEQ

<img src="images/diff_exp/ma_rna-seq_venn.png" width="900">

---

### RNA-SEQ Analysis

1. Quality Control: FastQC
2. Splice-Aware Alignment: HISAT, STAR
3. Estimate Expression Level: featureCounts in Rsubread, StringTie, Salmon, Sailfish, kallisto, RSEM  
4. Normalize Across Samples: DESeq2 Normalization, Quantile Normalization, Scaling
5. Perform Differential Expression Analysis: DESeq2, edgeR
6. Perform Enriched Pathway and Transcription Factor Analysis: MSigDB, GSEA, STRING

---

### RNA-SEQ Analysis

<img src="images/diff_exp/rna-seq_workflow.jpg.webp" width="350">

---

### RNA-SEQ Analysis

<img src="images/diff_exp/hisat.png" width="700">

---

### RNA-SEQ Analysis

<img src="images/diff_exp/transcript_assembly.jpg.webp" width="700">

---

### RNA-SEQ Analysis

<img src="images/diff_exp/merge_transcripts.png" width="900">

---

### Poisson Distribution

The sum of $n$ independent Bernoulli random variables $X_{i}$, each equal to 1 with probability $p$ and 0 with probability $1-p$,

$Y = \sum\_{i=1}^{n} X\_{i}$

follows a binomial distribution with

$\mu = np$ and $\sigma^2 = np(1-p)$.  

In the limit $p \rightarrow 0$ and $n \rightarrow \infty$ with $np = \lambda$ held fixed,

$Y$ approaches a Poisson distribution with mean and variance both equal to $\lambda$.
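
---

### Poisson Distribution

The limit on the previous slide can be checked numerically. The sketch below compares the binomial and Poisson probability mass functions for increasing $n$ with $np = \lambda$ held fixed; the value $\lambda = 3$ and the helper names are arbitrary choices for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(Y = k) for Y ~ Poisson(lam)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 3.0


def max_gap(n):
    """Largest pointwise difference between the two pmfs over k = 0..10."""
    return max(abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
               for k in range(11))


# The gap shrinks as n grows and p = lam / n shrinks.
gaps = [max_gap(n) for n in (10, 100, 1000, 10000)]
```

Printing `gaps` shows the discrepancy falling roughly in proportion to $1/n$.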

---

### Poisson Distribution

Assume $Y$ is the number of reads mapping to a window in the genome coming from low coverage sequencing of a genome.  

If the empirical probability of a read mapping to a specific location is $p \ll 1$, 

and the number of bases in the window $n \sim 1000$, 

the Poisson distribution is an excellent approximation of the distribution of $Y$.   

---

### Poisson Distribution

<img src="images/diff_exp/genomic_read_count_dist.png" width="500"> 
	
---

### Poisson Distribution

<img src="images/diff_exp/poisson_dist.png" width="500"> 

---

### Negative Binomial Distribution

The negative binomial distribution is a gamma-Poisson mixture, a Poisson count whose rate $\lambda$ is itself gamma distributed, where the

Poisson distribution is $p_{P}(k|\lambda) = \frac{\lambda^{k}}{k!} e^{-\lambda}$

and the gamma distribution is $g(\lambda) = \frac{\lambda^{r-1} \beta^{r} e^{-\beta \lambda}}{\Gamma(r)}$

$$\begin{aligned}
p_{NB}(k) & = \int_{0}^{\infty} p_{P}(k|\lambda) g(\lambda) d \lambda \\\\\\ 
& = \frac{\Gamma(r+k)}{k! \Gamma(r)} p^{k} (1-p)^{r} \\\\\\
\end{aligned}$$

---

### Negative Binomial Distribution

Where we have used $\beta = (1-p)/p$. Setting $\mu = rp/(1-p)$ and $\alpha = 1/r$, the mean and variance of a random variable $K \sim NB(\mu, \alpha)$ are

$E(K) = \mu$ and $Var(K) = \mu + \alpha \mu^2$

where the variance has a Poisson/"shot noise" term $\mu$ and an overdispersion/"biological variability" term $\alpha \mu^2$.
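
---

### Negative Binomial Distribution

A numeric check of the mean and variance above, using the probability mass function from the previous slide. The mapping $\mu = rp/(1-p)$ and $\alpha = 1/r$ follows from the gamma mean $r/\beta$ and variance $r/\beta^{2}$ with $\beta = (1-p)/p$; the values of $r$ and $p$ are arbitrary.

```python
import math


def nb_pmf(k, r, p):
    """Gamma(r + k) / (k! Gamma(r)) * p^k * (1 - p)^r, computed on the log scale."""
    log_coef = math.lgamma(r + k) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coef + k * math.log(p) + r * math.log(1.0 - p))


r, p = 2.5, 0.8
mu = r * p / (1.0 - p)   # mean of the gamma-distributed rate
alpha = 1.0 / r          # dispersion

# Sum far enough into the tail that the truncated mass is negligible.
ks = range(2000)
total = sum(nb_pmf(k, r, p) for k in ks)
nb_mean = sum(k * nb_pmf(k, r, p) for k in ks)
nb_var = sum((k - nb_mean) ** 2 * nb_pmf(k, r, p) for k in ks)
expected_var = mu + alpha * mu ** 2
```

Here `nb_mean` agrees with `mu` and `nb_var` with `expected_var`, which exceeds the Poisson variance `mu` by the overdispersion term.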

---

### RNA-SEQ Analysis

1. DESeq2: Assumes read count $K_{ij}$ for gene $i$ in sample $j$ is described by a generalized linear model 
2. $K_{ij} \sim NB(\mu_{ij}, \alpha_{i})$ (negative binomial with mean $\mu_{ij}$ and dispersion $\alpha_{i}$)
3. $Var(K_{ij}) = \mu_{ij} + \alpha_{i} \mu_{ij}^2$
4. $\mu_{ij} = s_{ij} q_{ij}$ where $q_{ij}$ is proportional to the concentration of cDNA fragments from gene $i$ in sample $j$ and $s_{ij}$ is a size (normalization) factor

---

### RNA-SEQ Analysis

1. Size factor accounts for differences in sequencing depth in a robust manner
2. Motivation: if gene $i$ is not differentially expressed between samples $j$ and $j^{'}$, then $E(K\_{ij})/E(K\_{ij^{'}}) = s\_{j}/s\_{j^{'}}$
3. Generalize this to multiple samples
4. Define pseudo-reference: $K_{i}^R = (\prod_{j=1}^{m} K_{ij})^{1/m}$
5. $s_{ij} = s_{j} = \textrm{median}\_{i} \frac{K_{ij}}{K_{i}^{R}}$
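
---

### RNA-SEQ Analysis

A minimal sketch of the median-of-ratios size factors defined on the previous slide. The function name `size_factors` and the toy counts are assumptions; skipping genes with a zero count in any sample (whose geometric mean would be zero) is a choice made for this sketch.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors; counts[i][j] is gene i in sample j."""
    n_samples = len(counts[0])
    usable, log_ref = [], []
    for row in counts:
        # Skip genes with a zero count: their geometric mean is zero.
        if all(c > 0 for c in row):
            usable.append(row)
            # Log of the geometric mean across samples (the pseudo-reference).
            log_ref.append(sum(math.log(c) for c in row) / n_samples)
    return [
        median(row[j] / math.exp(lr) for row, lr in zip(usable, log_ref))
        for j in range(n_samples)
    ]


# Toy counts: sample 2 is sequenced twice as deeply as sample 1.
# The fourth gene is strongly differentially expressed; the fifth has a zero.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [30, 600],
    [0, 4],
]
s = size_factors(counts)
depth_ratio = s[1] / s[0]
```

Taking the median over genes keeps the strongly differentially expressed gene from distorting the estimate, so `depth_ratio` reflects the twofold difference in depth.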

---

### RNA-SEQ Analysis

1. Fit a generalized linear model to the counts, with the size factors $s_{ij}$ accounting for normalization
2. $\log_{2} q_{ij} = \sum_{r} x_{jr} \beta_{ir}$ where $x_{jr}$ are elements of the design matrix
3. Estimate $\alpha_{i}$ using shared information across genes, assuming that genes with similar average expression have similar dispersion
4. Estimate each gene's dispersion using maximum likelihood
5. Fit a smooth curve to the gene-wise dispersion estimates as a function of mean expression; the fitted value serves as the mean of the prior distribution

---

### RNA-SEQ Analysis

6. Shrink estimates of gene-wise dispersion toward values predicted by the curve using empirical Bayes approach where size of shrinkage depends on
   1. estimate of how close true dispersion is to the fit
   2. degrees of freedom
7. Final estimate of $\alpha_{i}$ is given by the maximum *a posteriori* (MAP) estimate
8. Keep the gene-wise estimate instead of the shrunken one if it lies more than 2 residual standard deviations above the fitted curve

---

### RNA-SEQ Analysis

<img src="images/diff_exp/DESeq2_Fig1.jpg.webp" width="900"> 

---

### RNA-SEQ Analysis

9. Address the high variance of log fold change (LFC) estimates for genes with low read counts by shrinking LFC estimates toward zero; shrinkage is stronger when less information is available for a gene (low counts, high dispersion, or few degrees of freedom).

---

### RNA-SEQ Analysis

10. Employ empirical Bayes procedure:
    1. Perform GLM fits to obtain maximum likelihood estimates (MLEs) for the LFCs
    2. Fit a zero-centered normal distribution to the observed distribution of MLEs over all genes
    3. This distribution is used as prior on LFCs in second round of GLM fits
11. Final estimate of each LFC is given by MAP estimate
12. A standard error for each LFC estimate is derived from the posterior's curvature at its maximum
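
---

### RNA-SEQ Analysis

The effect of a zero-centered normal prior can be illustrated with a deliberately simplified model. The sketch below replaces the negative binomial GLM likelihood with a normal approximation around the MLE, so the MAP estimate has a closed form. It is not DESeq2's procedure; the function name `shrink_lfc` and all numbers are assumptions.

```python
import math


def shrink_lfc(mle, se, prior_sd):
    """MAP estimate and posterior sd under a normal approximation.

    Likelihood: N(mle, se^2) in the LFC; prior: N(0, prior_sd^2).
    """
    # Weight on the data: close to 1 when the MLE is precise relative to the prior.
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    # Posterior precision is the sum of prior and likelihood precisions.
    post_sd = math.sqrt(1.0 / (1.0 / prior_sd ** 2 + 1.0 / se ** 2))
    return weight * mle, post_sd


# Same MLE, different amounts of information (illustrative numbers).
well_measured = shrink_lfc(mle=2.0, se=0.1, prior_sd=1.0)
poorly_measured = shrink_lfc(mle=2.0, se=1.0, prior_sd=1.0)
```

The well-measured gene keeps an estimate near 2, while the poorly measured gene is pulled halfway to zero, which matches the qualitative behavior described for low-information genes.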

---

### RNA-SEQ Analysis

<img src="images/diff_exp/DESeq2_Fig2.jpg.webp" width="500"> 

---

### RNA-SEQ Analysis

13. Assess significance using a Wald test:
    1. Divide the shrunken LFC estimate by its standard error to obtain a z-statistic
    2. Compare the z-statistic to a standard normal distribution to obtain a p-value
14. Correct p-values for multiple hypothesis testing using the Benjamini-Hochberg procedure
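
---

### RNA-SEQ Analysis

A minimal sketch of the two steps above: a two-sided Wald p-value from a standard normal distribution, followed by Benjamini-Hochberg adjustment. The function names and the example p-values are assumptions for illustration.

```python
import math


def wald_p_value(lfc, se):
    """Two-sided p-value for z = lfc / se under a standard normal."""
    z = lfc / se
    # P(|Z| > |z|) = erfc(|z| / sqrt(2)) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_example = wald_p_value(lfc=1.96, se=1.0)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Genes are then typically called significant when their adjusted p-value falls below a chosen false discovery rate threshold.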

---

### RNA-SEQ Analysis

<img src="images/diff_exp/msigdb.png" width="5000">

---

### RNA-SEQ Analysis

<img src="images/diff_exp/string.png" width="900">

---

### RNA-SEQ Analysis

<img src="images/diff_exp/GSEA.webp" width="900">

---

### RNA-SEQ Analysis

<img src="images/diff_exp/kegg.png" width="900">

---

### RNA-SEQ Analysis

<img src="images/diff_exp/novel_transcripts.png" width="900">

---

### RNA-SEQ Analysis

<img src="images/diff_exp/lncrna.png" width="900">

---