-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy patha2_child.Rmd
253 lines (200 loc) · 8.53 KB
/
a2_child.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
---
title: "Assignment 2: Differential Gene expression and Preliminary ORA"
author: "Priyanka Narasimhan"
output: html_document
---
```{r setup2, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, cache = TRUE, fig.asp = 1)
```
#### Differential Gene Expression
#### 1 - Load our clean, normalized data from A1.
```{r, include=FALSE}
packages = c("kableExtra", "ComplexHeatmap", "circlize")
bioconductor_packages = c("BiocManager", "Biobase", "ComplexHeatmap", "limma", "edgeR")
for (package in packages){
install.packages(package)
}
for(package in bioconductor_packages){
BiocManager::install(package, update = FALSE)
}
# Load the libraries
library(kableExtra)
library(limma)
library(ComplexHeatmap)
library(edgeR)
library(Biobase)
```
Let us also remove all unmapped rows from A1. Here are the dimensions of the new data set.
```{r}
normalized_count_data <- read.csv("counts_filtered.csv")
normalized_count_data <- normalized_count_data[!is.na(normalized_count_data$hgnc_symbol),]
dim(normalized_count_data) # returns [1] 13608 19
```
We are given 19 columns (consisting of 2 columns with the Ensembl genes and the HUGO symbols, and 17 samples).
```{r}
cols <- normalized_count_data[,3:19]
colnames(cols)
```
We have 6 different groups(samples) to compare, but that may be too many at once. Since this experiment wants to prove there is an additional effect of ELF1 genes outside of interferons, let us choose the following 4 samples: empty_IFN48, empty_IFN6, ELF1_mock, R8A_mock <br />
where: <br />
empty_IFN48 & empty_IFN6 are the interferon samples
ELF1_mock & R8A_mock are the ELF1 samples
```{r}
chosen_samples <- normalized_count_data[,3:19]
chosen_samples <- chosen_samples[, c(4, 5, 6, 16, 17, 7, 8, 9, 10, 11, 12)]
chosen_samples
```
*Groups re-defined*
```{r}
samples <- data.frame(lapply(colnames(chosen_samples)[1:11],
FUN = function(x) {
unlist(strsplit(x, split="_"))[c(1, 3, 2)]
}))
colnames(samples) <- colnames(chosen_samples)[1:11]
rownames(samples) <- c("Treatment", "trial_num", "mock_or_IFN")
samples <- data.frame(t(samples))
samples$Test_run = paste(samples$Treatment,samples$mock_or_IFN, sep = "_")
samples
```
*Create MDS Plot:*
I was unable to create an MDS plot on the desirable sample groups in A1, so here's one now!
```{r}
filtered_data_matrix <- as.matrix(chosen_samples)
rownames(filtered_data_matrix) <- normalized_count_data$ensembl_gene_id
#filtered_data_matrix
d = DGEList(counts=filtered_data_matrix, group=samples$Test_run)
limma::plotMDS(d, col = c("darkgreen","blue", "cyan", "green")[factor(samples$Test_run)],
main = "ELF1 vs Interferons Sample groups after TMM normalization")
```
The dissimilarity shown between the 2 groups with ELF1 gene and with interferons is clear. Hence the appropriate factor to use would be Treatment: ELF1 vs Interferons. We can look for any significant differences between upregulated and downregulated genes between ELF1 tests and stimulation after using interferons.
Let us classify the samples that were transduced with ELF1 and the samples that were stimulated under interferon-beta.
Using the above MDS plot, and chosen sample groups of interest, we have decided to classify by treatment types, and we can create our model.
*2 - Define model design and factors*
```{r}
sample_groups <- samples[, c(2, 4)]
sample_groups
model_design <- model.matrix(~ sample_groups$trial_num)
#kable(model_design, type="html")
expressionMatrix <- as.matrix(chosen_samples)
rownames(expressionMatrix) <- normalized_count_data$ensembl_gene_id
colnames(expressionMatrix) <- colnames(chosen_samples)
minimalSet <- ExpressionSet(assayData=expressionMatrix)
dim(minimalSet)
#Fit our data to the above model
fit <- lmFit(minimalSet, model_design)
```
Fit our data to the above model (Referred to lecture 5 slides, slides 4-6):
```{r}
fit2 <- eBayes(fit,trend=TRUE)
topfit <- topTable(fit2,
coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(normalized_count_data[,1:2],
topfit,
by.y=0,by.x=1,
all.y=TRUE)
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
kable(output_hits[1:10,],type="html")
# Referred to lecture 5 slides, slides 21-23
```
<br />
I chose to keep the significance threshold as the default 0.05, and from this the number of genes that did not pass the threshold are: <br />
Genes that pass the threshold:
```{r}
length(which(output_hits$P.Value < 0.05)) # returns 108
```
Genes that pass adjusted p value:
```{r}
length(which(output_hits$adj.P.Val < 0.05)) # returns 0
```
Notice that the adjusted p-values are correct. This simply means there is no evidence at all for rejecting the null hypothesis. This is not very promising as the 108 genes that were significantly expressed before the adjustment all dissapeared, and after adjustment of the p-value we have none.
Let us try multiple hypothesis testing to see if we can get better adjusted p values.
*3 - Multiple Hypothesis Testing*
Create a different linear model that includes 2 parameters, trial number and trial type. Here are the adjusted p-values:
```{r}
d = DGEList(counts=filtered_data_matrix, group=sample_groups$trial_num)
model_design_two_params <- model.matrix(
~ sample_groups$Test_run + sample_groups$trial_num)
#kable(model_design_two_params,type="html")
```
```{r}
fit_params <- lmFit(minimalSet, model_design_two_params)
fit2_params <- eBayes(fit_params,trend=TRUE)
topfit_params <- topTable(fit2_params,
coef=ncol(model_design_two_params),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits_params <- merge(normalized_count_data[,1:2],
topfit_params,by.y=0,by.x=1,all.y=TRUE)
#sort by pvalue
output_hits_params <- output_hits_params[order(output_hits_params$P.Value),]
kable(output_hits_params[1:10,],type="html")
```
Genes that pass the threshold:
```{r}
length(which(output_hits_params$P.Value < 0.05))
```
Genes that pass adjusted p value:
```{r}
length(which(output_hits_params$adj.P.Val < 0.05))
```
```{r}
d = DGEList(counts=filtered_data_matrix, group=sample_groups$Test_run)
model_design_pat <- model.matrix(
~ sample_groups$trial_num + sample_groups$Test_run)
#estimate dispersion
d <- estimateDisp(d, model_design_pat)
#calculate normalization factors
d <- calcNormFactors(d)
#fit model
fit <- glmQLFit(d, model_design_pat)
#calculate differential expression
qlf.pos_vs_neg <- glmQLFTest(fit, coef='sample_groups$trial_num2')
```
```{r}
#Get all the results
qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue",
n = nrow(filtered_data_matrix))
#length(which(qlf_output_hits$table$PValue < 0.05))
#length(which(qlf_output_hits$table$FDR < 0.05))
```
```{r}
kable(topTags(qlf.pos_vs_neg), type="html")
```
Find out the number of upregulated genes:
```{r}
length(which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 0))
```
... and downregulated genes
```{r}
length(which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC < 0))
```
```{r}
# Referred to lecture slides for this portion
#merge gene names with the top hits and collect the down regulated and upregulated genes
qlf_output_hits_withgn <- merge(normalized_count_data[,1:2],qlf_output_hits, by.x=1, by.y = 0)
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
upregulated_genes <- qlf_output_hits_withgn$hgnc_symbol[
which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC > 0)]
downregulated_genes <- qlf_output_hits_withgn$hgnc_symbol[
which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC < 0)]
all_genes <- c(upregulated_genes, downregulated_genes)
write.table(x=upregulated_genes,
file=file.path("data","upregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
file=file.path("data","downregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=all_genes,
file=file.path("data","all_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
```