-
Notifications
You must be signed in to change notification settings - Fork 0
/
PrEP_Study_Microarray_Analysis_PBMC_PAXgene_only.Rmd
283 lines (154 loc) · 7.94 KB
/
PrEP_Study_Microarray_Analysis_PBMC_PAXgene_only.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
---
title: 'PreP Study microarray analysis: PBMC and PAXgene only'
author: "Claire Levy"
date: "May 22, 2017"
output: html_document
---
---
title: "PreP Study Microarray Analysis"
author: "Claire Levy"
output: github_document
---
##Experimental set up
We isolated RNA from four different sample types from 8 donors pre- and post- initiation of PreP. The pre-initation was called "Enrollment" and the post-initiation visit was called "Visit2"
Sample Types
* Duodenal biopsy
* Rectal biopsy
* PAXgene (whole blood collected into RNA preservative)
* PBMC (PBMC isolated from whole blood)
NOTE: PTID BG2305 may not have been adherent, left them in for now. May have had problems refilling prescription on time.
### Analysis
Here I will subset the data to just the PAXgene and the PBMC samples and compare them at the enrollment time point.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
library(dplyr)
library(lumi)
library(limma)
library(pander)
library(stringr)
library(reshape2)
library(ggplot2)
```
```{r read in raw data, cache = TRUE, results = 'hide'}
lumibatch<-lumiR(fileName = "raw_data/2017.04.18.smhughesFinalReport.txt ",
detectionTh = 0.05,
annotationColumn=c('ENTREZ_GENE_ID','ACCESSION', 'SYMBOL', 'PROBE_SEQUENCE', 'PROBE_START', 'CHROMOSOME', 'PROBE_CHR_ORIENTATION', 'PROBE_COORDINATES', 'DEFINITION'))
#read in pData
#Note that my pData file has rownames that match the sampleNames used in the array AND I have a column called sampleNames containing the same data because I'll want to use it later when I make MDS plots and need to merge on sampleNames. This is annoying but kind of necessary.
pData <-read.table("raw_data/pData.txt", header = TRUE, stringsAsFactors = FALSE)
#put pData into an adf
adf<-AnnotatedDataFrame(data = pData)
complete_lumibatch<-new("LumiBatch",
assayData = assayData(lumibatch),
phenoData = adf,
detection = detection(lumibatch),
controlData = controlData(lumibatch),
featureData = featureData(lumibatch))
```
```{r subset data}
#subset for just enrollment and just paxgene or PBMC
pax_PBMC<-complete_lumibatch[ ,complete_lumibatch$Tissue =="PBMC" | complete_lumibatch$Tissue == "PAXgene"]
pax_PBMC_enroll<-pax_PBMC[,pax_PBMC$Visit == "Enrollment"]
```
Plots of non-normalized data
```{r nonnormalized data}
#MDS
dim_data<- plotMDS(pax_PBMC_enroll)
d<-data.frame(data.frame(Dim1 = dim_data$x,
Dim2 = dim_data$y,
sampleNames = names(dim_data$x)))
d<-merge(d, pData, by = "sampleNames")
ggplot(d, aes(x=Dim1, y = Dim2))+
geom_point(aes(color = Tissue), size = 3)
#A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot. The default is c(5, 4, 4, 2) + 0.1.
par(cex= 0.7, mar = c(10,10,10,10))
boxplot(pax_PBMC_enroll, col = c("#1b9e77","#984ea3"))
#This plot shows that the PBMC sample are uniformly higher-expressing than the paxgene samples, but to do this comparison I need to normalize them together.
plotDensities(pax_PBMC_enroll,legend = FALSE, main = "non-normalized PAXgene and PBMC, enrollment only")
```
```{r background correction, results = 'hide'}
#the data we got from the core has no background correction (I don't think it did anyway...) so I will do it here.
B_pax_PBMC_enroll<-lumiB(pax_PBMC_enroll, method = "bgAdjust")
```
```{r vst transform and robust spline normalization, results = 'hide'}
TB_pax_PBMC_enroll<- lumiT(B_pax_PBMC_enroll)
NTB_pax_PBMC_enroll<-lumiN(TB_pax_PBMC_enroll, method = "rsn")
```
These normalized samples all look good.
```{r MDS plot function}
# a function to make nice MDS plots
#plotMDS invisibly returns a large object that has the actual dimensions data in matrices called x and y (one for each dimension)
#Dim1 is a column containing the values from the x matrix
#Dim2 is a column containing the values from the y matrix
#sampleNames is a column that holds the column names from the x matrix, which happen to be the microarray wells.
#add a plot title to this function!!!
plot_MDS<-function(batch){
dim_data<- plotMDS(batch)
d<-data.frame(data.frame(Dim1 = dim_data$x,
Dim2 = dim_data$y,
sampleNames = names(dim_data$x)))
d<-merge(d, pData, by = "sampleNames")
ggplot(d, aes(x=Dim1, y = Dim2))+
geom_point(aes(color = Tissue), size = 3)
}
```
```{r norm plots}
plotDensities(NTB_pax_PBMC_enroll,legend = FALSE, main = "Normalized PAXgene and PBMC, enrollment")
boxplot(NTB_pax_PBMC_enroll,main = "Normalized PAXgene and PBMC, enrollment only")
plot_MDS(NTB_pax_PBMC_enroll)
```
## Non-specific filtering
I will filter out any probes that are not expressed above the 0.05 p-value cut-off in at least 6 samples. I chose 6 because a probe may not be expressed at all in the 8 Enrollment samples (so 0/16 total), but may show up at Visit2 (8 possibilities). It would still be biologically interesting if a probe was expressed at Visit2 (and not at enrollment) even if it wasn't in all donors, so I'll require it to show up in at least 6 of them.
```{r ns filtering}
#count probes where det pval <0.05 in at least 7 samples
#extract those probes from the batch
find_expressed <-function(batch){
x <-rowSums(detection(batch)<0.05)>= 7
batch[x,]
}
px_PBMC_expressed<-find_expressed(NTB_pax_PBMC_enroll)
```
Number of probes removed from the data sets after filtering for expression. Started with 47323 probes.
```{r probes pre and post filter}
#here are the probes that were removed by NS filtering for each tissue
probes_removed<- data.frame(
"PAXgene and PBMC at enrollment" = nrow(fData(NTB_pax_PBMC_enroll))-nrow(fData(px_PBMC_expressed)))
pander(probes_removed)
```
```{r design and fit function}
#When I have ~0 for the intercept, each coeff represents the avg of the samples at each level of Condition and Donor. If I use ~1 instead, the intercept would represent the avg of the samples for the first factor and the next coeff would represent the increase in the average of the next factor OVER the first. See here:https://support.bioconductor.org/p/67395/ and here https://support.bioconductor.org/p/12145/
Ptid <- pData(px_PBMC_expressed)$Ptid
Tissue <- pData(px_PBMC_expressed)$Tissue
pax_PBMC_enroll_design <- model.matrix(~0+Tissue + Ptid)
pax_PBMC_enroll_fit<-lmFit(px_PBMC_expressed, design = pax_PBMC_enroll_design)
#This means that the results will be relative to PBMC tissue
pax_PBMC_enroll_contrasts<-makeContrasts(
PAXgeneEnrollVsPBMCEnroll = TissuePBMC-TissuePAXgene, levels = pax_PBMC_enroll_design)
pax_PBMC_enroll_fit2 <- contrasts.fit(pax_PBMC_enroll_contrasts, fit = pax_PBMC_enroll_fit)
pax_PBMC_enroll_fit2 <-eBayes(pax_PBMC_enroll_fit2)
```
```{r results summary function}
#results summary function
#method = separate is same as setting the limits for all coefs separately while "global" sets them all at once. I don't think this matters for me since I only have one contrast. I get the same answer if I used global or separate
fit_results<- function(fit2){
results <- decideTests(fit2,method="global", adjust.method="BH",
p.value=0.05, lfc=0.5)
resultsDF<-as.data.frame(results)
resultsDF$ProbeID<-rownames(resultsDF)
rownames(resultsDF)<-NULL
#melt the df for easy summarizing
resultsDFmelt<-melt(resultsDF, id.vars="ProbeID")
summary<-resultsDFmelt %>%
group_by(variable)%>%
summarize(down=sum(value=="-1"),up=sum(value=="1"))
pander(summary)
}
pax_PBMC_enroll_summary<-fit_results(pax_PBMC_enroll_fit2)
```
```{r toptable}
tt_pax_PBMC_enroll <- topTable(pax_PBMC_enroll_fit2, coef = "PAXgeneEnrollVsPBMCEnroll", adjust.method = "BH", number = Inf, p.value = 0.05, lfc = 0.5 )
pander(tt_pax_PBMC_enroll %>%
mutate(Direction = ifelse(logFC<0, "DOWN", "UP"))%>%
select (TargetID, Direction, DEFINITION, adj.P.Val)%>%
head(.))
```