-
Notifications
You must be signed in to change notification settings - Fork 5
/
mediationanalysis.Rmd
243 lines (178 loc) · 8.87 KB
/
mediationanalysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
---
title: "Mediation analysis using Propensity scoring"
output:
html_document:
number_sections: TRUE
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = F, message = F)
```
# Mediation {-}
Microbiome is in many cases suspected to be the mechanistic entity causing a correlation between a treatment and a phenotype.
Think about the hygiene hypothesis, where the environment (siblings / household-size, birth mode, rural living, antibiotics etc.) is found related to a range of chronic diseases including asthma, diabetes, allergy etc. Here, one speculates that the environment causes a shift in the microbiome which then in turn gives rise to the disease.
This is classical mediation setup, just with the tweek that the microbiome is not a univariate entity, but merely a multivariate feature set.
# Propensity scoring {-}
Propensity scoring is the concept where the relation between say a treatment and a multivariate data block is turned into a univariate vector.
This can sound tricky, but in principle, what is done is that a machine learning model is build between the treatment and the multivariate data and the predictions is then used as this propensity.
# An example {-}
The mice birth mode data has a total of n = 31 mice delivered by either vaginal birth or c-section.
Microbiome data is collected, and the previous exercises on these data clearly shows that indeed the birth mode matters on the microbiome.
Additionally, measures of immune activity is collected as well (available from the sample_data).
In order to do mediation analysis of how the microbiome mediates the effect of birth mode to immune functionality encompasses the following elements
* Establish a univariate propensity score by classification of birth mode using microbiome data.
* Run analysis of birth mode versus immune mediators
* Run analysis of propensity score versus immune mediators
* Combined mediation analysis on markers with univariate effects
## Classification of Birth mode using microbiome data {-}
Here we use the general ML package caret to train and crossvalidate a model classifying birth mode given microbiome data.
It takes a fair bit of data wrangling to make it work, but google the scripting and/or copy paste below
### Setting up the data {-}
```{r}
library(phyloseq)
library(caret)
library(broom)
library(tidyverse)
load('./data/Mice_csec.RData')
# Preprocess and define GM samples
pc <- 0.0001 # psedo count for log-trans
phyXpp <- phyX %>%
transform_sample_counts(function(x) x / sum(x)) %>%
filter_taxa(function(x) sum(x > 0)>10, TRUE) %>%
transform_sample_counts(function(x) log(x + pc))
# extract feature matrix
X <- phyXpp@otu_table %>% t %>% data.frame()
# add classes
X$trt <- as.vector(phyXpp@sam_data$Birth_mode) %>%
gsub('-','',x = .) # this gsub() fixes the labels
# remove features with 0 variance
sdd <- c(apply(phyXpp@otu_table,1,sd),200)
X <- X[,sdd>0]
```
### Running ML model {-}
```{r, eval=FALSE}
# some formal settings
repCV10 <- trainControl(method = "repeatedcv",
number = 5,
repeats = 5,
classProbs = T,
returnResamp = "all",
savePredictions = "all",
allowParallel = T,
verboseIter = F)
# Run the model including cross-validation
mdlSPLS <- train(trt ~ ., data = X,
method = 'spls',
preProc = c("center", 'scale'),
tuneGrid = expand.grid(K = 1:3, # K sets number of components
eta = seq(0,0.9,0.1), # eta sets sparsity
kappa = 0.5),
trControl = repCV10)
```
```{r, eval=F, include=FALSE}
save(file = 'propensitymodel.RData',list = c('mdlSPLS'))
```
```{r, eval=T, include=FALSE}
load('propensitymodel.RData')
```
### Plot the model {-}
```{r}
plot(mdlSPLS)
# extract the best model
best(mdlSPLS$results, metric = 'Accuracy', maximize = T)
mdlSPLS$results[best(mdlSPLS$results, metric = 'Accuracy', maximize = T),]
```
The best model (in a CV sense) is used to generate the propensities wrt. classes.
Although we can use what is optimal in terms of crossvalidation, it may be a wise choice to go for a model that are simpler but with somewhat the same performance.
```{r}
# compute the average prediction (a probability) from a model with 3 components and eta = 0.4
cvPred <- mdlSPLS$pred %>%
filter(K==2) %>%
filter(abs(eta-0.5)<0.01) %>%
dplyr::mutate(pred = Csection %>% as.numeric()) %>%
group_by(rowIndex) %>%
dplyr::summarise(yhatCV = mean(pred, na.rm = T))
# add all the meta data
cvPred <- data.frame(cvPred, phyXpp@sam_data)
```
Lets have a look at the propensity scores (y-axis) versus the classes they were trained on.
```{r}
cvPred %>%
ggplot(data = ., aes(Birth_mode,yhatCV)) +
geom_boxplot() + geom_jitter()
```
What you see in this plot is that there is a range of propensity scores within each class.
In essens that mean, that some of the samples from the c-section born group actually have a microbiome mimiking the vaginal groups and vice-versa.
Also you see some samples with rather similar propensity-scores but from different groups.
## Univariate Exposure - Endpoint models {-}
Next step is to establish an association between the exposure (birth mode) and the endpoint (immuno-markers)
Here we use a simple linear model
```{r}
tb_expo_ep <- cvPred %>%
gather(var,val,MLNCD3.CD19.all:SPLEENCD69.FoxP3.CD4.1) %>%
group_by(var) %>%
do(lm(data = ., val~Birth_mode) %>% tidy) %>%
filter(term!='(Intercept)')
tb_expo_ep %>%
ggplot(data =. , aes(estimate,-log10(p.value), label = var)) +
geom_point() +
ggrepel::geom_text_repel()
```
Here we see that some MLNxxx markers are higher in vaginal born mice, while SpleenCD1 markers are lower.
## Univariate Microbiome propensity score - Endpoint models {-}
Here we use a correlation analysis, or alternatively, in case of covariates, linear regression.
```{r}
tb_GM_ep <- cvPred %>%
gather(var,val,MLNCD3.CD19.all:SPLEENCD69.FoxP3.CD4.1) %>%
group_by(var) %>%
do(cor.test(data = ., ~val + yhatCV, method = 'spearman') %>% tidy)
tb_GM_ep %>%
ggplot(data = ., aes(estimate,-log10(p.value), label = var)) +
geom_point() +
ggrepel::geom_text_repel()
```
To some extend this mimik the findings from the exposure - endpoint models, as the propensity score is high when the microbiome is very C-section like.
## Mediation analysis {-}
For a few immune markers we association both towards exposure (birth mode) as well as towards the microbiome propensity of birth mode.
For instance _MLNCD69.FoxP3.CD4_, _MLNFoxP3.CD4_ and _SpleenCD1d.CD3..iNKT.cells_, lets run a mediation analysis to reveal whether the effect of the birth mode on the immune markers is through the microbiome or simply just two seperate effects.
```{r}
library(mediation)
### FOR one example (MLNCD69.FoxP3.CD4)
# setup the data frame (removing missing entries)
df <- cvPred %>% filter(!is.na(MLNCD69.FoxP3.CD4))
# build two models
## mediator models
mediator_model <- lm(yhatCV ~ Birth_mode, df)
## outcome model with both exposure and mediator as explanatory variables
outcome_model <- lm(MLNCD69.FoxP3.CD4 ~ Birth_mode + yhatCV, df)
## Run the mediation analysis and get stats out.
results <- mediate(mediator_model, outcome_model, treat= 'Birth_mode', mediator= 'yhatCV' ,boot=TRUE, sims=500)
summary(results)
```
Here ideed the total effect seems to be mediated (ACME = Average Causal Mediated Effect). While the direct is not significant.
...here is another example
```{r}
# setup the data frame (removing missing entries)
df <- cvPred %>% filter(!is.na(SpleenCD1d.CD3..iNKT.cells.))
# build two models
## mediator models
mediator_model <- lm(yhatCV ~ Birth_mode, df)
## outcome model with both exposure and mediator as explanatory variables
outcome_model <- lm(SpleenCD1d.CD3..iNKT.cells. ~ Birth_mode + yhatCV, df)
## Run the mediation analysis and get stats out.
results <- mediate(mediator_model, outcome_model, treat= 'Birth_mode', mediator= 'yhatCV' ,boot=TRUE, sims=500)
summary(results)
```
Here the effect is somewhat mediated but also some is direct though not significant.
## Plot of three examples {-}
```{r}
cvPred %>%
gather(var,val,MLNCD3.CD19.all:SPLEENCD69.FoxP3.CD4.1) %>%
filter(var %in% c('SpleenCD1d.CD3..iNKT.cells.','MLNCD69.FoxP3.CD4','MLNFoxP3.CD4')) %>%
ggplot(data = ., aes(yhatCV,val, color = Birth_mode)) +
geom_point() +
stat_smooth(method = lm, se = F) +
facet_wrap(~var, scales = 'free') +
theme(legend.position = 'bottom')
```
For the MLNxxx types the story is pretty clear on a fair part of the effect is through GM.
For the SpleenCD1 marker it is not so obvious. In the C-section arm is seems as the microbiome promote a higher response when very c-section like. However, in the Vaginal arm - which are just lower - there is no correlation.