---
title: "Goby survival"
author: "Luke Yates"
date: "24/11/2021"
bibliography: modSel.bib
link-citations: true
output:
  html_document:
    number_sections: false
    toc: false
    toc_float:
      collapsed: false
      smooth_scroll: false
---
```{r setup, include=FALSE}
library(tidyverse)
library(emdbook)   # version 1.3.12
library(bbmle)     # version 1.0.23.1
library(pbmcapply) # parallel lapply with progress bar
knitr::opts_chunk$set(echo = TRUE, dev = "png",
                      dev.args = list(type = "cairo"),
                      dpi = 96)
```
## Introduction

The Goby survival data set contains measurements from experimental manipulations of *Elacatinus evelynae* and *E. prochilos* in the US Virgin Islands, 2000-2002 [@Wilson2004]. The data are compiled from five experiments, $x = 1,2,\ldots,5$, collected over three years, and include information on animal density $d$ and habitat quality $q$. Survival times $T$ are modelled using the Weibull distribution; the aim of the analysis is to investigate the effect of density and habitat quality on mortality rate. Following the analysis of @Bolker2008 [pp. 276-283], we consider a set of 10 candidate models characterised by the dependence of the scale parameter $s$ on the variables $x$, $d$ and $q$. The most complex model, denoted ($\mbox{xqdi}$), is
\begin{eqnarray*}
&& T \sim \mbox{Weibull}(a,\, s_x(d,q)),\\[2mm]
&& s_x(d,q) = \exp\left(\alpha_x + \beta q + (\gamma + \delta q)\,d\right),
\end{eqnarray*}
with $s$ depending on $x$, $q$, $d$, and an interaction, $i$, between $q$ and $d$. Models of lower complexity, with various dependencies omitted, are labelled in the same manner, together with ($\mbox{zero}$), which denotes the simplest model (a single shape and a single scale parameter). The data set contains 369 rows.
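To make the parameterisation concrete, the scale function above translates directly into R. This is a minimal sketch with hypothetical parameter names; the actual model definitions live in `goby_functions.R`, loaded below.

```{r scale sketch}
# Hypothetical transcription of the (xqdi) scale function: alpha is a vector
# of per-experiment intercepts; beta, gamma and delta are scalars.
scale.xqdi <- function(alpha, beta, gamma, delta, x, d, q){
  exp(alpha[x] + beta*q + (gamma + delta*q)*d)
}
```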
## Load data and specify models

We load the data from the R package `emdbook` (with thanks to Jackie Wilson for giving permission for their use) and make a couple of small preparations:
```{r load data}
library(emdbook)
data(GobySurvival)
GobySurvival$day1 <- GobySurvival$d1 - 1 # shift the observation window to start at day 0
# d2 == 70 marks fish still alive at the end of the experiment (right-censored)
GobySurvival$day2 <- ifelse(GobySurvival$d2 == 70, Inf, GobySurvival$d2 - 1)
```
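As a quick sanity check (an aside, not part of the original workflow), we can confirm the structure of the prepared data:

```{r inspect data}
str(GobySurvival) # experiment, density, quality and the observation-day columns
table(is.infinite(GobySurvival$day2)) # TRUE counts the right-censored fish
```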
We load a script containing functions to specify, compute, and (numerically) optimise the model likelihoods (see the GitHub repo):
```{r load goby functions}
source("goby_functions.R")
# nll.xqdi computes negative log-likelihood of the full model given data
# nll.pred computes predictive negative log-likelihood for given fitted model and test data
# fit.goby fits all 10 goby survival models using MLE, given data
```
For example, the function `nll.xqdi` returns the negative log-likelihood of the full model ($\mbox{xqdi}$):
```{r}
nll.xqdi
```
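For readers without the script to hand, here is a minimal sketch of an interval-censored Weibull negative log-likelihood in the same spirit; it assumes, following @Bolker2008, that each fish dies in the interval $(\mbox{day1}, \mbox{day2}]$, so each observation contributes $P(t_1 < T \le t_2)$ to the likelihood (the real `nll.xqdi` additionally parameterises the scale as described above):

```{r nll sketch}
# Illustrative only: the actual nll.xqdi is defined in goby_functions.R.
# Note that pweibull(Inf, ...) = 1, so right-censored observations
# (day2 = Inf) are handled automatically.
nll.sketch <- function(lshape, lscale, data){
  shape <- exp(lshape); scale <- exp(lscale)
  -sum(log(pweibull(data$day2, shape, scale) -
             pweibull(data$day1, shape, scale)))
}
```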
The function `fit.goby` specifies all 10 models (by fixing a subset of the parameters to 0) and fits them using the numerical optimiser `bbmle::mle2`, returning a list of the fitted objects. For example, we use the function to fit all models to the full data set and print the fitted object for the ($\mbox{qd}$) model:
```{r}
m.fits <- fit.goby(GobySurvival)
m.fits$qd
```
For later use, it will be handy to have the models' names and dimensions (numbers of estimated parameters):
```{r}
m.dims <- sapply(m.fits, function(m) attr(logLik(m),"df"))
m.names <- names(m.fits)
```
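For a quick overview of model complexity, the two can be displayed side by side:

```{r model dims}
tibble(model = m.names, dim = m.dims) %>% arrange(dim)
```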
## Apply LOO
```{r load loo, echo = FALSE}
m.loo.pw <- readRDS("goby_loo.rds")
```
To apply leave-one-out CV, we define the following function, which uses `pbmclapply`---a parallel implementation of `lapply`---to compute the pointwise predictive log-likelihood (via the `nll.pred` function) for all 369 LOO data splits and all 10 models. The computation is embarrassingly parallel: using 30 cores reduces the run time from around 15 minutes to roughly 30 seconds.
```{r loo, eval=FALSE}
MAX_CORES <- 30
goby.loo <- function(data){
  pbmclapply(1:nrow(data), function(i){ # parallel lapply over the n LOO splits
    fit.train <- fit.goby(data[-i,]) # fit all 10 models to the training set
    sapply(fit.train, function(m) nll.pred(m, data[i,]))*2 # deviance-scale score at the left-out point
  }, mc.cores = MAX_CORES) %>%
    sapply(identity) %>% t %>% as_tibble # combine the pointwise estimates into a tibble
}
m.loo.pw <- goby.loo(GobySurvival) # 28 secs using 30 cores
```
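The value of `MAX_CORES` is tuned to our machine; a portable alternative (an assumption, not part of the original script) is to cap the request at the available core count:

```{r max cores, eval=FALSE}
MAX_CORES <- min(30, parallel::detectCores()) # never request more cores than exist
```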
The tibble `m.loo.pw` has a column for each model and a row for each data point:
```{r}
sample_n(m.loo.pw,5)
```
An initial look at the LOO scores shows that model `qd` has the lowest estimate:
```{r}
m.loo <- m.loo.pw %>% map_dbl(sum) # sum over columns to get the loo score (deviance)
m.min <- m.loo %>% which.min # store lowest-scoring model for later use
m.loo %>% {.-min(.)} # delta scores
```
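As a quick cross-check (an aside; the bootstrap in the next section is the approach used in this analysis), the pointwise scores also yield a paired standard error for each model's score difference from the lowest-scoring model:

```{r paired se}
n <- nrow(m.loo.pw)
# SE of the summed score difference, exploiting the pairing across data points
map_dbl(m.loo.pw, ~ sd(.x - m.loo.pw[[m.min]]) * sqrt(n))
```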
```{r, echo=FALSE}
m.boot <- readRDS("goby_boot.rds")
```
## Apply the OSE rule
To estimate the uncertainty and correlation of the LOO estimates, we compute the deviance for a set of (non-parametric) bootstrap samples:
```{r, eval = FALSE}
goby.boot <- function(data, nboot){
  pbmclapply(1:nboot, function(i){
    fitted <- FALSE
    while (!fitted){ # retry via try() in case a model fails to fit for a particular bootstrap sample
      GS.boot <- try(fit.goby(sample_n(data, nrow(data), replace = TRUE)))
      fitted <- sum(sapply(GS.boot, is.character)) == 0 # FALSE if GS.boot contains an error message
    }
    sapply(GS.boot, function(x) logLik(x)*-2) # deviance of each fitted model
  }, mc.cores = MAX_CORES) %>%
    sapply(identity) %>% t %>% as_tibble
}
m.boot <- goby.boot(GobySurvival, 1000) # 65 secs using 30 cores
```
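One practical detail (an addition, not in the original script): with forked parallel workers, reproducible bootstrap draws require the parallel-safe L'Ecuyer RNG:

```{r rng seed, eval=FALSE}
RNGkind("L'Ecuyer-CMRG") # parallel-safe RNG streams for mclapply-style workers
set.seed(2021)
```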
The tibble `m.boot` contains a column for each model and a row for each bootstrap sample:
```{r boot view}
sample_n(m.boot,5)
```
We use `m.boot` to compute both the standard errors and the modified (i.e., correlation-adjusted) standard errors---the two differ substantially here because the models' scores are strongly correlated.
```{r}
se_ose <- m.boot %>% map_dbl(sd) # bootstrap SE of each model's score
se_mod <- se_ose[m.min]*sqrt(1-cor(m.boot))[m.min,] # SE adjusted for correlation with the best model
se_ose; se_mod
```
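In symbols, writing $\rho_k$ for the bootstrap correlation between the scores of model $k$ and the lowest-scoring model $k^*$, the second line above computes
$$\mbox{se}_{\mbox{mod}}(k) = \mbox{se}(k^*)\sqrt{1-\rho_k},$$
so the modified standard error shrinks to zero as a model's scores become perfectly correlated with those of $k^*$.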
To plot the results and apply the OSE rule, we make a summary table that can be passed to some pre-written plotting functions:
```{r summary table}
# create model summary table
summaryTable <- tibble(model = m.names,
                       dim = m.dims[model],
                       score = m.loo[model],
                       delScore = score - min(score),
                       se_ose = se_ose[model],
                       se_mod = se_mod[model]) %>%
  arrange(dim) %>%
  mutate(index = seq_along(dim))
summaryTable
```
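The plotting functions below display the rule graphically; numerically (a sketch using the columns above), the OSE rule selects the lowest-dimensional model whose score lies within one standard error of the best:

```{r ose numeric}
summaryTable %>%
  filter(delScore <= se_mod) %>% # within one (modified) SE of the best score
  slice_min(dim, n = 1) # the simplest such model
```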
Finally, we can plot the results:
```{r OSE}
source("OSE_functions.R")
ggarrange(plotOSE(summaryTable),
plotModifiedOSE(summaryTable),
common.legend = T, legend = "bottom")
```