---
title: "Goby survival"
author: "Luke Yates"
date: "24/11/2021"
bibliography: modSel.bib
link-citations: true
output:
  html_document:
    number_sections: false
    toc: false
    toc_float:
      collapsed: false
      smooth_scroll: false
---
```{r setup, include=FALSE}
library(tidyverse)
library(emdbook)   # version 1.3.12
library(bbmle)     # version 1.0.23.1
library(pbmcapply) # parallel lapply with progress bar
knitr::opts_chunk$set(echo = TRUE, dev = "png",
                      dev.args = list(type = "cairo"),
                      dpi = 96)
```
## Introduction

The Goby survival data set contains measurements from experimental manipulations of *Elacatinus evelynae* and *E. prochilos* in the US Virgin Islands, 2000-2002 [@Wilson2004]. The data are compiled from five experiments, $x = 1,2,\ldots,5$, collected over three years, and include information on animal density $d$ and habitat quality $q$. Survival times $T$ are modelled using the Weibull distribution; the aim of the analysis is to investigate the effect of density and habitat quality on mortality rate. Following the analysis of @Bolker2008 [pp. 276-283], we consider a set of 10 candidate models characterised by the dependence of the scale parameter $s$ on the variables $x$, $d$ and $q$. The most complex model, denoted ($\mbox{xqdi}$), is
\begin{eqnarray*}
&& T \sim \mbox{Weibull}(a,\, s_x(d,q)),\\[2mm]
&& s_x(d,q) = \exp\left(\alpha_x + \beta q + (\gamma + \delta q)\,d\right),
\end{eqnarray*}
with $s$ depending on $x$, $q$, $d$, and an interaction, $i$, between $q$ and $d$. Models of lower complexity, with various dependencies omitted, are labelled in the same manner, together with ($\mbox{zero}$), which denotes the simplest model (a single shape and a single scale parameter). The data set contains 369 rows.
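To make the parameterisation concrete, the scale function above translates directly into R. This is a minimal sketch with hypothetical parameter names; the actual model definitions live in `goby_functions.R`, loaded below.

```{r scale sketch}
# Hypothetical transcription of the (xqdi) scale function: alpha is a vector
# of per-experiment intercepts; beta, gamma and delta are scalars.
scale.xqdi <- function(alpha, beta, gamma, delta, x, d, q){
  exp(alpha[x] + beta*q + (gamma + delta*q)*d)
}
```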
## Load data and specify models

We load the data from the R package `emdbook` (with thanks to Jackie Wilson for giving permission for their use) and make a couple of small preparations:
```{r load data}
library(emdbook)
data(GobySurvival)
GobySurvival$day1 <- GobySurvival$d1 - 1 # shift the observation window to start at day 0
# d2 == 70 marks fish still alive at the end of the experiment (right-censored)
GobySurvival$day2 <- ifelse(GobySurvival$d2 == 70, Inf, GobySurvival$d2 - 1)
```
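As a quick sanity check (an aside, not part of the original workflow), we can confirm the structure of the prepared data:

```{r inspect data}
str(GobySurvival) # experiment, density, quality and the observation-day columns
table(is.infinite(GobySurvival$day2)) # TRUE counts the right-censored fish
```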
We load a script containing functions to specify, compute, and (numerically) optimise the model likelihoods (see the GitHub repo):
```{r load goby functions}
source("goby_functions.R")
# nll.xqdi computes negative log-likelihood of the full model given data
# nll.pred computes predictive negative log-likelihood for given fitted model and test data
# fit.goby fits all 10 goby survival models using MLE, given data
```
For example, the function `nll.xqdi` returns the negative log-likelihood of the full model ($\mbox{xqdi}$):
```{r}
nll.xqdi
```
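For readers without the script to hand, here is a minimal sketch of an interval-censored Weibull negative log-likelihood in the same spirit; it assumes, following @Bolker2008, that each fish dies in the interval $(\mbox{day1}, \mbox{day2}]$, so each observation contributes $P(t_1 < T \le t_2)$ to the likelihood (the real `nll.xqdi` additionally parameterises the scale as described above):

```{r nll sketch}
# Illustrative only: the actual nll.xqdi is defined in goby_functions.R.
# Note that pweibull(Inf, ...) = 1, so right-censored observations
# (day2 = Inf) are handled automatically.
nll.sketch <- function(lshape, lscale, data){
  shape <- exp(lshape); scale <- exp(lscale)
  -sum(log(pweibull(data$day2, shape, scale) -
             pweibull(data$day1, shape, scale)))
}
```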
The function `fit.goby` specifies all 10 models (by fixing a subset of the parameters to 0) and fits them using the numerical optimiser `bbmle::mle2`, returning a list of the fitted objects. For example, we use the function to fit all models to the full data set and print the fitted object for the ($\mbox{qd}$) model:
```{r}
m.fits <- fit.goby(GobySurvival)
m.fits$qd
```
For later use, it will be handy to have the models' names and dimensions (numbers of estimated parameters):
```{r}
m.dims <- sapply(m.fits, function(m) attr(logLik(m),"df"))
m.names <- names(m.fits)
```
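For a quick overview of model complexity, the two can be displayed side by side:

```{r model dims}
tibble(model = m.names, dim = m.dims) %>% arrange(dim)
```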
## Apply LOO
```{r load loo, echo = FALSE}
m.loo.pw <- readRDS("goby_loo.rds")
```
To apply leave-one-out CV, we define the following function, which uses `pbmclapply`---a parallel implementation of `lapply`---to compute the pointwise predictive log-likelihood (via the `nll.pred` function) for all 369 LOO data splits and all 10 models. The computation is embarrassingly parallel: using 30 cores reduces the run time from around 15 minutes to roughly 30 seconds.
```{r loo, eval=FALSE}
MAX_CORES <- 30
goby.loo <- function(data){
  pbmclapply(1:nrow(data), function(i){ # parallel lapply over the n LOO splits
    fit.train <- fit.goby(data[-i,]) # fit all 10 models to the training set
    sapply(fit.train, function(m) nll.pred(m, data[i,]))*2 # deviance-scale score at the left-out point
  }, mc.cores = MAX_CORES) %>%
    sapply(identity) %>% t %>% as_tibble # combine the pointwise estimates into a tibble
}
m.loo.pw <- goby.loo(GobySurvival) # 28 secs using 30 cores
```
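The value of `MAX_CORES` is tuned to our machine; a portable alternative (an assumption, not part of the original script) is to cap the request at the available core count:

```{r max cores, eval=FALSE}
MAX_CORES <- min(30, parallel::detectCores()) # never request more cores than exist
```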
The tibble `m.loo.pw` has a column for each model and a row for each data point:
```{r}
sample_n(m.loo.pw,5)
```
An initial look at the LOO scores shows that model `qd` has the lowest estimate:
```{r}
m.loo <- m.loo.pw %>% map_dbl(sum) # sum over columns to get the loo score (deviance)
m.min <- m.loo %>% which.min # store lowest-scoring model for later use
m.loo %>% {.-min(.)} # delta scores
```
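As a quick cross-check (an aside; the bootstrap in the next section is the approach used in this analysis), the pointwise scores also yield a paired standard error for each model's score difference from the lowest-scoring model:

```{r paired se}
n <- nrow(m.loo.pw)
# SE of the summed score difference, exploiting the pairing across data points
map_dbl(m.loo.pw, ~ sd(.x - m.loo.pw[[m.min]]) * sqrt(n))
```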
```{r, echo=FALSE}
m.boot <- readRDS("goby_boot.rds")
```
## Apply the OSE rule
To estimate the uncertainty and correlation of the LOO estimates, we compute the deviance for a set of (non-parametric) bootstrap samples:
```{r, eval = FALSE}
goby.boot <- function(data, nboot){
  pbmclapply(1:nboot, function(i){
    fitted <- FALSE
    while (!fitted){ # retry via try() in case a model fails to fit for a particular bootstrap sample
      GS.boot <- try(fit.goby(sample_n(data, nrow(data), replace = TRUE)))
      fitted <- sum(sapply(GS.boot, is.character)) == 0 # FALSE if GS.boot contains an error message
    }
    sapply(GS.boot, function(x) logLik(x)*-2) # deviance of each fitted model
  }, mc.cores = MAX_CORES) %>%
    sapply(identity) %>% t %>% as_tibble
}
m.boot <- goby.boot(GobySurvival, 1000) # 65 secs using 30 cores
```
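One practical detail (an addition, not in the original script): with forked parallel workers, reproducible bootstrap draws require the parallel-safe L'Ecuyer RNG:

```{r rng seed, eval=FALSE}
RNGkind("L'Ecuyer-CMRG") # parallel-safe RNG streams for mclapply-style workers
set.seed(2021)
```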
The tibble `m.boot` contains a column for each model and a row for each bootstrap sample:
```{r boot view}
sample_n(m.boot,5)
```
We use `m.boot` to compute both the standard errors and the modified (i.e., correlation-adjusted) standard errors---the two differ substantially here because the models' scores are strongly correlated.
```{r}
se_ose <- m.boot %>% map_dbl(sd) # bootstrap SE of each model's score
se_mod <- se_ose[m.min]*sqrt(1-cor(m.boot))[m.min,] # SE adjusted for correlation with the best model
se_ose; se_mod
```
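In symbols, writing $\rho_k$ for the bootstrap correlation between the scores of model $k$ and the lowest-scoring model $k^*$, the second line above computes
$$\mbox{se}_{\mbox{mod}}(k) = \mbox{se}(k^*)\sqrt{1-\rho_k},$$
so the modified standard error shrinks to zero as a model's scores become perfectly correlated with those of $k^*$.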
To plot the results and apply the OSE rule, we make a summary table that can be passed to some pre-written plotting functions:
```{r summary table}
# create model summary table
summaryTable <- tibble(model = m.names,
                       dim = m.dims[model],
                       score = m.loo[model],
                       delScore = score - min(score),
                       se_ose = se_ose[model],
                       se_mod = se_mod[model]) %>%
  arrange(dim) %>%
  mutate(index = seq_along(dim))
summaryTable
```
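The plotting functions below display the rule graphically; numerically (a sketch using the columns above), the OSE rule selects the lowest-dimensional model whose score lies within one standard error of the best:

```{r ose numeric}
summaryTable %>%
  filter(delScore <= se_mod) %>% # within one (modified) SE of the best score
  slice_min(dim, n = 1) # the simplest such model
```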
Finally, we can plot the results:
```{r OSE}
source("OSE_functions.R")
ggarrange(plotOSE(summaryTable),
plotModifiedOSE(summaryTable),
common.legend = T, legend = "bottom")
```