-
Notifications
You must be signed in to change notification settings - Fork 4
/
DispersionExploration.Rmd
210 lines (151 loc) · 11.4 KB
/
DispersionExploration.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
---
title: "Dispersion vs. Control"
output: html_document
author:
- Kyra Grantz
- C. Jessica E. Metcalf
- Justin Lessler
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
require(tidyverse)
require(gridExtra)
country.dat <- read.csv("data/WHOSitrepTable2_20200211.csv")
country.dat <- country.dat %>%
mutate(total.cases = confirmed - cases.unknown) %>%
filter(total.cases>0)
```
**NOTE (2020-15-2):** Update at bottom based on numbers
as of February 15 2020.
Despite introductions of the novel coronavirus into 27 countries, as of February 11, 2020, there has been little documented onward transmission outside of China. At the start of the outbreak, the basic reproductive number, $R_0$, was estimated to be somewhere between 2 and 3 in Wuhan. That is, early growth features of the outbreak indicate that, on average, one infected individual infects between 2 and 3 other individuals. However, this same rate of epidemic growth has not been observed outside of China (and perhaps even in other provinces within China) despite numerous introductions.
These apparently contradictory observations can be reconciled in two ways: (A) onward transmission of the virus is less likely outside of China, presumably due to case finding paired with isolation and quarantine, i.e. the effective reproductive number $R_e$ is reduced; or (B) transmission of COVID-19 is in general overdispersed, i.e., the majority of transmission is due to a few superspreading events, while the vast majority of infected individuals do not transmit the virus. Perhaps most likely is that we are seeing some combination of these two effects.
By examining the number of introductions that have failed to result in onward transmission, we can get a sense of how extreme each of these effects has to be, and how they might work in combination, to produce the observed results. Following [Lloyd-Smith et al., 2005](https://www.nature.com/articles/nature04153), we assume that the number of secondary cases associated with an infectious individual is drawn from a negative binomial distribution with population mean $R_e$ and dispersion parameter $\theta$, where this distribution encodes individual characteristics of contact, environment, etc., that might modulate individual onward transmission. We estimate a range of $R_e$ and $\theta$ values consistent with the observed number of secondary transmission events outside of China to reconcile the seemingly contradictory observations of epidemic growth.
#### Data
We use data from the [WHO Situation Report](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports) on the number of assumed imported (n=`r sum(country.dat$cases.travel)`) and local cases (n=`r sum(country.dat$cases.outside)`) in `r length(unique(country.dat$Country))` countries to simulate an expected number of secondary infections caused by each known infection within country, assuming a uniform distribution of onward transmission events across all known cases. We repeat this simulation process 10,000 times to model the uncertainty in the data.
```{r, message=FALSE, warning=FALSE}
nreps = 10000
## Generate datasets from known number of cases (total + assumed local transmission in each country)
sim_data <- function(nreps, ntotal.vec, nlocal.vec){
tmp <- matrix(rep(0, nreps), nrow=1)
for(i in 1:length(ntotal.vec)){
tmp <- rbind(tmp, rmultinom(nreps, nlocal.vec[i], rep(1/ntotal.vec[i], ntotal.vec[i])))
}
return(tmp[-1,])
}
tmp.dat <- sim_data(nreps, country.dat$total.cases, country.dat$cases.outside)
```
#### Dispersion parameter estimation
We then find the optimal dispersion parameter, $\theta$, that best makes each assumed simulated data set consistent with a random value of $R_e$ drawn from a plausible range (0.1 - 3). Individual variation in infectiousness implies outbreaks are rarer but more explosive. Interpreting the $\theta$ parameter is eased by framing it in terms of the fraction of individuals responsible for 80\% of onward transmission (by analogy with [the 20/80 rule](https://www.pnas.org/content/94/1/338)).
Key functions used in the analysis:
```{r, message=FALSE, warning=FALSE}
## likelihood function for theta to optimize
to_optim <- function(theta, R0, data) {
-sum(dnbinom(x=data, size=theta, mu=R0, log = TRUE))
}
## optimization function
do_optim <- function(R0, data) {
return(optimize(to_optim,c(0,5000),R0=R0, data=data)$minimum)
}
## Calculate proportion to infect some percentage of the population given
## Re and theta.
getPropInfected <- Vectorize(function(Re,theta,target.prop=0.80) {
max.x <- 5000
tmp <- dnbinom(1:max.x, mu=Re, size=theta) * 1:max.x
tmp2 <- rev(cumsum(rev(tmp)))/sum(tmp)
rc<-1-pnbinom(which.min(abs(tmp2-target.prop)), mu=Re, size=theta)
return(rc)
})
```
```{r echo=FALSE, fig.height=6, fig.width=5, message=FALSE, warning=FALSE, fig.align="center"}
# fitting to simulated data
#Now wrappering in functions. First generate data
make_dat <- function(nreps, tmp.dat) {
# theta MLE for each pair of simulated data + randomly drawn Re
Rvec <- runif(nreps, 0.1, 3)
tmp <- vector(length=nreps)
tmp_sars <- vector(length=nreps)
tmp_R2 <- vector(length=nreps)
for(i in 1:nreps){
tmp[i] <- do_optim(Rvec[i], data = tmp.dat[,i]) #optimal theta for random R
tmp_R2[i] < do_optim(2, data=tmp.dat[,i]) # optimal theta for R=2
}
dat <- data.frame(Rvec,
theta = tmp,
p80 = getPropInfected(Rvec, tmp))
return(dat)
}
dat<-make_dat(nreps, tmp.dat)
make_figs <- function(dat) {
p1 <- ggplot(dat%>%mutate(theta=ifelse(theta>10,10, theta)) , aes(x=Rvec, y=theta)) +
geom_point(col='steelblue2', alpha=0.1) +
geom_smooth(col='black') +
scale_x_log10() +
scale_y_log10() +
theme_bw() +
geom_hline(yintercept = .16, lty=2, color="red" ) +
geom_rect(data=dat[1,], aes(xmin=2,xmax=3, ymin=0.01, ymax=10),fill="black", alpha=.2) +
xlab(expression(R[e])) +
ylab(expression(theta))
p2 <- ggplot(dat, aes(x=Rvec, y=p80)) +
geom_point(col='steelblue2', alpha=0.1) +
geom_smooth(col='black') +
scale_x_log10() +
scale_y_log10() + theme_bw() +
geom_rect(data=dat[1,], aes(xmin=2,xmax=3, ymin=0.001, ymax=0.2),fill="black", alpha=.2) +
ylab("Proportion infecting 80%") +
xlab(expression(R[e]))
return(grid.arrange(p1, p2,
ncol=1))
}
make_figs(dat)
```
**Figure** (top) The optimal $\theta$ versus $R_e$ over 10,000 simulated data sets and (bottom) the
proportion responsible for 80% of onward transmission. Red dashed horizontal line represents the dispersion
estimated for SARS in Singapore, and the shaded region represents a rough range of plausible
$R_e$ for the outbreak of COVID-19 in Wuhan.
```{r, echo=FALSE}
inf80_R2to3 <- dat %>% filter(Rvec>=2 & Rvec<=3)
inf80_Rp5to1 <- dat %>% filter(Rvec>=0.5 & Rvec<=1)
```
We find a range of possible values of over-dispersion consistent with the observed data at $R_e = 2$, as well as reduced mean $R_e$ values. We estimate that, if $R_e$ outside of China were between 2 and 3, as estimated within Wuhan, just
`r sprintf("%1.1f%% (range %1.1f, %1.1f )", mean(inf80_R2to3$p80)*100, min(inf80_R2to3$p80)*100, max(inf80_R2to3$p80)*100)`
of all infections would be responsible for 80% of onward transmission
events. On average, the estimated dispersion parameter $\theta$ is less than that estimated for SARS in 2003, indicating that there may be more skew in the distribution of individual $R_e$ for COVID-2019 than for SARS. If $R_e$ was reduced to just below 1, in the 0.5-0.1 range, then
`r sprintf("%1.1f%% (range %1.1f, %1.1f )", mean(inf80_Rp5to1$p80)*100, min(inf80_Rp5to1$p80)*100, max(inf80_R2to3$p80)*100)`
would be responsible for 80% of onward transmission.
#### Conclusions
This analysis indicates the relative lack of observed onward transmission following introductions of COVID-19 outside of China is consistent with moderate levels of over-dispersion and $R_e$ in the range of what has been observed in Wuhan ($R_e$ = 2-3). This suggests we should be cautious about assuming the relative lack of COVID-19 transmission outside of China is the result of effective control measures, or some other fundamental difference in COVID-19 transmission outside of Wuhan (and China more broadly).
These results are driven by the near absence of secondary infections emerging from international infections, and there may be a number of onward transmissions that are unobserved, in which cases this analysis could be significantly underestimating the portion that do transmit. As further data emerges in the coming weeks, this could help identify which scenario is in play.
Parameters governing transmissibility, including $R_e$ and $\theta$, have important implications for the effective surveillance and control of a pathogen. Modeling work ( [Hellewell et al., 2020](https://www.medrxiv.org/content/10.1101/2020.02.08.20021162v1)) has found that COVID-19 may be more challenging to control when there is less over-dispersion (higher $\theta$), and targeted control activities become more efficient as over-dispersion increases ([Lloyd-Smith et al., 2005](https://www.nature.com/articles/nature04153)). Our analysis may be useful in thinking about which estimates of mean $R_e$ and dispersion $\theta$ are plausible as we try to understand the epidemiology of COVID-19, and in reconciling apparent inconsistencies between how the virus is spreading in Wuhan and elsewhere in the world.
All code and data available at: [https://github.com/HopkinsIDD/nCoV-Sandbox](https://github.com/HopkinsIDD/nCoV-Sandbox).
#### Update 2020-02-15
Rerunning analysis based on 2020-02-15 data. Thanks
to [Christopher S. Penn](https://twitter.com/cspenn?s=20)
for sending along updated data.
```{r echo=FALSE, fig.height=6, fig.width=5, message=FALSE, warning=FALSE, fig.align="center"}
## Read in updated country data.
country.dat2 <- read.csv("data/WHOSitrepTable2_20200215.csv")
country.dat2 <- country.dat2 %>%
mutate(total.cases = confirmed - cases.unknown) %>%
filter(total.cases>0)
tmp.dat2<- sim_data(nreps, country.dat2$total.cases, country.dat2$cases.outside)
dat2 <- make_dat(nreps, tmp.dat2)
make_figs(dat2)
inf80_R2to3_2 <- dat2 %>% filter(Rvec>=2 & Rvec<=3)
inf80_Rp5to1_2 <- dat2 %>% filter(Rvec>=0.5 & Rvec<=1)
```
**Updated Figure** (top) The optimal $\theta$ versus $R_e$ over 10,000 simulated data sets and (bottom) the
proportion responsible for 80% of onward transmission. Red dashed horizontal line represents the dispersion
estimated for SARS in Singapore, and the shaded region represents a rough range of plausible
$R_e$ for the outbreak of COVID-19 in Wuhan.
Based on the updated data, e estimate that, if $R_e$ outside of
China were between 2 and 3, as estimated within Wuhan, just
`r sprintf("%1.1f%% (range %1.1f, %1.1f )", mean(inf80_R2to3_2$p80)*100, min(inf80_R2to3_2$p80)*100, max(inf80_R2to3_2$p80)*100)`
of all infections would be responsible for 80% of onward transmission
events (compared to `r sprintf("%1.1f%%", mean(inf80_R2to3$p80)*100)`
with data as of February 11).
Likewise, with updated data, if $R_e$ was reduced to just below 1,
in the 0.5-0.1 range, then
`r sprintf("%1.1f%% (range %1.1f, %1.1f )", mean(inf80_Rp5to1_2$p80)*100, min(inf80_Rp5to1_2$p80)*100, max(inf80_R2to3_2$p80)*100)`
would be responsible for 80% of onward transmission
(compared to `r sprintf("%1.1f%%", mean(inf80_Rp5to1$p80)*100)`).