The analysis, code, Shiny App and model were developed and implemented by Alexis Robert, Lloyd Chapman, Adam Kucharski, and Sebastian Funk (Centre for Mathematical Modelling of Infectious Diseases, London School and Hygiene and Tropical Medicine), and Bastian Prasse, Frank Sandmann, Rene Niehus, and Rok Grah (ECDC).
Backend code for RShiny App: Analysis of COVID-19 outbreak risk at subnational level in the vaccine era
This repository contains the script and functions to fit data on COVID-19 outbreaks in various EU countries (currently France, Czechia, and Italy), generate 28-day case and death forecasts, and simulate the impact of changes in transmission on outbreak risks. The model is implemented using a distributed-lag version of the Endemic-Epidemic model (as described in the surveillance and hhh4addon R packages). In France and Czechia, the model implemented is age-stratified (following the approach of the hhh4contacts package). The specifics of the model in each country are detailed in the file backend_code_methods.pdf.
Clone/download this project onto your machine.
The following R packages are required to run the code:
* data.table
* qs
* readxl
* surveillance
* remotes
* hhh4addon
* hhh4contacts
* sf
* socialmixr
* ggplot2
* spdep
* abind
* ISOweek
* dplyr
* tidyr
and can be installed in R by running:
install.packages(c("data.table","qs","readxl",”surveillance”,”remotes”,”hhh4contacts”,”sf”,”socialmixr”,”ggplot2”,”spdep”,”abind”,”ISOweek”,”dplyr”,”tidyr”))
hhh4addon
is installed when the code is run if it has not already been installed.
The analysis uses COVID-19 and demographic data from various sources. Some datasets, such as the case data, vaccination data and testing data, are by default downloaded in real-time as the code runs (see Table 1 in backend_code_methods.pdf for details of the sources for each country). Other datasets, such as the population data and contact data, are saved as static versions in the Data directory. It contains the following files:
- NUTS2021.xlsx: File with urban/rural status of NUTS-3 regions from Eurostat rural development methodology
- demo_r_d2jan_1_Data.csv: Population of EU countries in 2020 by year-of-age at different NUTS levels from Eurostat demographic data
- demo_r_pjangrp3_1_Data.csv: Population of EU countries in 2020 in 5-year age groups at different NUTS levels downloaded from Eurostat demographic data
- index.csv: Index file of names and location codes of different places in Google COVID-19 Open Data
- pop_struct.csv: 2020 NUTS-3-level population data for France downloaded from INSEE
- synthetic_contacts_2020.csv: Synthetic contact data for 172 countries from Prem et al 2021 downloaded from here.
- variant.csv: National-level variant frequency data for EU countries downloaded from Data on SARS-CoV-2 variants in the EU/EEA. This dataset is a static backup version used only if there are errors in the version downloaded when the code runs.
- File NUTS_RG_20M_2021_3035.shp: contains all shapefiles, downloaded from the Eurostat website, Copied in this folder because downloading the file in R can sometimes trigger errors.
To run the model and generate forecasts in the different countries: set the working directory to this repository, and run the command
source(“R/script_model/script_import_to_pred.R”)
The overall run time should be approximately 5 hours on a standard laptop with a 3.0 GHz processor and 32 GB RAM. This will generate the output files contained in the Output directory. These files should then be moved to the RShiny GitHub repository. Change the value of pred_date
(script_import_to_pred.R) to fit the model up to a more recent prediction date.
The predictions on the RShiny GitHub repository are automatically updated weekly using the scripts in the workflow folder and GitHub Actions. The workflow scripts are similar to script_import_to_pred.R, but are separated by the type of forecasts (workflow_fit to generate forecasts at previous and current dates, workflow_sim to generate forecasts according to different scenarios). They were separated to avoid going over the GitHub Action time limit.
To run the calibration analysis over the last six months of data, run the command:
source(“R/calibration/script_calib.R”)
This will generate various figures and tables summarising the calibration results, and comparing the predictions from our model to forecasts generated by the European COVID-19 Forecast Hub ensemble model. The figures and tables will be generated in different folders of the Output directory using the full model, an empty age-stratifed, a non-age-stratified version and an empty non-stratified version of the model. This highlights the impact of added complexity on the forecasting ablities of the model.
The framework developed in this repository can be applied to countries that report daily subnational case data (at NUTS-3 level) and subnational death data (at NUTS1-NUTS2 level), along with local vaccine and testing data. If the data sources are stratified by age groups, then an age-stratified version of the model can be implemented.
Instructions for how to integrate a new country in the analysis, how to add a new variant to the fitting section, and how to add alternative NPIs in the Scenario forecasts can be found in the sections below. We also briefly present what is contained in every script. Please contact us via email ([email protected]; [email protected]) or open an issue if you need assistance with editing the repository.
As most EU countries only have non-age-stratified subnational data available, we only provide detailed instructions for adding non-age-stratified forecasts here. All countries are identified by their two-letter ISO country code as defined in the country_code
variable in the index file of the Google COVID-19 Open Data, e.g. FR
for France.
The country-specific sections of code are indicated by a tag (## TAG COUNTRY:
). Use CTRL-SHIFT-F
(or CMD-SHIFT-F
on a Mac) to find all occurrences of the tag in the repository. The description, expected input and output format of these sections is described in the comment lines following the tag.
Modify the scripts and functions in the R directory as follows:
- functions_import_data/function_import_data.R
import_case()
:- Process the data so that it contains the following variables, by adding
else if
statements wherecase
andcase1
are defined:location_key
: NUTS-3 region codedate
: date in format YYYY-MM-DDnew_confirmed
: daily number of new confirmed casesnew_tested
: daily number of new individuals tested; this should beNA
if the testing data is missing from the case data sourcecumulative_confirmed
: daily cumulative number of confirmed casescumulative_tested
: daily cumulative number of individuals tested; this should beNA
if the testing data is missing from the case data sourceweek_day
: year and week number in the format YYYY-WW where WW is the week of the year as a decimal number (00-53), with the first Monday of the year as day 1 of week 1.
- Process the data so that it contains the following variables, by adding
import_death()
:- Import and process the data on the number of death per region so that it contains the following variables:
location_key
: NUTS-1/NUTS-2 region codedate
: date in format YYYY-MM-DDnew_deceased
: daily number of new deaths
- Import and process the data on the number of death per region so that it contains the following variables:
import_vacc()
:- Import the vaccine data from a country-specific URL (containing the number of daily cases, per 10 year age band), the output of this
if()
section should be a data frame called "vacc_regions", which contains the following variables:date_week
: date in format YYYY-MM-DD.number
: Region code.Population
: Number of inhabitants in this region + age group (If not reported, can be set to NA, for now it is only used in France).Age
: Age group described in this row.dose1
: Number of 1st dose distributed on that day / region / age group.dose2
: Number of 2nd dose distributed on that day / region / age group.dose3
: Number of booster dose distributed on that day / region / age group.
- Import the vaccine data from a country-specific URL (containing the number of daily cases, per 10 year age band), the output of this
import_pop()
:- If the population data source is country-specific, add an else if statement after
if (country == “FR”)
(i.e. after the Tag). - If the data source is Eurostat, add the two-letter country code to the vector of country codes (in
else if (country %in% c(“CZ”))
). - Process the data so that it has the following variables:
number
: part or whole NUTS-3 region code, in a format that can be matched withlocation_key
in thecase
data table, via theget_reg_nb()
function infunction_utils.R
population
: population of NUTS-3 region
- If
total == FALSE
, the data frame returned by the function should also contain the following variables (in addition to the two listed above):region
: region nameage
: age group
- If the population data source is country-specific, add an else if statement after
import_test()
:- If the testing data is not already present in the case data table, import it via the
import_test()
function. - If the data source for the testing data is country-specific, add an
if
statement inimport_test()
to import and process the testing data, otherwise the national-level total testing data from the ECDC testing database is used by default. The dataframe created in this if section should contain the following variablesdate
: date in format YYYY-MM-DDage
: age group (10 year age band)population
: Number of inhabitantsnb_tests
: Number of tests on that date / age group
- If the testing data is not already present in the case data table, import it via the
import_contact()
:- Import a contact matrix from a country-specific survey (like in France), or a multi-country analysis (like in Slovakia).
- The
if()
section must define a matrix namedC
(dimensions 9*9), which contains the number of contacts between the different age groups (0-10 years old; 10-20;...; 80+).
import_map()
:- Create column
reg_nb
in map1, which corresponds to the reference of each subnational area in theindex
database (created inimport_index()
). To do so, first remove overseas territories (if any), select the entries of interest inindex
, set a key inindex
, and add the subregion code tomap1
(in a column calledreg_nb
).
- Create column
import_all_files()
:- If the population data is taken from the google database, set
google <- T
. - If vaccine data is taken from the ECDC vaccine database, set
vacc_ecdc <- T
. - If the testing data is included in the case database (imported with
import_case()
), settest = NULL
(e.g. France). - Otherwise, if the testing data comes from a national database (age-stratified), use
import_test()
withtotal = F
(e.g. Czechia). - Otherwise, to use the ECDC testing database, run
import_test()
withtotal = T
(e.g. Italy)
- If the population data is taken from the google database, set
- functions_model_sim/function_covariates.R:
dow_cov()
:- Define the vector
bank_holiday
, which contains the dates of all bank holidays in the country in 2021 and 2022.
- Define the vector
all_covariates()
:- When computing the proportion of the population tested in the past two weeks (calling
test_by_age_cov
), if the testing data is included in the incidence dataset, usedt_incidence
as an argument when calling the functiontest_by_age_cov
, otherwise usedt_test
.
- When computing the proportion of the population tested in the past two weeks (calling
- functions_import_data/function_adjust_age_group.R:
adjust_age_group()
:- Create a data table named
vacc_long
, which contains the proportion of the population per region / age group who received each dose of vaccine (on that day and cumulative).vacc_long
contains the following variables:date_week
: date in format YYYY-MM-DDregion_nb
: Region codeage
: Age group described in this rowdose
: Dose numberprop_dose
: Proportion of the population vaccinated that daycumu_dose
: Cumulative proportion of the population vaccinated
- Create a data table named
- function_utils.R:
- Add an
if()
section to functionsget_reg_nb
,get_reg_nb
,get_age
,get_nuts2_reg
, andget_nuts3_reg
. These sections are used to extract the region number, region-age, age, NUTS-2 code, and NUTS-3 code from Google COVID-19 Open Datalocation_key
.
- Add an
- functions_model_sim/function_calculate_cfr.R:
calculate_cfr()
: If the spatial granularity of case and death data differs, aggregate the case data (objectcase
) at NUTS-2 or NUTS-1 level (example for Italy and France).
A new variant, X
say, can be added to the model by:
- Adding a line in the
import_variant()
function in function_import_data.R to set which PANGO variant codes in the ECDC variant data should be coded asvariant = X
in thevariant
data table. - Adding a binary matrix
X_mat
in thevariant_covariate()
function in function_covariates.R with 1s whenX
was above a certain proportion of cases,prop_X
, and addingprop_X
as an argument ofvariant_covariate()
with a default value. X_mat
should be a matrix with the same dimensions asobserved
in thests
object with identical columns for the different region(-age) groups (as the variant data is only available at a national level).- Adding
X
to the epidemic termsne_terms
in the model equations (in function_generate_list_output.R).
New scenarios can be generated by adding new elements to the vector target
in the script function_generate_list_output.R. The age group(s) included in this new element should be defined in the following for
loop, using the vector cols
(similar to the other elements of the target
vector). For instance:
if (target[l] == "children") {
cols <- c(grep("0-9", colnames(model_fit$stsObj@observed)), grep("10-19", colnames(model_fit$stsObj@observed)))
}
Bear in mind that adding scenarios will make the output dataset larger, eventually slowing down the first display of the figures in the Shiny App.
Directory structure:
├── .github
│ ├── workflows
│ │ ├── copy_shiny.yml
│ │ └── run.yml
├── Data
│ ├── NUTS2021.xlsx
│ ├── NUTS_RG_20M_2021_3035.shp.zip
│ ├── demo_r_d2jan_1_Data.csv
│ ├── demo_r_pjangrp3_1_Data.csv
│ ├── index.csv
│ ├── pop_struct.csv
│ ├── source_NUTS.txt
│ ├── synthetic_contacts_2020.csv
│ └── variant.csv
├── LICENSE
├── Output
│ ├── output_model_CZ.RDS
│ ├── output_model_FR.RDS
│ └── output_model_IT_total.RDS
├── R
│ ├── calibration
│ │ ├── figures_calib.R
│ │ ├── function_main_figure.R
│ │ ├── script_calib.R
│ │ └── function_calib.R
│ ├── function_utils.R
│ ├── functions_custom_hhh4
│ │ ├── function_interpretControl.R
│ │ └── function_simulate.R
│ ├── functions_import_data
│ │ ├── function_adjust_age_group.R
│ │ ├── function_cumulative_incidence.R
│ │ ├── function_import_data.R
│ │ ├── import_libraries.R
│ │ └── import_scripts.R
│ ├── functions_model_sim
│ │ ├── function_calculate_cfr.R
│ │ ├── function_covariates.R
│ │ ├── function_create_sts.R
│ │ ├── function_format_pred.R
│ │ ├── function_models.R
│ │ ├── function_n_step_ahead.R
│ │ └── function_predict_covariates.R
│ ├── script_model
│ │ ├── function_generate_list_output.R
│ │ └── script_import_to_pred.R
│ └── workflow
│ ├── workflow_fit.R
│ └── workflow_sim.R
├── README.md
└── backend_code_methods.pdf
The scripts in the R folder contain the following:
function_utils.R
: General utility functions used by multiple functions/scripts, e.g. functions for extracting the region numbers from the Googlelocation_key
functions_custom_hhh4/
function_interpretControl.R
: Custom version ofinterpretControl()
function from hhh4addon package that enables simulation of age-structured distributed-lag models with separate autoregressive and neighbourhood componentsfunction_simulate.R
: Custom version ofsimulate.hhh4lag()
function from hhh4addon package to enable simulation of distributed-lag age-stratified models with separate autoregressive and neighbourhood components and to account for cumulative incidence being a covariate in autoregressive and neighbourhood predictors
functions_import_data/
function_adjust_age_group.R
: Function for calculating vaccination coverage in age groups in case data and population datafunction_cumulative_incidence.R
: Function for calculating the cumulative incidence per age group (if model is age-stratified), per region, during each variant wave (wild-type + Alpha, Delta, Omicron)function_import_data.R
: Functions for importing the datasets required to fit the modelimport_libraries.R
: Script that loads the packages required for the analysisimport_scripts.R
: Script that sources all the files with functions required for the analysis
functions_model_sim/
function_calculate_cfr.R
: Functions to compute Case Fatality Ratios from recent weeks ; Create a data frame containing the number of weekly deaths per region ; Implement regression models to predict changes in Case fatality Ratios ; and generate up to four-week-ahead forecasts on the number of weekly deaths.function_covariates.R
: Functions for constructing the covariate matrices used in fitting the modelfunction_create_sts.R
: Function for creating thests
object used in fitting the model that accounts for whether the model is age-structured or not, and creates theneighbourhood
entry of thests
object accordingly. Adapted from thenoroBE()
function in hhh4contactsfunction_format_pred.R
: Function that generates a data frame of the forecasts in thelist_pred
object output fromnStepAhead()
function_models.R
: Function for fitting the (age-stratified) distributed-laghhh4
model that creates thecontrol
object from the model equations and covariate datafunction_n_step_ahead.R
: Function for forecasting casesn
days ahead with predicted values of covariates and different scenarios for variant transmissibility and importation levelfunction_predict_covariates.R
: Functions for projecting covariates for case forecasts with different projection types
calibration/
function_calib.R
: Functions to run the calibration analysis for some (or all) countries at a set of dates, and generate figures and tables describing the calibration results.figures_calib.R
: Functions to produce the calibration figures and tables per country.function_main_figure.R
: Functions to produce the calibration figures and tables comparing scores and performance in each country.script_calib.R
: Top-level script to run the calibration analysis and generate the figures and tables, for a set of models, dates and countries. The calibration files, figures, and tables will be added in the Output directory.
workflow/
workflow_fit.R
: Top-level script used in GitHub actions to produce case forecasts at current and previous dates.workflow_sim.R
: Top-level script used in GitHub actions to produce case forecasts in different transmission scenarios.
script_model/
function_generate_list_output.R
: Function for generating case forecasts for a specific country over a certain date rangescript_import_to_pred.R
: Top-level script for setting options and running code to produce case forecasts
- Alexis Robert: [email protected]
- Lloyd Chapman: [email protected]