---
title: Predictive performance of multi-model ensemble forecasts of COVID-19 across European nations
knit: (function(input, ...) {
rmarkdown::render(input,
output_format = bookdown::pdf_document2(fig_caption = TRUE,
keep_tex = TRUE, number_sections = FALSE, toc = FALSE,
extra_dependencies = "colortbl"),
output_dir = here::here("output"))
})
output:
bookdown::pdf_document2:
extra_dependencies: ["colortbl"]
toc: false
number_sections: false
keep_tex: true
bookdown::word_document2: default
header-includes:
- \usepackage{setspace}\doublespacing
bibliography: references.bib
link-citations: yes
csl: ieee.csl
always_allow_html: yes
---
```{r render, include=FALSE, eval=FALSE}
# Create this document
rmarkdown::render(input = here::here("analysis", "latest.Rmd"),
output_format = bookdown::pdf_document2(fig_caption = TRUE),
output_dir = here::here("output"))
```
```{r set-up, include=FALSE}
# Packages
library(knitr)
library(here)
library(dplyr)
library(ggplot2)
library(kableExtra)
# Document settings
opts_chunk$set(eval = TRUE, echo = FALSE,
message = FALSE, warning = FALSE,
eval.after = "fig.cap")
options(scipen=1, digits=2)
theme_set(theme_bw())
# Set latest date for evaluation
eval_date <- as.Date("2022-03-07")
# Load data: evaluation scores
load_from_local <- TRUE # load from saved csv in "data" (FALSE starts download)
source(here("code", "load", "evaluation-scores.R"))
# Load summary numbers used in text
source(here("code", "summarise", "data-submissions.R"))
source(here("code", "summarise", "data.R"))
```
```{r authors, results='asis'}
# Get author list
source(here("code", "summarise", "metadata", "authorship.R"))
authors <- readLines(here("output", "metadata", "authors.txt"))
cat(authors)
```
```{r authors-institutions, results='asis', size ="small"}
institutions <- readLines(here("output", "metadata", "institutions.txt"))
cat(institutions)
# funding - names of funding bodies only
funding <- readLines(here("output", "metadata", "funding-bodies.txt"))
```
# Abstract
_Background:_ Short-term forecasts of infectious disease contribute to situational awareness and capacity planning. Based on best practice in other fields and recent insights in infectious disease epidemiology, one can maximise forecasts’ predictive performance by combining independent models into an ensemble. Here we report the performance of ensemble predictions of COVID-19 cases and deaths across Europe from March 2021 to March 2022. _Methods:_ We created the European COVID-19 Forecast Hub, an online open-access platform where modellers upload weekly forecasts for 32 countries, with results publicly visualised and evaluated. We created a weekly ensemble forecast from the equally-weighted average across individual models’ predictive quantiles. We measured forecast accuracy against a baseline model using the relative Weighted Interval Score (rWIS). We retrospectively explored ensemble methods, including weighting by past performance. _Results:_ We collected weekly forecasts from `r sum_submissions$n_models` models, of which we evaluated `r modellers$n_model` models alongside the ensemble model. The ensemble had a consistently strong performance across countries over time, performing better on rWIS than `r hub_scores$wis_vs_models$Deaths$p_better`% of forecasts for deaths (N=`r hub_scores$wis_vs_models$Deaths$n` predictions from `r hub_scores$wis_vs_models$Deaths$models` models), and `r hub_scores$wis_vs_models$Cases$p_better`% of forecasts for cases (N=`r hub_scores$wis_vs_models$Cases$n` predictions from `r hub_scores$wis_vs_models$Cases$models` models). Performance remained stable over a 4-week horizon for death forecasts but declined with longer horizons for cases. Among ensemble methods, the most influential choice came from using a median average instead of the mean, regardless of weighting component models. _Conclusions:_ Our results support combining independent models into an ensemble forecast to improve epidemiological predictions, and suggest that median averages yield better performance than methods based on means. We highlight that forecast consumers should place more weight on incident death forecasts than case forecasts at horizons greater than two weeks. _Funding:_ `r funding`
# Background
Epidemiological forecasts make quantitative statements about a disease outcome in the near future. Forecasting targets can include measures of prevalent or incident disease and its severity, for some population over a specified time horizon. Researchers, policy makers, and the general public have used such forecasts to understand and respond to the global outbreaks of COVID-19 [@basshuysenThreeWaysWhich2021; @cdcCoronavirusDisease20192020; @europeancentrefordiseasepreventionandcontrolForecastingCOVID19Cases2021]. At the same time, forecasters use a variety of methods and models for creating and publishing forecasts, varying in both defining the forecast outcome and in reporting the probability distribution of outcomes [@zelnerAccountingUncertaintyPandemic2021; @jamesUseMisuseMathematical2021].
Within Europe, comparing forecasts across both models and countries can support a range of national policy needs simultaneously. European public health professionals operate across national, regional, and continental scales, with strong existing policy networks in addition to rich patterns of cross-border migration influencing epidemic dynamics. A majority of European countries also cooperate in setting policy with inter-governmental European bodies such as the European Centre for Disease Prevention and Control (ECDC). In this case, a consistent approach to forecasting across the continent as a whole can support accurately informing cross-European monitoring, analysis, and guidance [@europeancentrefordiseasepreventionandcontrolForecastingCOVID19Cases2021]. At a regional level, multi-country forecasts can support a better understanding of the impact of regional migration networks. Meanwhile, where there is limited capacity for infectious disease forecasting at a national level, forecasters generating multi-country results can provide an otherwise-unavailable opportunity for forecasts to inform national situational awareness. Some independent forecasting models have sought to address this by producing multi-country results [@aguasModellingCOVID19Pandemic2020;@adibParticipatoryModellingApproach2021;@Agosto2020;@agostoMonitoringCOVID19Contagion2021].
Variation in forecast methods and presentation makes it difficult to compare predictive performance between forecast models, and from there to derive objective arguments for using one forecast over another. This confounds the selection of a single representative forecast and reduces the reliability of the evidence base for decisions based on forecasts. A “forecast hub” is a centralised effort to improve the transparency and usefulness of forecasts, by standardising and collating the work of many independent teams producing forecasts [@reichCollaborativeMultiyearMultimodel2019]. A hub sets a commonly agreed-upon structure for forecast targets, such as type of disease event, spatio-temporal units, or the set of quantiles of the probability distribution to include from probabilistic forecasts. For instance, a hub may collect predictions of the total number of cases reported in a given country for each day in the next two weeks. Forecasters can adopt this format and contribute forecasts for centralised storage in the public domain.
This shared infrastructure allows forecasts produced from diverse teams and methods to be visualised and quantitatively compared on a like-for-like basis, which can strengthen public and policy use of disease forecasts. The underlying approach to creating a forecast hub was pioneered in climate modelling and adapted for collaborative epidemiological forecasts of dengue [@johanssonOpenChallengeAdvance2019] and influenza in the USA [@reichCollaborativeMultiyearMultimodel2019; @reichAccuracyRealtimeMultimodel2019]. This infrastructure was adapted for forecasts of short-term COVID-19 cases and deaths in the US [@cramerUnitedStatesCOVID192021; @rayEnsembleForecastsCoronavirus2020e], prompting similar efforts in some European countries [@bracherPreregisteredShorttermForecasting2021; @funkShorttermForecastsInform2020; @bicherSupportingCOVID19PolicyMaking2021].
Standardising forecasts allows for combining multiple forecasts into a single ensemble with the potential for an improved predictive performance. Evidence from previous efforts in multi-model infectious disease forecasting suggests that forecasts from an ensemble of models can be consistently high performing compared to any one of the component models [@reichAccuracyRealtimeMultimodel2019; @johanssonOpenChallengeAdvance2019; @viboudRAPIDDEbolaForecasting2018]. Elsewhere, weather forecasting has a long-standing use of building ensembles of models using diverse methods with standardised data and formatting in order to improve performance [@buizzaIntroductionSpecialIssue2019; @moranEpidemicForecastingMessier2016].
The European COVID-19 Forecast Hub [@europeancovid-19forecasthubEuropeanCOVID19Forecast2021] is a project to collate short term forecasts of COVID-19 across 32 countries in the European region. The Hub is funded and supported by the ECDC, with the primary aim to provide reliable information about the near-term epidemiology of the COVID-19 pandemic to the research and policy communities and the general public [@europeancentrefordiseasepreventionandcontrolForecastingCOVID19Cases2021]. Second, the Hub aims to create infrastructure for storing and analysing epidemiological forecasts made in real time by diverse research teams and methods across Europe. Third, the Hub aims to maintain a community of infectious disease modellers underpinned by open science principles.
We started formally collating and combining contributions to the European Forecast Hub in March 2021. Here, we investigate the predictive performance of an ensemble of all forecasts contributed to the Hub in real time each week, as well as the performance of variations of ensemble methods created retrospectively.
# Methods
We developed infrastructure to host and analyse prospective forecasts of COVID-19 cases and deaths. The infrastructure is compatible with equivalent research software from the US [@cramerReichlabCovid19forecasthubRelease2021; @wangReichlabCovidHubUtilsRepository2021] and German and Polish COVID-19 [@bracherGermanPolishCOVID192020] Forecast Hubs, and easy to replicate for new forecasting collaborations. All data and code for this analysis are publicly available on Github [@PredictivePerformanceMultimodel2022].
### Forecast targets and models
We sought forecasts for the incidence of COVID-19 as the total reported number of cases and deaths per week. We considered forecasts for 32 countries in Europe, including all countries of the European Union, European Free Trade Area, and the United Kingdom. We compared forecasts against observed data reported for each country by Johns Hopkins University (JHU, [@dongInteractiveWebbasedDashboard2020]). JHU data sources included a mix of national and aggregated subnational data. We aggregated incidence over the Morbidity and Mortality Weekly Report (MMWR) epidemiological week definition of Sunday through Saturday.
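For illustration, aggregation from daily reports to MMWR weeks can be sketched as follows (a minimal example, not the Hub's production code; the data frame `daily` and its column names are hypothetical):
```{r mmwr-week-sketch, eval=FALSE, echo=TRUE}
library(dplyr)
library(lubridate)

# Minimal sketch: aggregate hypothetical daily JHU counts to MMWR
# epidemiological weeks (Sunday through Saturday). `daily` is assumed to have
# columns date, location, new_cases, and new_deaths.
weekly <- daily %>%
  mutate(
    epi_week = epiweek(date),   # MMWR week number (weeks start on Sunday)
    epi_year = epiyear(date),
    # label each week by its Saturday end date
    week_end = floor_date(date, unit = "week", week_start = 7) + days(6)
  ) %>%
  group_by(location, epi_year, epi_week, week_end) %>%
  summarise(cases = sum(new_cases), deaths = sum(new_deaths), .groups = "drop")
```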
Teams could express their uncertainty around any single forecast target by submitting predictions for up to 23 quantiles (from 0.01 to 0.99) of the predictive probability distribution. Teams could also submit a single point forecast. At the first submission we asked teams to add a pre-specified set of metadata briefly describing the forecasting team and methods (provided online and in supplementary information). No restrictions were placed on who could submit forecasts. To increase participation we actively contacted known forecasting teams across Europe and the US and advertised among the ECDC network. Teams submitted a broad spectrum of model types, ranging from mechanistic to empirical models, agent-based and statistical models, and ensembles of multiple quantitative or qualitative models (described at [@europeancovid-19forecasthubCommunity]). We maintain a full project specification with a detailed submissions protocol [@europeancovid-19forecasthubCovid19forecasthubeuropeWiki].
We collected forecasts submitted weekly in real time over the `r hub$n_weeks` week period from `r hub$start_date` to `r hub$end_date`. Teams submitted no later than two days after the complete dataset for the latest forecasting week became available each Sunday. We implemented an automated validation programme to check that each new forecast conformed to standardised formatting. Forecast validation ensured a monotonic increase of predictions with each increasing quantile, integer-valued non-negative counts of predicted cases, and consistent date and location definitions.
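The checks described above can be illustrated with a minimal sketch (column names broadly follow the Hub submission format but are placeholders; this is not the Hub's actual validation code):
```{r validation-sketch, eval=FALSE, echo=TRUE}
library(dplyr)

# Illustrative checks on a quantile-format submission `forecast` with columns
# location, target, target_end_date, quantile, value.
check_submission <- function(forecast, valid_locations) {
  checks <- forecast %>%
    group_by(location, target, target_end_date) %>%
    arrange(quantile, .by_group = TRUE) %>%
    summarise(
      monotonic    = !is.unsorted(value),        # values rise with each quantile
      non_negative = all(value >= 0),            # counts cannot be negative
      whole_counts = all(value == round(value)), # integer-valued predictions
      .groups = "drop"
    )
  list(
    quantiles_ok = all(checks$monotonic & checks$non_negative & checks$whole_counts),
    locations_ok = all(forecast$location %in% valid_locations),
    # weekly targets must end on a Saturday ("%u": Monday = 1, ..., Saturday = 6)
    dates_ok     = all(format(as.Date(forecast$target_end_date), "%u") == "6")
  )
}
```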
Each week we used all available valid forecasts to create a weekly real-time ensemble model (referred to as "the ensemble" from here on), for each of the 256 possible forecast targets: incident cases and deaths in 32 locations over the following one through four weeks. The ensemble method was an unweighted average of all models' forecast values, at each predictive quantile for a given location, target, and horizon. From `r hub$start_date`, we used the arithmetic mean. However, we noticed that including highly anomalous forecasts in a mean ensemble produced extremely wide uncertainty. To mitigate this, from 26^th^ July 2021 onwards the ensemble instead used the median of all models' values at each predictive quantile.
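A minimal sketch of this procedure (assuming a data frame `forecasts` with one row per model, location, target, horizon, and quantile; names are illustrative):
```{r ensemble-quantile-sketch, eval=FALSE, echo=TRUE}
library(dplyr)

# Combine all models' values at each predictive quantile: mean in the early
# period, median from 26 July 2021 onwards.
build_ensemble <- function(forecasts, forecast_date,
                           median_from = as.Date("2021-07-26")) {
  agg_fun <- if (forecast_date >= median_from) median else mean
  forecasts %>%
    group_by(location, target_variable, horizon, quantile) %>%
    summarise(value = agg_fun(value), .groups = "drop") %>%
    mutate(model = "ensemble")
}
```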
We created an open and publicly accessible interface to the forecasts and ensemble, including an online visualisation tool allowing viewers to see past data and interact with one or multiple forecasts for each country and target for up to four weeks’ horizon [@europeancovid-19forecasthubEuropeanCovid19Forecast]. All forecasts, metadata, and evaluations are freely available and held on Github [@europeancovid-19forecasthubEuropeanCOVID19Forecast2021] (archived in real-time at [@katharine_sherratt_2022_7356267]), and Zoltar, a platform for hosting epidemiological forecasts [@epiforecastsProjectECDCEuropean2021; @reichZoltarForecastArchive2021]. In the codebase for this study [@PredictivePerformanceMultimodel2022] we provide a simple method and instructions for downloading and preparing these data for analysis using R. We encourage other researchers to freely use and adapt this to support their own analyses.
### Forecast evaluation
In this study we focused only on the comparative performance of forecasting models relative to each other. Performance in absolute terms is available on the Hub website [@europeancovid-19forecasthubEuropeanCovid19Forecast]. For each model, we assessed calibration and overall predictive performance. We evaluated all previous forecasts against actual observed values for each model, stratified by the forecast horizon, location, and target. We calculated scores using the _scoringutils_ R package [@nikosibosseScoringutilsUtilitiesScoring2020]. We removed any forecast made in the week of, or the first week after, a strongly anomalous data point. We defined a data point as anomalous if any subsequent data release revised it by over 5%.
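A simplified sketch of this exclusion rule, comparing a first data release against the latest available release (data frame and column names are illustrative):
```{r anomaly-exclusion-sketch, eval=FALSE, echo=TRUE}
library(dplyr)

# Flag data points revised by more than 5% between the first and latest release
anomalies <- first_release %>%
  inner_join(latest_release,
             by = c("location", "target_variable", "week_end"),
             suffix = c("_first", "_latest")) %>%
  mutate(rel_change = abs(value_latest - value_first) / value_first) %>%
  filter(rel_change > 0.05) %>%
  select(location, target_variable, week_end)

# Exclude forecasts in the week of, and the first week after, each anomaly
excluded_weeks <- bind_rows(
  anomalies,
  mutate(anomalies, week_end = week_end + 7)
) %>%
  distinct()
```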
To investigate calibration we assessed coverage as the correspondence between the forecast probability of an event and the observed frequency of that event. This usage follows previous work in epidemic forecasting [@bracherEvaluatingEpidemicForecasts2021], and is related to the concept of reliability for binary forecasts. We established the accuracy of each model's prediction boundaries as the coverage of the predictive intervals. We calculated coverage at a given interval level $k$, where $k\in[0,1]$, as the proportion $p$ of observations that fell within the corresponding central predictive intervals across locations and forecast dates. A perfectly calibrated model would have $p=k$ at all 11 levels (corresponding to 22 quantiles excluding the median). An underconfident model at level $k$ would have $p>k$, i.e. more observations fall within a given interval than expected. In contrast, an overconfident model at level $k$ would have $p<k$, i.e. fewer observations fall within a given interval than expected. We here focus on coverage at the $k=0.5$ and $k=0.95$ levels.
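Empirical coverage at these levels can be computed directly from the quantile forecasts and observations, for example (a sketch with illustrative column names):
```{r coverage-sketch, eval=FALSE, echo=TRUE}
library(dplyr)

# `scored` is assumed to hold one row per forecast with the observed value and
# the relevant predictive quantiles spread into columns.
coverage <- scored %>%
  group_by(model, target_variable, horizon) %>%
  summarise(
    coverage_50 = mean(observed >= q0.25  & observed <= q0.75),
    coverage_95 = mean(observed >= q0.025 & observed <= q0.975),
    .groups = "drop"
  )
# A well calibrated model has coverage_50 near 0.5 and coverage_95 near 0.95;
# higher values indicate underconfidence, lower values overconfidence.
```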
We also assessed the overall predictive performance of weekly forecasts using the Weighted Interval Score (WIS) across all available quantiles. The WIS represents a parsimonious approach to scoring forecasts based on uncertainty represented as forecast values across a set of quantiles [@bracherEvaluatingEpidemicForecasts2021], and is a strictly proper scoring rule, that is, its expected value is optimised by predictions that match the data-generating distribution. As a consequence, the WIS encourages forecasters to report predictions representing their true belief about the future [@gneitingStrictlyProperScoring2007]. Each forecast for a given location and date is scored based on an observed count of weekly incidence, the median of the predictive distribution, and the predictive upper and lower quantiles corresponding to the central predictive interval level.
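For reference, the WIS of a single quantile forecast can be written out as follows, following [@bracherEvaluatingEpidemicForecasts2021] (an illustrative implementation; the analysis itself used _scoringutils_):
```{r wis-sketch, eval=FALSE, echo=TRUE}
# Interval score for a central prediction interval at level (1 - alpha)
interval_score <- function(y, lower, upper, alpha) {
  (upper - lower) +
    (2 / alpha) * (lower - y) * (y < lower) +
    (2 / alpha) * (y - upper) * (y > upper)
}

# Weighted interval score from a full set of predictive quantiles
wis <- function(y, quantiles, values) {
  ord <- order(quantiles)
  q <- quantiles[ord]
  v <- values[ord]
  median_pred <- v[which.min(abs(q - 0.5))]
  K <- (length(q) - 1) / 2                 # number of central intervals
  alphas <- 2 * q[seq_len(K)]              # e.g. 0.02, 0.05, 0.1, ..., 0.9
  scores <- vapply(seq_len(K), function(k) {
    interval_score(y, lower = v[k], upper = v[length(q) + 1 - k],
                   alpha = alphas[k])
  }, numeric(1))
  # weights: 1/2 for the median term, alpha/2 for each interval
  (0.5 * abs(y - median_pred) + sum((alphas / 2) * scores)) / (K + 0.5)
}
```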
Not all models provided forecasts for all locations and dates, and we needed to compare predictive performance in the face of various levels of missingness across each forecast target. Therefore we calculated a relative WIS. This is a measure of forecast performance which takes into account that different teams may not cover the same set of forecast targets (i.e., weeks and locations). The relative WIS is computed using a _pairwise comparison tournament_ where for each pair of models a mean score ratio is computed based on the set of shared targets. The relative WIS of a model with respect to another model is then the ratio of their respective geometric means of the mean score ratios, such that smaller values indicate better performance.
We scaled the relative WIS of each model with the relative WIS of a baseline model, for each forecast target, location, date, and horizon. The baseline model assumes case or death counts stay the same as the latest data point over all future horizons, with expanding uncertainty, described previously in [@cramerEvaluationIndividualEnsemble2021]. In this study we report the relative WIS of each model with respect to the baseline model.
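The pairwise tournament and baseline scaling can be sketched as follows (illustrative only; _scoringutils_ provides this calculation directly, and the model and column names, including the "baseline" label, are placeholders):
```{r relative-wis-sketch, eval=FALSE, echo=TRUE}
library(dplyr)

# `scores` is assumed to hold one row per model and forecast target, with
# columns model, target_id, and wis.
models <- unique(scores$model)
pairs <- expand.grid(m1 = models, m2 = models, stringsAsFactors = FALSE) %>%
  filter(m1 != m2)

# Mean score ratio for each pair of models, over their shared targets only
ratios <- purrr::pmap_dbl(pairs, function(m1, m2) {
  shared <- inner_join(filter(scores, model == m1),
                       filter(scores, model == m2),
                       by = "target_id", suffix = c("_1", "_2"))
  mean(shared$wis_1) / mean(shared$wis_2)
})

# Relative WIS: geometric mean of a model's ratios against all other models,
# then scaled by the same quantity for the baseline model
rel_wis <- tapply(ratios, pairs$m1, function(x) exp(mean(log(x))))
rel_wis_scaled <- rel_wis / rel_wis[["baseline"]]
```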
#### Retrospective ensemble methods
We retrospectively explored alternative methods for combining forecasts for each target at each week. A natural way to combine probability distributions available in the quantile format used here [@genestVincentizationRevisited1992] is
$$F^{-1}(\alpha) = \sum_{i=1}^{n}w_i F_i^{-1}(\alpha),$$
where $F_{1} \ldots F_{n}$ are the cumulative distribution functions of the individual probability distributions (in our case, the predictive distributions contributed to the Hub by each forecast model $i$), $w_i$ are a set of weights in $[0,1]$, and $\alpha$ are the quantile levels, such that, following the notation introduced in [@genestVincentizationRevisited1992],
$$F_i^{-1}(\alpha) = \inf \{t : F_i(t) \geq \alpha \}.$$
Different ensemble choices then mainly translate to the choice of weights $w_i$. An arithmetic mean ensemble uses weights at $w_i=1/n$, where all weights are equal and sum up to 1.
Alternatively, we can choose a set of weights to apply to forecasts before they are combined. Numerous options exist for choosing these weights with the aim to maximise predictive performance, including choosing weights to reflect each forecast’s past performance (thereby moving from an untrained to a trained ensemble). A straightforward choice is so-called inverse score weighting. In this case, the weights are calculated as
$$w_i = \frac{1}{S_i},$$
where $S_i$ is the forecasting skill of forecaster $i$, calculated as its relative WIS over all available forecasts, with weights then normalised to sum to 1. In the US, this method of weighting was found to outperform unweighted ensembles during some time periods [@taylorCombiningProbabilisticForecasts2021], but this was not confirmed in a similar study in Germany and Poland [@bracherPreregisteredShorttermForecasting2021].
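Given a named vector of relative WIS values (for example `rel_wis` from the sketch above), the normalised weights are simply:
```{r inverse-weights-sketch, eval=FALSE, echo=TRUE}
# Inverse-score weights, normalised to sum to 1
weights <- (1 / rel_wis) / sum(1 / rel_wis)
```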
When constructing ensembles from quantile means, a single outlier can have an oversized effect on the ensemble forecast. Previous research has found that a median ensemble, replacing the arithmetic mean of each quantile with a median of the same values, yields competitive performance while maintaining robustness to outlying forecasts [@rayComparingTrainedUntrained2022]. Building on this, we also created weighted median ensembles using the weights described above and a Harrell-Davis quantile estimator with a beta function to approximate the weighted percentiles [@harrellNewDistributionfreeQuantile1982]. We then compared the performance of unweighted and inverse relative WIS weighted mean and median ensembles, using the ratio of weighted interval scores of each ensemble model relative to the baseline model.
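One way to implement such a weighted quantile estimator is sketched below; this illustrates the beta-weighted (Harrell-Davis) idea rather than the Hub's exact implementation:
```{r weighted-hd-sketch, eval=FALSE, echo=TRUE}
# Weighted Harrell-Davis quantile estimator: the ensemble value at probability
# p is a beta-weighted average of the ordered component values x, with model
# weights w (an illustration, not necessarily the exact estimator used above).
weighted_hd_quantile <- function(x, w, p = 0.5) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord] / sum(w)
  n_eff <- 1 / sum(w^2)                    # Kish effective sample size
  a <- (n_eff + 1) * p
  b <- (n_eff + 1) * (1 - p)
  cum <- c(0, cumsum(w))                   # cumulative weights span [0, 1]
  beta_w <- pbeta(cum[-1], a, b) - pbeta(cum[-length(cum)], a, b)
  sum(beta_w * x)
}

# With equal weights this reduces to the standard Harrell-Davis estimator,
# giving an (unweighted) median ensemble at p = 0.5.
```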
# Results
```{r figure-scores-by-target, fig.cap=figure_scores_by_target_cap, fig.height=5, fig.width=5}
source(here("code", "summarise", "figure-scores-by-target.R"))
figure_scores_by_target
```
For 32 European countries, we collected, visualised, and made available online weekly COVID-19 forecasts and observed data [@europeancovid-19forecasthubEuropeanCovid19Forecast; @katharine_sherratt_2022_7356267]. Over the whole study period, we collected forecasts from `r sum_submissions$n_models` unique models. Modellers created forecasts choosing from a set of 32 possible locations, four time horizons, and two variables, and modellers variously joined and left the Hub over time. This meant the number of models contributing to the Hub varied over time and by forecasting target. Using all models and the ensemble, we created `r nrow(scores_model)` forecasting scores, where each score summarises a unique combination of forecasting model, variable, country, and week ahead horizon (Figure \@ref(fig:figure-scores-by-target)).
Of the total `r sum_submissions$n_models` models, we received the most forecasts for Germany, with 29 unique models submitting one-week case forecasts, while only 12 models ever submitted four-week case or death forecasts for Liechtenstein. Modelling teams also differed in how they expressed uncertainty. Only `r length(setdiff(submissions_point, submissions_quantile))` models provided point forecasts with no estimate of uncertainty around their predictions, while `r submissions_quantiles_full$models` models provided the full set of 23 probabilistic quantiles across the predictive distribution for each target.
In this evaluation we included `r modellers$n_model` models in comparison to the ensemble forecast (Figure \@ref(fig:figure-scores-by-target)). We have included metadata provided by modellers in the supplement and online [@europeancovid-19forecasthubEuropeanCovid19Forecast; @katharine_sherratt_2022_7356267]. At most `r modellers$targets$max_scores$n_models` models contributed forecasts for `r modellers$targets$max_scores$target_variable` in `r modellers$targets$max_scores$location_name` at the `r modellers$targets$max_scores$horizon` week horizon, with an accumulated `r modellers$targets$max_scores$n_scores` forecast scores for that single target over the study period. In contrast, `r modellers$targets$min_scores$target_variable` in `r modellers$targets$min_scores$location` at the `r modellers$targets$min_scores$horizon` week horizon saw the smallest number of forecasts, with only `r modellers$targets$min_scores$n_models` independent models contributing `r modellers$targets$min_scores$n_scores` forecast scores at any time over the `r hub$n_weeks` week period. Of the `r modellers$n_model` models included in this evaluation, `r modellers$n_model - modellers$n_model_wis` models provided fewer than the full set of 23 quantiles, and were excluded when creating the ensemble. No ensemble forecast was composed of fewer than `r modellers$targets$min_models` independent models.
```{r example-ensemble, fig.cap=figure_example_ensemble_cap, fig.height=3, fig.width=7}
source(here("code", "summarise", "figure-example-ensemble.R"))
figure_example_ensemble
```
We visually compared the absolute performance of forecasts in predicting numbers of incident cases and deaths. We observed that forecasts predicted well in times of stable epidemic behaviour, while struggling to accurately predict at longer horizons around inflection points, for example during rapid changes in population-level behaviour or surveillance. Forecast models varied widely in their ability to predict and account for the introduction of new variants, giving the ensemble forecast over these periods a high level of uncertainty. An example of weekly forecasts from the ensemble model is shown in Figure \@ref(fig:example-ensemble).
In relative terms, the ensemble of all models performed well compared to both its component models and the baseline. By relative WIS scaled against a baseline of 1 (where a score <1 indicates outperforming the baseline), the median score of forecasts from the Hub ensemble model was `r hub_scores[["score_range"]][["50%"]]`, with an interquartile range from `r hub_scores[["score_range"]][["25%"]]` (25th percentile) to `r hub_scores[["score_range"]][["75%"]]` (75th percentile). Meanwhile the median score of forecasts across all participating models (excluding the Hub ensemble) was `r modeller_scores[["score_range"]][["50%"]]` (IQR `r modeller_scores[["score_range"]][["25%"]]`-`r modeller_scores[["score_range"]][["75%"]]`).
Across all horizons and locations, the ensemble performed better on scaled relative WIS than `r hub_scores$wis_vs_models$Cases$p_better`% of forecast scores when forecasting cases (with a total N=`r hub_scores$wis_vs_models$Cases$n` from `r hub_scores$wis_vs_models$Cases$models` unique models), and `r hub_scores$wis_vs_models$Deaths$p_better`% of scores for forecasts of incident deaths (N=`r hub_scores$wis_vs_models$Deaths$n` scores from `r hub_scores$wis_vs_models$Deaths$models` models). We also saw high performance from the ensemble when evaluating against all models including those who did not submit the full set of probabilistic quantile predictions (`r hub_scores$ae_vs_models$Cases$p_better`% for cases with N=`r hub_scores$ae_vs_models$Cases$n` scores from `r hub_scores$ae_vs_models$Cases$models` models, and `r hub_scores$ae_vs_models$Deaths$p_better`% for deaths, N=`r hub_scores$ae_vs_models$Deaths$n` scores from `r hub_scores$ae_vs_models$Deaths$models` models).
```{r performance-horizon, fig.cap=figure_horizon_cap, fig.width=5, fig.height=6}
source(here("code", "summarise", "figure-performance-horizon.R"))
figure_horizon
```
The performance of individual and ensemble forecasts varied by length of the forecast horizon (Figure \@ref(fig:performance-horizon)). At each horizon, the ensemble typically outperformed both the baseline model and the aggregated scores of its component models, although we saw wide variation between individual models in performance across horizons. Both individual models and the ensemble showed a trend of worsening performance at longer horizons when forecasting cases, with the median scaled relative WIS of the ensemble across locations worsening from `r fig_horizon_ensemble[["Cases"]][["median_score"]][1]` for one-week ahead forecasts to `r fig_horizon_ensemble[["Cases"]][["median_score"]][4]` when forecasting four weeks ahead. Performance for forecasts of deaths was more stable over one through four weeks, with median ensemble performance moving from `r fig_horizon_ensemble[["Deaths"]][["median_score"]][1]` to `r fig_horizon_ensemble[["Deaths"]][["median_score"]][4]` across the four week horizons.
We observed similar trends in performance across horizons when considering how well the ensemble was calibrated with respect to the observed data. At one week ahead the case ensemble was well calibrated, with empirical coverage close to the nominal 50% and 95% levels. This did not hold at longer forecast horizons as the case forecasts became increasingly overconfident. Meanwhile, the ensemble of death forecasts was well calibrated at the 95% level across all horizons, and the calibration of death forecasts at the 50% level improved with lengthening horizons compared to being underconfident at shorter horizons.
```{r performance-location, fig.cap = fig_loc_cap, fig.height=5, fig.width=7}
source(here("code", "summarise", "figure-performance-location.R"))
figure_loc
```
The ensemble also performed consistently well in comparison to individual models when forecasting across countries (Figure \@ref(fig:performance-location)). In total, across 32 countries forecasting for one through four weeks, when forecasting cases the ensemble outperformed 75% of component models in `r h1234_summary[["Cases"]][["beat_75"]]` countries, and outperformed all available models in `r h1234_summary[["Cases"]][["beat_100"]]` countries. When forecasting deaths, the ensemble outperformed 75% and 100% of models in `r h1234_summary[["Deaths"]][["beat_75"]]` and `r h1234_summary[["Deaths"]][["beat_100"]]` countries respectively. Considering only the two-week horizon shown in Figure \@ref(fig:performance-location), the ensemble of case forecasts outperformed 75% of models in `r fig_loc_summary[["Cases"]][["beat_75"]]` countries and all models in only `r fig_loc_summary[["Cases"]][["beat_100"]]` countries. At the two-week horizon for forecasts of deaths, the ensemble outperformed 75% and 100% of its component models in `r fig_loc_summary[["Deaths"]][["beat_75"]]` and `r fig_loc_summary[["Deaths"]][["beat_100"]]` countries respectively.
```{r ensembles}
source(here("code", "summarise", "table-1.R"))
ensemble_eval_table
```
We considered alternative methods for creating ensembles from the participating forecasts, using either a mean or median to combine either weighted or unweighted forecasts. We evaluated each alternative ensemble model against the baseline model, taking the mean score ratio across all targets (Table \@ref(tab:ensembles)). Across locations we observed that the median outperformed the mean for both case and death targets at all one through four week horizons, with the exception of cases at the one-week horizon. This held regardless of whether the component forecasts were weighted or unweighted by their individual past performance. Between methods of combination, weighting made little difference to the performance of the median ensemble, but appeared to improve performance of a mean ensemble in forecasting deaths.
# Discussion
<!-- paragraph: summary and main result -->
We collated `r round(hub$n_weeks / 4.5)` months of forecasts of COVID-19 cases and deaths across 32 countries in Europe, collecting from multiple independent teams and using a principled approach to standardising both forecast targets and the predictive distribution of forecasts. We combined these into an ensemble forecast and compared the relative performance of forecasts between models, finding that the ensemble forecasts outperformed most individual models across all countries and horizons over time.
<!-- paragraph: inability to predict changes in trend -->
Across all models we observed that forecasting changes in trend in real time was particularly challenging. Our study period included multiple fundamental changes in viral-, individual-, and population-level factors driving the transmission of COVID-19 across Europe. In early 2021, the introduction of vaccination started to change population-level associations between infections, cases, and deaths [@europeancentrefordiseasepreventionandcontrolInterimGuidanceBenefits2021], while the Delta variant emerged and became dominant [@europeancentrefordiseasepreventionandcontrolThreatAssessmentBrief2021]. Similarly, from late 2021 we saw individually waning immunity interact with the emergence and global spread of the Omicron variant [@europeancentrefordiseasepreventionandcontrolAssessmentFurtherSpread2022]. Neither the extent nor the timing of these factors was uniform across European countries covered by the Forecast Hub [@europeancentrefordiseasepreventionandcontrolOverviewImplementationCOVID192021]. This meant that the performance of any single forecasting model depended partly on the ability, speed, and precision with which it could adapt to new conditions for each forecast target.
<!-- paragraph: performance over future horizons -->
We observed a contrast between a more stable performance of forecasting deaths further into the future compared to forecasts of cases. Previous work has found rapidly declining performance for case forecasts with increasing horizon [@cramerEvaluationIndividualEnsemble2021; @castroTurningPointEnd2020], while death forecasts can perform well with up to six weeks lead time [@friedmanPredictivePerformanceInternational2021]. We can link this to the specific epidemic dynamics in this study.
First, COVID-19 has a typical serial interval of less than a week [@aleneSerialIntervalIncubation2021]. This implies that case forecasts of more than two weeks only remain valid if rates of both transmission and detection remain stable over the entire forecast horizon. In contrast, we saw rapid changes in epidemic dynamics across many countries in Europe over our study period, impacting the longer term case forecasts.
Second, we can interpret the higher reliability of death forecasts as due to the different lengths and distributions of time lags from infection to case and death reporting [@jinLagDailyReported2021]. For example, a spike in infections may be matched by a consistently sharp increase in case reporting, but a longer-tailed distribution of the subsequent increase in death reports. This creates a lower magnitude of fluctuation in the time-series of deaths compared to that of cases. Similarly, surveillance data for death reporting is substantially more consistent, with fewer errors and retrospective corrections, than case reporting [@catalaRobustEstimationDiagnostic2021].
Third, we also note that the performance of trend-based forecasts may have benefited from the slower changes to trends in incident deaths caused by gradually increasing vaccination rates. These features allow forecasters to incorporate the effect of changes in transmission more easily when forecasting deaths, compared to cases.
<!-- paragraph: benefit of using ensemble models -->
We found the ensemble in this study continued to outperform both other models and the baseline at up to four weeks ahead. Our results support previous findings that ensemble forecasts are the best or nearly the best performing models with respect to absolute predictive performance and appropriate coverage of uncertainty [@funkShorttermForecastsInform2020; @cramerEvaluationIndividualEnsemble2021; @viboudRAPIDDEbolaForecasting2018]. While the ensemble was consistently high performing, it was not strictly dominant across all forecast targets, reflecting findings from previous comparable studies of COVID-19 forecasts [@bracherPreregisteredShorttermForecasting2021; @brooksComparingEnsembleApproaches2020]. Our finding suggests the usefulness of an ensemble as a robust summary when forecasting across many spatio-temporal targets, without replacing the importance of communicating the full range of model predictions.
<!-- paragraph: alternative ensemble methods -->
When exploring variations in ensemble methods, we found that the choice of the median over the mean yielded the most consistent improvement in predictive performance, regardless of the method of weighting. Other work has supported the importance of the median in providing a stable forecast that better accounts for outlier forecasts than the mean [@brooksComparingEnsembleApproaches2020], although this finding may be dependent on the quality of the individual forecast submissions. In contrast, weighting models by past performance did not result in any consistent improvement in performance. This is in line with existing mixed evidence for any optimal ensemble method for combining short term probabilistic infectious disease forecasts. Many methods of combination have performed competitively in analyses of forecasts for COVID-19 in the US, including simple mean and weighted approaches outperforming unweighted or median methods [@taylorCombiningProbabilisticForecasts2021]. This contrasts with later analyses finding weighted methods to give similar performance to a median average [@rayEnsembleForecastsCoronavirus2020e; @brooksComparingEnsembleApproaches2020]. We can partly explain this inconsistency if the performance of each method depends on the outcome being predicted (cases, deaths), its count (incident, cumulative) and absolute level, the changing disease dynamics, and the varying quality and quantity of forecasting teams over time.
<!-- paragraph: limitations -->
We note several limitations in our approach to assessing the relative performance of an ensemble among forecast models. While we have described differences in model scores, we have not used any formal statistical test for comparing forecast scores, such as the Diebold-Mariano test [@dieboldComparingPredictiveAccuracy1995], recognising that it is unclear how this is best achieved across many models. Our results are the outcome of evaluating forecasts against a specific performance metric and baseline, where multiple options for evaluation exist and the choice reflects the aim of the evaluation process. Further, our choice of baseline model affects the given performance scores in absolute terms, and more generally the choice of an appropriate baseline is not obvious when assessing infectious disease forecasts. The model used here is supported by previous work [@cramerEvaluationIndividualEnsemble2021], yet previous evaluation in a similar context has suggested that the choice of baseline affects relative performance in general [@bracherNationalSubnationalShortterm2021], and future research should address the best choice of baseline models in the context of infectious disease epidemics.
Our assessment of forecast performance may further have been inaccurate due to limitations in the observed data against which we evaluated forecasts. We sourced data from a globally aggregated database to maintain compatibility across 32 countries [@dongInteractiveWebbasedDashboard2020]. However, this made it difficult to identify the origin of lags and inconsistencies between national data streams, and to what extent these could bias forecasts for different targets. In particular, we saw some real time data revised retrospectively, introducing bias in either direction where the data used to create forecasts was not the same as the data used to evaluate them. We attempted to mitigate this by using an automated process for determining data revisions, and excluding forecasts made at a time of missing, unreliable, or heavily revised data. We also recognise that evaluating forecasts against updated data is a valid alternative approach used elsewhere [@cramerEvaluationIndividualEnsemble2021]. More generally it is unclear if the expectation of observation revisions should be a feature built into forecasts. Further research is needed to understand the perspective of end-users of forecasts in order to assess this.
The focus of this study was describing and summarising an ensemble of many models. We note that we have little insight into the individual methods and wide variety of assumptions that modellers used. While we asked modellers to provide a short description of their methods, we did not create a rigorous framework for this, and we did not document whether modellers changed the methods for a particular submitted model over time. Both the content of and variation in modelling methods and assumptions are likely to be critical to explaining performance, rather than describing or summarising it. Exploring modellers' methods and relating this to forecast performance will be an important area of future work.
<!-- paragraph: existing work + benefits in Europe -->
In an emergency setting, access to visualised forecasts and underlying data is useful for researchers, policymakers, and the public [@cdcCoronavirusDisease20192020]. Previous European multi-country efforts to forecast COVID-19 have included only single models adapted to country-specific parameters [@aguasModellingCOVID19Pandemic2020;@adibParticipatoryModellingApproach2021;@agostoMonitoringCOVID19Contagion2021].
The European Forecast Hub acted as a unique tool for creating an open-access, cross-country modelling network, and connecting this to public health policy across Europe. By opening participation to many modelling teams and achieving high international participation, we were able to create robust ensemble forecasts across Europe. This also allows comparison across forecasts built with different interpretations of current data, on a like-for-like scale in real time. The European Hub has supported policy outputs at an international, regional, and national level, including Hub forecasts cited weekly in ECDC Communicable Disease Threats Reports [@europeancentrefordiseasepreventionandcontrolWeeklyThreatsReports].
For forecast producers, an easily accessible comparison between results from different methods can highlight individual strengths and weaknesses and help prioritise new areas of work. Collating time-stamped predictions ensures that we can test true out-of-sample performance of models and avoid retrospective claims of performance. Testing the limits of forecasting ability with these comparisons forms an important part of communicating any model-based prediction to decision makers. For example, the weekly ECDC Communicable Disease Threats reports include the specific results of this work by qualitatively highlighting the greater uncertainty around case forecasts compared to death forecasts.
<!-- paragraph: future work -->
This study raises many further questions which could inform epidemic forecast modellers and users. The dataset created by the European Forecast Hub is an openly accessible, standardised, and extensively documented catalogue of real time forecasting work from a range of teams and models across Europe [@europeancovid-19forecasthubEuropeanCovid19Forecast], and we recommend its use for further research on forecast performance. In the code developed for this study we provide a worked example of downloading and using both the forecasts and their evaluation scores [@PredictivePerformanceMultimodel2022].
Future work could explore the impact on forecast models of changing epidemiology at a broad spatial scale by combining analyses of trends and turning points in cases and deaths with forecast performance, or extending to include data on vaccination, variant, or policy changes over time. There is also much scope for future research into methods for combining forecasts to improve performance of an ensemble. This includes altering the inclusion criteria of forecast models based on different thresholds of past performance, excluding or including only forecasts that predict the lowest and highest values (trimming) [@taylorCombiningProbabilisticForecasts2021], or using alternative weighting methods such as quantile regression averaging [@funkShorttermForecastsInform2020]. Exploring these questions would add to our understanding of real time performance, supporting and improving future forecasting efforts.
We see additional scope to adapt the Hub format to the changing COVID-19 situation across Europe. We have extended the Forecast Hub infrastructure to include short term forecasts for hospitalisations with COVID-19, which is a challenging task due to limited data across the locations covered by the hub. As the policy focus shifts from immediate response to anticipating changes brought by vaccinations or the geographic spread of new variants [@europeancentrefordiseasepreventionandcontrolOverviewImplementationCOVID192021], we are also separately investigating models for longer term scenarios in addition to the short term forecasts in a similar framework to existing scenario modelling work in the US [@borcheringModelingFutureCOVID192021].
<!-- paragraph: conclusions -->
In conclusion, we have shown that during a rapidly evolving epidemic spreading through multiple populations, an ensemble forecast performed highly consistently across a large matrix of forecast targets, typically outperforming the majority of its separate component models and a naive baseline model. In addition, we have linked issues with the predictability of short-term case forecasts to underlying COVID-19 epidemiology, and shown that ensemble methods based on past model performance were unable to reliably improve forecast performance. Our work constitutes a step towards both unifying COVID-19 forecasts and improving our understanding of them.
# Supplementary information
**Supplementary file 1. EPIFORGE reporting guidelines**
Completed checklist following reporting guidelines on epidemic forecasting research, after: Pollett et al., 2021. Recommended reporting items for epidemic forecasting and prediction research: The EPIFORGE 2020 guidelines. PLoS Med 18(10):e1003793. doi: 10.1371/journal.pmed.1003793
**Supplementary file 2. Participating team metadata**
Team metadata for teams participating in the European Forecast Hub and evaluated in this study.
# References
<div id="refs"></div>
\newpage