---
title: "Getting started"
author: "Nikos Bosse"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      fig.width = 7,
                      collapse = TRUE,
                      comment = "#>")
library(scoringutils)
library(magrittr)
library(data.table)
library(ggplot2)
library(knitr)
```
The `scoringutils` package provides a collection of metrics and proper scoring rules that make it simple to score probabilistic forecasts against observed values. You can find more information in the paper [Evaluating Forecasts with scoringutils in R](https://arxiv.org/abs/2205.07090) as well as the [Metrics-Vignette](https://epiforecasts.io/scoringutils/articles/metric-details.html) and the [Scoring forecasts directly Vignette](https://epiforecasts.io/scoringutils/articles/scoring-forecasts-directly.html).
The `scoringutils` package offers convenient automated forecast evaluation in a `data.table` format (using the function `score()`), but also provides experienced users with a set of reliable lower-level scoring metrics operating on vectors/matrices that they can build upon in other applications. In addition, it implements a wide range of flexible plots designed to cover many use cases.
The goal of this package is to provide a tested and reliable collection of metrics that can be used for scoring probabilistic forecasts (forecasts with a full predictive distribution, rather than point forecasts). It has a much stronger focus on convenience than e.g. the `scoringRules` package, which provides a comprehensive collection of proper scoring rules (also used in `scoringutils`). In contrast to other packages, `scoringutils` offers functionality to automatically evaluate forecasts, to visualise scores and to obtain relative scores between models.
Predictions can be handled in various formats: `scoringutils` can handle probabilistic forecasts in either a sample based or a quantile based format. For more detail on the expected input formats please see below. True values can be integer, continuous or binary.
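As a compact preview, the typical workflow looks like this (every function used here is introduced in the sections below):

```{r, eval=FALSE}
# Compact preview of the workflow covered in this vignette
example_quantile %>%
  validate() %>%                                   # check the input data
  score() %>%                                      # compute one score per forecast
  summarise_scores(by = c("model", "target_type")) # aggregate to the desired level
```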
## Input formats
Most of the time, the `score()` function will be able to do the entire evaluation for you. All you need to do is to pass in a `data.frame` with the appropriate columns. Which columns are required depends on the format the forecasts come in. The forecast format can either be based on quantiles (see `example_quantile` for the expected format), based on predictive samples (see `example_continuous` and `example_integer` for the expected format in each case) or in a binary format. The following table gives an overview (pairwise comparisons will be explained in more detail below):
```{r, echo=FALSE}
requirements <- data.table(
  "Format" = c(
    "quantile-based", "sample-based", "binary", "pairwise-comparisons"
  ),
  `Required columns` = c(
    "'observed', 'predicted', 'quantile'",
    "'observed', 'predicted', 'sample'",
    "'observed', 'predicted'",
    "additionally a column 'model'"
  )
)
kable(requirements)
```
Additional columns may be present to indicate a grouping of forecasts. For example, we could have forecasts made by different models in various locations at different time points, each for several weeks into the future.
`scoringutils` automatically tries to determine the *unit of a single forecast*, i.e. the combination of existing columns that uniquely identify a single forecast. It uses all existing columns for this, which can sometimes lead to issues. We therefore recommend using the function `set_forecast_unit()` to set the forecast unit manually. The function simply drops unneeded columns, while making sure that necessary, 'protected' columns like "predicted" or "observed" are retained.
```{r}
colnames(example_quantile)

set_forecast_unit(
  example_quantile,
  c("location", "target_end_date", "target_type", "horizon", "model")
) %>%
  colnames()
```
## Checking the input data
The function `validate()` can be used to check the input data. It gives a summary of what `scoringutils` thinks you are trying to achieve. It infers the type of the prediction target, the prediction format and the unit of a single forecast, gives an overview of the number of unique values per column (helpful for spotting missing data) and returns warnings or errors.
```{r}
head(example_quantile)
```
```{r}
validate(example_quantile)
```
If you are unsure what your input data should look like, have a look at the `example_quantile`, `example_integer`, `example_continuous` and `example_binary` data sets provided in the package.
The output of `validate()` can later be directly used as input to `score()` (otherwise, `score()` will just run `validate()` anyway internally).
## Showing available forecasts
The function `available_forecasts()` may also be helpful to determine where forecasts are available. Using the `by` argument you can specify the level of summary. For example, to see how many forecasts there are per model and target type, we can run
```{r}
available_forecasts(example_quantile, by = c("model", "target_type"))
```
We see that 'epiforecasts-EpiNow2' has some missing forecasts for the deaths forecast target and that UMass-MechBayes has no case forecasts.
This information can also be visualised using `plot()`:
```{r, fig.width=11, fig.height=6}
example_quantile %>%
  available_forecasts(by = c("model", "forecast_date", "target_type")) %>%
  plot() +
  facet_wrap(~ target_type)
```
You can also visualise forecasts directly using the `plot_predictions()` function:
```{r, fig.width = 9, fig.height = 6}
example_quantile %>%
  make_NA(
    what = "truth",
    target_end_date >= "2021-07-15",
    target_end_date < "2021-05-22"
  ) %>%
  make_NA(
    what = "forecast",
    model != "EuroCOVIDhub-ensemble",
    forecast_date != "2021-06-28"
  ) %>%
  plot_predictions(
    x = "target_end_date",
    by = c("target_type", "location")
  ) +
  facet_wrap(target_type ~ location, ncol = 4, scales = "free")
```
## Scoring and summarising forecasts
Forecasts can easily be scored using the `score()` function.
For clarity, we suggest setting the forecast unit explicitly; you may also want to run `validate()` yourself.
```{r}
scores <- example_quantile %>%
  set_forecast_unit(
    c("location", "target_end_date", "target_type", "location_name",
      "forecast_date", "model", "horizon")
  ) %>%
  validate() %>%
  score()

head(scores)
```
Note that in the above example some columns contain information that is duplicated with regard to the forecast unit, e.g. "location" and "location_name", and these could be dropped.
```{r}
example_quantile %>%
  set_forecast_unit(
    c("location", "target_end_date", "target_type",
      "forecast_date", "model", "horizon")
  ) %>%
  validate()
```
If we drop essential information, for example, the "target_type" column, we'll get an error informing us that the forecasts aren't uniquely identified any more.
```{r, error=TRUE}
example_quantile %>%
  set_forecast_unit(
    c("location", "target_end_date",
      "forecast_date", "model", "horizon")
  ) %>%
  validate()
```
The function `get_duplicate_forecasts()` may help to investigate the issue. When filtering for only median forecasts of the EuroCOVIDhub-ensemble, we can see that there are indeed two forecasts for every date, location and horizon.
```{r}
duplicates <- example_quantile %>%
  set_forecast_unit(
    c("location", "target_end_date",
      "forecast_date", "model", "horizon")
  ) %>%
  get_duplicate_forecasts()

duplicates[quantile == 0.5 & model == "EuroCOVIDhub-ensemble", ] %>%
  head()
```
The function `score()` returns unsummarised scores, which in most cases is not what the user wants. It returns a single score per forecast (as determined by the forecast unit). For forecasts in a quantile format, it returns one score per quantile.
A second function, `summarise_scores()` takes care of summarising these scores to the level specified by the user. The `by` argument can be used to define the level of summary. By default, `by = NULL` and the summary unit is assumed to be the same as the unit of a single forecast. For continuous forecasts, this means that nothing happens if `by` isn't specified.
```{r}
scores <- score(example_continuous)
all(scores == summarise_scores(scores), na.rm = TRUE)
```
For quantile-based forecasts, if `by = NULL`, scores are summarised across quantiles, so instead of one score per forecast unit and quantile we get one score per forecast unit.
```{r}
scores <- example_quantile %>%
  set_forecast_unit(
    c("location", "target_end_date", "target_type",
      "forecast_date", "model", "horizon")
  ) %>%
  validate() %>%
  score()

head(scores)

scores %>%
  summarise_scores() %>%
  head()
```
Through the `by` argument we can specify what unit of summary we want. We can also call `summarise_scores()` multiple times, e.g. to round the outputs by specifying `signif()` as a summary function.
```{r}
scores %>%
  summarise_scores(by = c("model", "target_type")) %>%
  summarise_scores(fun = signif, digits = 2) %>%
  kable()
```
### Scoring point forecasts
Point forecasts can be scored by making use of the quantile-based format, but with a value of `NA` or `"point"` in the quantile column. Point forecasts can be scored alongside other quantile-based forecasts. As point forecasts will have values of `NA` for metrics designed for probabilistic forecasts, it is important to use `na.rm = TRUE` when summarising.
```{r}
suppressMessages(score(example_point)) %>%
  summarise_scores(by = "model", na.rm = TRUE)
```
### Adding empirical coverage
For quantile-based forecasts we are often interested in specific coverage levels, for example, what percentage of true values fell within the 50% or the 90% prediction intervals. We can add this information using the function `add_coverage()`. This function also requires a `by` argument which defines the level of grouping for which the percentage of true values covered by certain prediction intervals is computed.
```{r}
score(example_quantile) %>%
  add_coverage(ranges = c(50, 90), by = c("model", "target_type")) %>%
  summarise_scores(by = c("model", "target_type")) %>%
  summarise_scores(fun = signif, digits = 2)
```
### Adding relative scores
In order to better compare models against each other we can use relative scores, which are computed based on pairwise comparisons (see details below). Relative scores can be added to a set of summarised scores using the function `add_pairwise_comparison()`. This requires a column called 'model' to be present. Pairwise comparisons are computed according to the grouping used when summarising: essentially, the data.frame with all scores gets split into different data.frames according to the values specified in `by` and relative scores are computed for every individual group separately. The `baseline` argument allows us to specify a baseline that can be used to scale relative scores (all scores are divided by the relative score of the baseline model). For example, to obtain relative scores separately for different forecast targets, we can run
```{r}
score(example_quantile) %>%
  summarise_scores(by = c("model", "target_type")) %>%
  add_pairwise_comparison(baseline = "EuroCOVIDhub-ensemble")
```
## Visualising scores
### Coloured table
A simple coloured table can be produced based on the scores:
```{r}
score(example_integer) %>%
  summarise_scores(by = c("model", "target_type"), na.rm = TRUE) %>%
  summarise_scores(fun = signif, digits = 2) %>%
  plot_score_table(by = "target_type") +
  facet_wrap(~ target_type, nrow = 1)
```
### Score heatmap
We can also summarise one particular metric across different categories using a simple heatmap:
```{r, fig.width=11, fig.height=6}
score(example_continuous) %>%
  summarise_scores(by = c("model", "location", "target_type")) %>%
  plot_heatmap(x = "location", metric = "bias") +
  facet_wrap(~ target_type)
```
### Weighted interval score components
The weighted interval score can be split up into three components: over-prediction, under-prediction and dispersion. These can be visualised separately with `plot_wis()`, as shown in the code chunk below.
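As a brief aside (notation ours; see the metrics vignette linked above and Bracher et al., 2021, for the full definitions): for a single central prediction interval $[l, u]$ with nominal coverage $1 - \alpha$ and observed value $y$, the interval score decomposes as

$$
IS_\alpha(F, y) = \underbrace{(u - l)}_{\text{dispersion}}
+ \underbrace{\frac{2}{\alpha} (l - y) \mathbf{1}(y < l)}_{\text{overprediction}}
+ \underbrace{\frac{2}{\alpha} (y - u) \mathbf{1}(y > u)}_{\text{underprediction}},
$$

and the weighted interval score is a weighted average of these interval scores across all interval levels in a forecast, with the absolute error of the predictive median included as an additional term.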
```{r}
score(example_quantile) %>%
  summarise_scores(by = c("target_type", "model")) %>%
  plot_wis() +
  facet_wrap(~ target_type, scales = "free")
```
## Calibration
Calibration is a measure of statistical consistency between the forecasts and observed values. The most common way of assessing calibration (more precisely: probabilistic calibration) is the PIT histogram. The probability integral transform (PIT) is equal to the cumulative distribution function of a forecast evaluated at the observed value. Ideally, PIT values should be uniformly distributed after the transformation.
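In symbols (notation ours): for a forecast with predictive cumulative distribution function $F_t$ and observed value $y_t$, the PIT value is

$$
u_t = F_t(y_t),
$$

and a probabilistically calibrated forecast yields $u_t$ values that look like draws from a standard uniform distribution.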
We can compute PIT values like this:
```{r}
example_continuous %>%
  pit(by = "model")
```
And visualise the results like this:
```{r}
example_continuous %>%
  pit(by = c("model", "target_type")) %>%
  plot_pit() +
  facet_grid(model ~ target_type)
```
Similarly for quantile-based forecasts:
```{r}
example_quantile[quantile %in% seq(0.1, 0.9, 0.1), ] %>%
  pit(by = c("model", "target_type")) %>%
  plot_pit() +
  facet_grid(model ~ target_type)
```
<!-- Another way to look at calibration are interval coverage and quantile coverage. Interval coverage is the percentage of true values that fall inside a given central prediction interval. Quantile coverage is the percentage of observed values that fall below a given quantile level. -->
<!-- In order to plot interval coverage, you need to include "range" in the `by` argument to `summarise_scores()`. The green area on the plot marks conservative behaviour, i.e. your empirical coverage is greater than it nominally need be (e.g. 55% of true values covered by all 50% central prediction intervals.) -->
```{r, eval=FALSE, include=FALSE}
example_quantile %>%
  score() %>%
  summarise_scores(by = c("model", "range")) %>%
  plot_interval_coverage()
```
<!-- To visualise quantile coverage, you need to include "quantile" in `by`. Again, the green area corresponds to conservative forecasts, where central prediction intervals would cover more than needed. -->
```{r, eval=FALSE, include=FALSE}
example_quantile %>%
  score() %>%
  summarise_scores(by = c("model", "quantile")) %>%
  plot_quantile_coverage()
```
## Pairwise comparisons
Relative scores for different models can be computed using pairwise comparisons, a sort of pairwise tournament where all combinations of two models are compared against each other based on the overlapping set of available forecasts common to both models. Internally, a ratio of the mean scores of both models is computed. The relative score of a model is then the geometric mean of all mean score ratios which involve that model. When a baseline is provided, then that baseline is excluded from the relative scores for individual models (which therefore differ slightly from relative scores without a baseline) and all relative scores are scaled by (i.e. divided by) the relative score of the baseline model.
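Sketched in symbols (notation ours; see the package documentation for the exact definition): if $\bar{S}_i^{(ij)}$ denotes the mean score of model $i$ on the forecasts it shares with model $j$, then

$$
\theta_{ij} = \frac{\bar{S}_i^{(ij)}}{\bar{S}_j^{(ij)}}, \qquad
\theta_i = \text{geometric mean of } \{\theta_{ij} : j \neq i\}, \qquad
\theta_i^{\text{scaled}} = \frac{\theta_i}{\theta_{\text{baseline}}},
$$

where $\theta_{ij}$ is the mean score ratio, $\theta_i$ the relative score of model $i$ and $\theta_i^{\text{scaled}}$ its scaled version when a baseline is supplied. Since the scores used here are negatively oriented, smaller relative scores indicate better performance.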
In `scoringutils`, pairwise comparisons can be made in two ways: through the standalone function `pairwise_comparison()` or via `add_pairwise_comparison()`, which simply adds relative scores to an existing set of summarised scores.
```{r}
example_quantile %>%
  score() %>%
  pairwise_comparison(by = "model", baseline = "EuroCOVIDhub-baseline")
```
```{r}
example_quantile %>%
  score() %>%
  summarise_scores(by = "model") %>%
  add_pairwise_comparison(baseline = "EuroCOVIDhub-baseline")
```
If using the `pairwise_comparison()` function, we can also visualise pairwise comparisons by showing the mean score ratios between models. By default, smaller values are better; the model of interest is shown on the y-axis on the left, while the model it is compared against is shown on the x-axis at the bottom. In the example above, the EuroCOVIDhub-ensemble performs best (it only has values smaller than 1), while the EuroCOVIDhub-baseline performs worst (it only has values larger than 1). For cases, the UMass-MechBayes model is excluded as there are no case forecasts available and therefore the set of overlapping forecasts is empty.
```{r, fig.width=9, fig.height=7}
example_quantile %>%
  score() %>%
  pairwise_comparison(by = c("model", "target_type")) %>%
  plot_pairwise_comparison() +
  facet_wrap(~ target_type)
```
## Additional analyses and visualisations
### Correlation between scores
It may sometimes be interesting to see how different scores correlate with each other. We can examine this using the function `correlation()`. When dealing with quantile-based forecasts, it is important to call `summarise_scores()` before `correlation()` to summarise over quantiles before computing correlations.
```{r}
example_quantile %>%
  score() %>%
  summarise_scores() %>%
  correlation()
```
Visualising correlations:
```{r}
example_quantile %>%
  score() %>%
  summarise_scores() %>%
  correlation() %>%
  plot_correlation()
```
<!-- ### Scores by interval ranges -->
<!-- If you would like to see how different forecast interval ranges contribute to average scores, you can visualise scores by interval range: -->
```{r, eval = FALSE, include = FALSE}
example_quantile %>%
  score() %>%
  summarise_scores(by = c("model", "range", "target_type")) %>%
  plot_ranges() +
  facet_wrap(~ target_type, scales = "free")
```
## Converting to quantile-based forecasts
Different metrics are available for different forecasting formats. In some cases, you may for example have forecasts in a sample-based format, but wish to make use of some of the functionality only available to quantile-based forecasts. For example, you may want to use the decomposition of the weighted interval score, or may like to compute interval coverage values.
You can convert your sample-based forecasts into a quantile-based format using the function `sample_to_quantile()`. There is, however, one caveat: Quantiles will be calculated based on the predictive samples, which may introduce a bias if the number of available samples is small.
```{r}
example_integer %>%
  sample_to_quantile(
    quantiles = c(0.01, 0.025, seq(0.05, 0.95, 0.05), 0.975, 0.99)
  ) %>%
  score()
```
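As a rough illustration of the caveat above (plain base R, not part of the `scoringutils` API): quantiles estimated from only a handful of samples are noisy and can be biased in the tails.

```{r}
set.seed(123)
# Tail quantiles estimated from 20 samples vs. 5000 samples of the same distribution
quantile(rnorm(20), probs = c(0.5, 0.9, 0.99))
quantile(rnorm(5000), probs = c(0.5, 0.9, 0.99))
```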
## Available metrics
An overview of available metrics can be found in the `metrics` data set that is included in the package.
```{r}
metrics
```
## Lower level functions
Most of these metrics are available as lower-level functions and are extensively documented. Have a look at the help files to understand these better.
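As a minimal sketch of this lower-level, vector/matrix interface (arguments shown positionally, since their names may differ between package versions):

```{r, eval=FALSE}
# One observed value per forecast and a matrix of predictive samples
# (one row per forecast, one column per sample)
observed <- c(10, 25, 33)
predicted <- matrix(rpois(3 * 500, lambda = 20), nrow = 3)

# Sample-based CRPS, returning one score per forecast
crps_sample(observed, predicted)
```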