
Score model out #46

Merged: 35 commits merged into main on Sep 12, 2024
Conversation

@elray1 (Contributor) commented Aug 20, 2024

This would resolve #13

Current status as of Aug 22:

  • mostly implemented the functionality described in the issue statement for the mean, median, pmf, and quantile output types
  • unit tests in place


@elray1 elray1 marked this pull request as ready for review August 22, 2024 20:57
@elray1 elray1 changed the base branch from main to transform_pmf August 22, 2024 20:57
@zkamvar zkamvar self-requested a review August 22, 2024 22:14
Base automatically changed from transform_pmf to main August 23, 2024 13:32
@zkamvar (Member) left a comment

MY GLOB this is a herculean effort! I appreciate you spelling out what the scoringutils evaluations should be doing. I think your approach for doing the table joins and comparing the resulting columns is a good idea.

I (as always) have some suggestions that would reduce duplication, especially around error messages.

  1. Put the abort statements into separate functions. They don't have to be validate_x; they could be check_x or error_if_not_x. This will make testing easier and reduce code complexity.
  2. For one-line if/else blocks, change them to switch() statements to reduce complexity (and length). See the sketch below.
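
A minimal sketch of both suggestions (function and variable names here are hypothetical):

# suggestion 1: a dedicated error helper, easy to test in isolation
error_if_not_output_type <- function(output_type, valid_types, call = rlang::caller_env()) {
  if (!output_type %in% valid_types) {
    cli::cli_abort("{.arg output_type} must be one of {.val {valid_types}}.", call = call)
  }
}

# suggestion 2: a one-line if/else chain collapsed into a switch()
su_forecast_type <- switch(output_type,
  mean = ,
  median = "point",
  quantile = "quantile",
  pmf = "pmf"
)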

Additionally, given that we are using examples from hubExamples, I think it might be prudent to add it as a suggested package, create examples using @examplesIf requireNamespace("hubExamples", quietly = TRUE), and replace the load() calls with the hubExamples data. This will make these examples a bit more transparent.
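
For reference, the roxygen2 pattern being suggested would look roughly like this (the example body is illustrative):

#' @examplesIf requireNamespace("hubExamples", quietly = TRUE)
#' score_model_out(
#'   model_out_tbl = hubExamples::forecast_outputs,
#'   target_observations = hubExamples::forecast_target_observations
#' )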

elray1 and others added 16 commits August 26, 2024 09:41
Co-authored-by: Zhian N. Kamvar <[email protected]>
@nikosbosse (Collaborator) left a comment

Had a first look at everything. As far as I can tell this should work - I'll check this out locally later and play around with the functions a bit more.

My main comment is related to the design of the functions. Maybe there is a way to simplify this a bit?

Some thoughts:

  • maybe we can offload more of the function documentation to scoringutils and reuse parts that are already documented there. Happy to work on this.
  • The metrics selection process seems very complicated, as users can provide either NULL, a character vector, or a list of functions. Not allowing character vectors would simplify things substantially.

I'm lacking a bit of the overall context, i.e. I don't really know where and how the functions will be used, which makes it a bit hard for me to provide helpful feedback. Overall, I think if I were to design the scoring functionality, I would probably do the following:

  • provide wrapper functions that transform from the hubverse format to the scoringutils format, leaving users with a validated forecast object that they can just pipe into score (see the sketch after this list)
  • let users score their forecasts in scoringutils land completely independently
  • provide functions to transform back to hubverse formats
  • within hubverse functions, do the same thing.
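
A rough sketch of that workflow, assuming a hypothetical as_scoringutils() transform helper (score() and summarise_scores() are existing scoringutils functions):

# transform hubverse data into a validated scoringutils forecast object
forecast <- as_scoringutils(model_out_tbl, target_observations)

# score and summarize entirely in scoringutils land
scores <- forecast |>
  scoringutils::score() |>
  scoringutils::summarise_scores(by = "model")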

This would simplify the existing code a lot. I can see the appeal of having a single score_model_out function though.
In that case I would probably try to merge get_metrics() and get_metrics_default() (and get rid of get_metrics_character()) or even just have an if clause in score_model_out() to get rid of the se/ae depending on whether you're scoring the median or mean.

Happy to jump on a call and discuss this more if you think it would be helpful.

R/score_model_out.R Outdated Show resolved Hide resolved
R/score_model_out.R Outdated Show resolved Hide resolved
R/score_model_out.R Outdated Show resolved Hide resolved
R/score_model_out.R Show resolved Hide resolved
elray1 and others added 2 commits August 28, 2024 11:50
Co-authored-by: Nikos Bosse <[email protected]>
@elray1 (Contributor, Author) commented Aug 28, 2024

Thanks for the review, @nikosbosse! It's helpful to hear your ideas and comments.

Quick response to two easy points:

  • "maybe we can offload more of the function documentation to scoringutils/reuse parts that are already documented there". This sounds great.
  • "provide wrapper functions that transform from the hubverse format to the scoringutils format, leaving them with a validated forecast object that they can just pipe into score". This makes sense, and I see it was actually written down in issue #11 (write function that translates from hubverse format to scoringutils-ready format), which I had forgotten about, with the idea that score_model_out would call that function.
    • I started on this refactor with the idea that it could be included in this PR, but then decided that doing so would make review more burdensome, especially since both you and Zhian have already looked at the PR in its current state. So I propose that we finish this PR as-is, with the understanding that I will go ahead with that refactor immediately afterward.

I think most of your remaining comments come down to two points (let me know if there are other ideas I'm missing):

  1. do we really want to support character vectors for the metrics? and
  2. should we standardize a workflow of (a) transform to scoringutils format; (b) do all the scoring and summarizing; (c) transform back to hubverse format?

For point 1, my main reason for supporting character vectors is that I'd like to make it so users don't have to deal with assembling lists of functions themselves. This is particularly awkward for the interval coverage piece. Compare the following code to compute WIS, AE, and 80% and 90% interval coverage using a list of functions vs. character strings. I would personally always choose the interface based on character strings here.

# metrics specified with a list of functions
scores_v1 <- score_model_out(
  model_out_tbl = hubExamples::forecast_outputs |>
    dplyr::filter(.data[["output_type"]] == "quantile"),
  target_observations = hubExamples::forecast_target_observations,
  metrics = c(
    scoringutils::metrics_quantile(select = c("wis", "ae_median", "interval_coverage_90")),
    list(
      interval_coverage_80 = purrr::partial(scoringutils::interval_coverage, interval_range = 80)
    )
  ),
  by = "model_id"
)

# metrics specified with character strings
scores_v2 <- score_model_out(
  model_out_tbl = hubExamples::forecast_outputs |>
    dplyr::filter(.data[["output_type"]] == "quantile"),
  target_observations = hubExamples::forecast_target_observations,
  metrics = c("wis", "ae_median", "interval_coverage_80", "interval_coverage_90"),
  by = "model_id"
)

For point 2, your comment makes sense. There are two things about this that are making me hesitate:

  • the "transform back to hubverse format" really just involves changing the column name from "model" to "model_id". There is also a subsetting of columns in there, though I'm not sure that falls under a heading of "transform to hubverse format" so much as "ensure we return a standardized set of output columns". Main point here being, I'm not sure it's worth pulling those items out into a separate function. The stuff they do is pretty minimal, and I think users who are manually working with scoringutils functions can handle the column renaming themselves if they want.
  • another related thought is that I'd prefer users of the package to be able to stick fully with the hubverse naming convention of "model_id" instead of "model" when using hubEvals::score_model_out. This is mainly relevant to the by specification. To support the use of by = "model_id" in the above examples, I've done the renaming of "model" --> "model_id" before the call to scoringutils::summarise_scores. This means that the workflow inside of score_model_out is:
  1. convert to scoringutils format
  2. compute "raw scores"
  3. convert to a standardized hubverse format: mainly, change column name back to "model_id" (and also drop predicted/observed columns if present)
  4. compute score summaries using by

If we switch the order of steps 3 and 4, I guess we could change any entries of by that are equal to "model_id" to "model" before calling summarise_scores, but that feels like an awkward step just to support moving that scoringutils::summarise_scores call up.
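
For concreteness, a sketch of that alternative ordering (the internals shown are hypothetical):

# translate hubverse column names in `by` before summarizing...
su_by <- replace(by, by == "model_id", "model")
scores <- scoringutils::summarise_scores(raw_scores, by = su_by)
# ...then rename back to the hubverse convention afterwards
scores <- dplyr::rename(scores, model_id = "model")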

(Like you, I'd be happy to get on a call to discuss further! I thought getting these first thoughts down in a comment on the PR would be good form, though.)

@elray1 (Contributor, Author) commented Sep 5, 2024

Hi @nikosbosse and @zkamvar -- checking in; do you all have any more comments or suggestions on this PR? Happy to set up a time to chat on a call if that would be helpful. Thanks!

@zkamvar (Member) left a comment

I realize that I started a review and then never finished it -_-

I was thinking about the character input controversy and I noticed that you are already doing this with the intervals: create a function factory that returns a list of interval coverage partials.

interval_coverage_of <- function(range) {
  res <- lapply(range, function(r) {
    purrr::partial(scoringutils::interval_coverage, interval_range = r)
  })
  names(res) <- sprintf("interval_coverage_%s", range)
  return(res)
}

With this, then the user's call becomes simplified and we don't have to jump through hoops to validate the character inputs.

# metrics specified with a list of functions
scores_v1 <- hubExamples::forecast_outputs |>
  dplyr::filter(.data[["output_type"]] == "quantile") |>
  score_model_out(
    target_observations = hubExamples::forecast_target_observations,
    metrics = c(
      scoringutils::metrics_quantile(select = c("wis", "ae_median")),
      interval_coverage_of(range = c(80, 90))
    ),
    by = "model_id" 
  )

In R/score_model_out.R:
metrics <- get_metrics(metrics, output_type, output_type_id_order)

# assemble data for scoringutils
su_data <- switch(output_type,
@nikosbosse (Collaborator) commented Sep 9, 2024:

small note - long-term it might be a bit more elegant to implement transform_pmf_model() as an S3 method. Then we could omit the switch and just call transform_model_out(model_out_tbl, target_observations)
(don't think it makes much sense to do that now though)
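
A sketch of that S3 idea (the generic, methods, and classes are all hypothetical):

transform_model_out <- function(model_out_tbl, target_observations, ...) {
  UseMethod("transform_model_out")
}

transform_model_out.pmf <- function(model_out_tbl, target_observations, ...) {
  # pmf-specific assembly of scoringutils data would live here
}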

@elray1 (Contributor, Author) replied:

At minimum, this switch will be refactored into a transform_model_out function per issue #11.

Currently, the hubverse tooling does not have separate S3 classes per output type; mostly, our tools accept data frames containing a mix of output types. But it could be worth thinking about whether there are other places (e.g. plotting?) where functionality is specific to the output type and having this kind of class structure would be helpful.

#'
#' @noRd
get_metrics_default <- function(output_type, output_type_id_order) {
metrics <- switch(output_type,
@nikosbosse (Collaborator) commented:

we'll be changing this in scoringutils by replacing functions like metrics_quantile(), metrics_point() etc. with a single get_metrics() function that is an S3 generic. So if you called get_metrics(forecast_quantile_object) you'd get the quantile metrics. This would simplify this code because one could transform the forecasts into a forecast object first and then rely on S3 to get the correct metrics.
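
Under that design, this code might reduce to something like the following (sketched against the planned scoringutils interface):

forecast <- scoringutils::as_forecast_quantile(su_data)
metrics <- scoringutils::get_metrics(forecast)  # S3 dispatch picks the quantile metrics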

@nikosbosse (Collaborator) commented:

@elray1 apologies for the delayed reply. Overall I think the PR is fine, albeit a bit complicated. I imagine that the planned change in scoringutils (epiforecasts/scoringutils#832) might bring some simplifications.

Depending on your desired timeline, I could either try to implement the mentioned change in scoringutils as quickly as possible and then make a second PR targeting this one that simplifies it and incorporates the update. Alternatively, we could merge this first and I'd do my PR afterwards.

I don't think we are in much disagreement regarding the overall workflow. (I would personally ask users to use scoringutils directly and avoid the wrappers, but I definitely see the rationale for that choice and think it's valid).
(One other potential idea: splitting the "get your metrics" functionality out of score_model_out(), i.e. have a separate function that users can use if they want to e.g. specify metrics using a string, and that function would output a list of functions ready for scoringutils. It would feel cleaner code-wise, but not sure 🤷)

Assuming we're not planning any changes to the overall workflow and how users interact with the functions, we probably don't need a call. The remaining issues are mostly implementation details of how to simplify the code and maybe the docs.

@elray1 (Contributor, Author) commented Sep 9, 2024

Thanks, Nikos. My general preference would be to merge this in sooner rather than later to keep PRs lightweight and reviews easier. That said, I plan to bring the question about what interface(s) we want to provide to the next devteam call to get some broader input. That means a merge of this PR won't happen until ~Wednesday or Thursday this week, so if the S3-based get_metrics function in scoringutils could be written on that timeline, I'd be happy to incorporate it here. Otherwise, I think it's perfectly fine to make that update in a separate PR.

I think we're up to 5 (not necessarily mutually exclusive) ideas for what the interface might look like now:

  1. score_model_out(..., metrics = <NULL>, ...), we provide recommended defaults depending on output type
  2. score_model_out(..., metrics = <character vector>, ...), we parse interval coverage rates to assemble a list of functions
  3. score_model_out(..., metrics = <list of functions>, ...)
  4. score_model_out(..., metrics = <list of functions>, ...), we provide an interval_coverage_of function factory to help with interval functions
  5. score_model_out(..., metrics = <list of functions>, ...), we provide a separate get_metrics function that accepts a character vector and returns a list of functions, essentially making part of what's done behind the scenes in item 2 a public method. (Though I guess we should think of a name other than get_metrics to avoid namespace collision with scoringutils::get_metrics... 🤔)

As I said above, I'm planning to bring the question of which of these we should support to the hubverse devteam call to get broader input before merging this.

@nikosbosse (Collaborator) commented:

@elray1 sounds good! Let me know in case you'd like me to join the devteam call.

@elray1 (Contributor, Author) commented Sep 9, 2024

@nikosbosse you're always welcome to join the hubverse devteam call, but I don't think there's any specific need in this instance. Do you have the call info, or would you like me to get you added to the calendar invite?

@nikosbosse (Collaborator) commented:

Sounds good. I don't have the details. Feel free to add me whenever you think it's useful for me to attend; I think it only makes sense for me to join in instances where you feel it's valuable.

@elray1 (Contributor, Author) commented Sep 11, 2024

The decision on the hubverse devteam call was to support options 1 and 4 on the list above, namely:

  • score_model_out(..., metrics = <NULL>, ...), we provide recommended defaults depending on output type
  • score_model_out(..., metrics = <list of functions>, ...), we provide an interval_coverage_of function factory to help with interval functions

I will update this PR to reflect that decision.

@elray1 (Contributor, Author) commented Sep 11, 2024

I just realized that this plan is slightly awkward for users in light of the changes that @nikosbosse is implementing over in epiforecasts/scoringutils#903, which will create S3 methods for each forecast class. For users of hubEvals::score_model_out, an object of class forecast_* won't be available yet at the time they call hubEvals::score_model_out, so they won't be able to just call scoringutils::get_metrics(). Instead, they would need to call the S3 method for the forecast class that will eventually be created, e.g.:

quantile_scores <- score_model_out(
  model_out_tbl = hubExamples::forecast_outputs |>
    dplyr::filter(.data[["output_type"]] == "quantile"),
  target_observations = hubExamples::forecast_target_observations,
  metrics = c(
    scoringutils::get_metrics.forecast_quantile(select = c("wis", "ae_median")),
    interval_coverage_of(range = c(80, 90))
  ),
  by = c("model_id")
)

Making this even worse, the names of these things don't always exactly align with the names they have in the hubverse. Two examples:

  1. In the hubverse we have a single pmf output type that may be used to represent either nominal or ordinal forecasts, while scoringutils is planned to have separate forecast_nominal and forecast_ordinal classes.
  2. The hubverse distinguishes between mean and median forecasts, while scoringutils just has forecast_point.

So the right scoringutils::get_metrics.forecast_* would require more detailed knowledge of how hubverse and scoringutils concepts align than I'd ideally like to ask of our users.

@nikosbosse (Collaborator) commented:

We also made a change that the example data in scoringutils is now pre-validated. So you can call get_metrics(example_quantile) instead.

We could also just have a function with the default metrics here that simply wraps get_metrics().

@elray1 (Contributor, Author) commented Sep 11, 2024

OK, if I'm understanding the second part of the suggestion correctly, this sounds more like item 5 on the list above. So, we'd provide something like:

#' Get metrics functions to use with `scoringutils::score`
#'
#' @param metrics Character vector: names of metrics. See documentation for scoringutils::get_metrics for options.
#' @param interval_coverage Numeric vector: interval coverage levels. Each entry must be between 0 and 100
#' @param output_type Character string: the `output_type` of the model outputs that will be scored
#' @param is_ordinal Boolean: indicator of whether the target is nominal (`is_ordinal = FALSE`) or ordinal (`is_ordinal = TRUE`).  Relevant only if the `output_type` is `"pmf"`.
#'
#' @return named list of metric functions
get_metric_functions <- function(metrics = NULL, interval_coverage = NULL, output_type, is_ordinal) {
  # validate that requested metric names and interval_coverage are consistent with output_type and is_ordinal

  # get metric functions based on metrics, using the appropriate forecast class

  # if applicable, add on interval coverage metrics based on interval_coverage, no string parsing required
}

Then, a hubEvals user might write:

quantile_scores <- score_model_out(
  model_out_tbl = hubExamples::forecast_outputs |>
    dplyr::filter(.data[["output_type"]] == "quantile"),
  target_observations = hubExamples::forecast_target_observations,
  metrics = get_metric_functions(
    metrics = c("wis", "ae_median"),
    interval_coverage = c(80, 90),
    output_type = "quantile"
  ),
  by = "model_id"
)
  • I think I'm on board with that?
  • Is that what you were proposing?

@nikosbosse (Collaborator) commented Sep 11, 2024

@elray1 I was thinking of something like this:

metrics_median <- function(select = NULL, exclude = NULL) {
  exclude <- unique(c("se_point", exclude))
  metrics <- scoringutils::get_metrics(scoringutils::example_point, select = select, exclude = exclude)
  return(metrics)
}

Basically you would have one metrics_<type> function for every type that you have, each returning a list of functions. And this function could just wrap the corresponding scoringutils function or alternatively alter the outputs a bit.

(The function arguments of course don't have to be called select and exclude - that's just what we use in scoringutils. You could just have a single metrics arg, or an additional interval_coverage arg for metrics_quantile(), etc.)

You could then also wrap all these metrics_<type> functions in a single big get_metrics_function() that is able to handle all types at once, but I'm not sure that's better.

@elray1 (Contributor, Author) commented Sep 11, 2024

Nick and I just had a discussion about this and made the following decision: For score_model_out, we will support two kinds of arguments:

  1. NULL: we provide our default recommended metrics, specific to the output_type.
  2. a character vector, including metrics that are supported by default by scoringutils as well as "interval_coverage_XY" strings, which will handle interval coverage rates for the most common cases like "interval_coverage_95" (see the parsing sketch below).

This addresses the concern about the complexity of this function supporting 3 kinds of arguments, as lists of functions would no longer be supported. It also maintains support for the vast majority of our users to use this function in a natural way. The downside is that score_model_out will not support users who want to use metrics that are not built into scoringutils, or interval coverage at levels like 99.9. For those users, there are two paths to allow them to do their evaluations:

  1. Short term, they can do the analysis more manually, using the transform_model_out function to get data in the format needed by scoringutils and then calling scoringutils functionality directly from there.
  2. Longer term, we can work to get additional metrics added to scoringutils so that they can be specified with strings.
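
As a rough illustration, parsing an "interval_coverage_XY" string might look like this (the helper name is hypothetical):

parse_interval_coverage <- function(metric) {
  level <- as.numeric(sub("^interval_coverage_", "", metric))
  purrr::partial(scoringutils::interval_coverage, interval_range = level)
}

# e.g., "interval_coverage_95" becomes a coverage function at the 95% level
parse_interval_coverage("interval_coverage_95")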

@elray1 (Contributor, Author) commented Sep 11, 2024

I have removed support for providing a list of functions to metrics in 587c922, and am ready for a re-review. Thanks!

@zkamvar (Member) left a comment

LGTM. It's a shame we couldn't implement the function methods, but I think this is still a good compromise.

@elray1 elray1 merged commit a6cb8a8 into main Sep 12, 2024
8 checks passed
@elray1 elray1 deleted the score_model_out branch September 12, 2024 19:38

Successfully merging this pull request may close these issues.

write wrapper function like hubEvals::score_model_out()