
Score model out #46

Merged: 35 commits merged into main on Sep 12, 2024
Conversation

@elray1 (Contributor) commented Aug 20, 2024

This would resolve #13

Current status as of Aug 22:

  • mostly implemented the functionality described in the issue statement for the mean, median, pmf, and quantile output types
  • unit tests in place


@elray1 elray1 marked this pull request as ready for review August 22, 2024 20:57
@elray1 elray1 changed the base branch from main to transform_pmf August 22, 2024 20:57
@zkamvar zkamvar self-requested a review August 22, 2024 22:14
Base automatically changed from transform_pmf to main August 23, 2024 13:32
@zkamvar (Member) left a comment

MY GLOB this is a herculean effort! I appreciate you spelling out what the scoringutils evaluations should be doing. I think your approach for doing the table joins and comparing the resulting columns is a good idea.

I (as always) have some suggestions that would reduce duplication, especially around error messages.

  1. Put the abort statements into separate functions. They don't have to be validate_x; they could be check_x or error_if_not_x. This will make testing easier and reduce code complexity.
  2. For one-line if/else blocks, change them to switch() statements to reduce complexity (and length). See the sketch below.
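
A minimal sketch of both suggestions (function and variable names here are hypothetical):

# suggestion 1: a dedicated error helper, easy to test in isolation
error_if_not_output_type <- function(output_type, valid_types, call = rlang::caller_env()) {
  if (!output_type %in% valid_types) {
    cli::cli_abort("{.arg output_type} must be one of {.val {valid_types}}.", call = call)
  }
}

# suggestion 2: a one-line if/else chain collapsed into a switch()
su_forecast_type <- switch(output_type,
  mean = ,
  median = "point",
  quantile = "quantile",
  pmf = "pmf"
)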

Additionally, given that we are using examples from hubExamples, I think it might be prudent to add it as a suggested package, create examples using @examplesIf requireNamespace("hubExamples", quietly = TRUE), and replace the load() calls with the hubExamples data. This will make these examples a bit more transparent.
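
For reference, the roxygen2 pattern being suggested would look roughly like this (the example body is illustrative):

#' @examplesIf requireNamespace("hubExamples", quietly = TRUE)
#' score_model_out(
#'   model_out_tbl = hubExamples::forecast_outputs,
#'   target_observations = hubExamples::forecast_target_observations
#' )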

elray1 and others added 16 commits August 26, 2024 09:41
Co-authored-by: Zhian N. Kamvar <[email protected]>
@nikosbosse (Collaborator) left a comment

Had a first look at everything. As far as I can tell this should work - I'll check this out locally later and play around with the functions a bit more.

My main comment is related to the design of the functions. Maybe there is a way to simplify this a bit?

Some thoughts:

  • maybe we can offload more of the function documentation to scoringutils and reuse parts that are already documented there. Happy to work on this.
  • The metrics selection process seems very complicated, as users can provide either NULL, a character vector, or a list of functions. Not allowing character vectors would simplify things substantially.

I'm lacking a bit of the overall context, i.e. I don't really know where and how the functions will be used, which makes it a bit hard for me to provide helpful feedback. Overall, I think if I were to design the scoring functionality, I would probably do the following:

  • provide wrapper functions that transform from the hubverse format to the scoringutils format, leaving users with a validated forecast object that they can just pipe into score (see the sketch after this list)
  • let users score their forecasts in scoringutils land completely independently
  • provide functions to transform back to hubverse formats
  • within hubverse functions, do the same thing.
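
A rough sketch of that workflow, assuming a hypothetical as_scoringutils() transform helper (score() and summarise_scores() are existing scoringutils functions):

# transform hubverse data into a validated scoringutils forecast object
forecast <- as_scoringutils(model_out_tbl, target_observations)

# score and summarize entirely in scoringutils land
scores <- forecast |>
  scoringutils::score() |>
  scoringutils::summarise_scores(by = "model")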

This would simplify the existing code a lot. I can see the appeal of having a single score_model_out function though.
In that case I would probably try to merge get_metrics() and get_metrics_default() (and get rid of get_metrics_character()) or even just have an if clause in score_model_out() to get rid of the se/ae depending on whether you're scoring the median or mean.

Happy to jump on a call and discuss this more if you think it would be helpful.

R/score_model_out.R Outdated Show resolved Hide resolved
R/score_model_out.R Outdated Show resolved Hide resolved
R/score_model_out.R Outdated Show resolved Hide resolved
R/score_model_out.R Show resolved Hide resolved
elray1 and others added 2 commits August 28, 2024 11:50
Co-authored-by: Nikos Bosse <[email protected]>
@elray1 (Contributor, Author) commented Aug 28, 2024

Thanks for the review, @nikosbosse! It's helpful to hear your ideas and comments.

Quick response to two easy points:

  • "maybe we can offload more of the function documentation to scoringutils/reuse parts that are already documented there". This sounds great.
  • "provide wrapper functions that transform from the hubverse format to the scoringutils format, leaving them with a validated forecast object that they can just pipe into score". This makes sense, and I see it was actually written down in issue #11 (write function that translates from hubverse format to scoringutils-ready format), which I had forgotten about, with the idea that score_model_out would call that function.
    • I started on this refactor with the idea that it could be included in this PR, but then decided that doing so would make review more burdensome, especially since both you and Zhian have already looked at the PR in its current state. So I propose that we finish this PR as-is, with the understanding that I will go ahead with that refactor immediately afterward.

I think most of your remaining comments come down to two points (let me know if there are other ideas I'm missing):

  1. do we really want to support character vectors for the metrics? and
  2. should we standardize a workflow of (a) transform to scoringutils format; (b) do all the scoring and summarizing; (c) transform back to hubverse format?

For point 1, my main reason for supporting character vectors is that I'd like to make it so users don't have to deal with assembling lists of functions themselves. This is particularly awkward for the interval coverage piece. Compare the following code to compute WIS, AE, and 80% and 90% interval coverage using a list of functions vs. character strings. I would personally always choose the interface based on character strings here.

# metrics specified with a list of functions
scores_v1 <- score_model_out(
  model_out_tbl = hubExamples::forecast_outputs |>
    dplyr::filter(.data[["output_type"]] == "quantile"),
  target_observations = hubExamples::forecast_target_observations,
  metrics = c(
    scoringutils::metrics_quantile(select = c("wis", "ae_median", "interval_coverage_90")),
    list(
      interval_coverage_80 = purrr::partial(scoringutils::interval_coverage, interval_range = 80)
    )
  ),
  by = "model_id"
)

# metrics specified with character strings
scores_v2 <- score_model_out(
  model_out_tbl = hubExamples::forecast_outputs |>
    dplyr::filter(.data[["output_type"]] == "quantile"),
  target_observations = hubExamples::forecast_target_observations,
  metrics = c("wis", "ae_median", "interval_coverage_80", "interval_coverage_90"),
  by = "model_id"
)

For point 2, your comment makes sense. There are two things about this that are making me hesitate:

  • the "transform back to hubverse format" really just involves changing the column name from "model" to "model_id". There is also a subsetting of columns in there, though I'm not sure that falls under a heading of "transform to hubverse format" so much as "ensure we return a standardized set of output columns". Main point here being, I'm not sure it's worth pulling those items out into a separate function. The stuff they do is pretty minimal, and I think users who are manually working with scoringutils functions can handle the column renaming themselves if they want.
  • another related thought is that I'd prefer users of the package to be able to stick fully with the hubverse naming convention of "model_id" instead of "model" when using hubEvals::score_model_out. This is mainly relevant to the by specification. To support the use of by = "model_id" in the above examples, I've done the renaming of "model" --> "model_id" before the call to scoringutils::summarise_scores. This means that the workflow inside of score_model_out is:
  1. convert to scoringutils format
  2. compute "raw scores"
  3. convert to a standardized hubverse format: mainly, change column name back to "model_id" (and also drop predicted/observed columns if present)
  4. compute score summaries using by

If we switch the order of steps 3 and 4, I guess we could change any entries of by that are equal to "model_id" to "model" before calling summarise_scores, but that feels like an awkward step just to support moving that scoringutils::summarise_scores call up.
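
For concreteness, a sketch of that alternative ordering (the internals shown are hypothetical):

# translate hubverse column names in `by` before summarizing...
su_by <- replace(by, by == "model_id", "model")
scores <- scoringutils::summarise_scores(raw_scores, by = su_by)
# ...then rename back to the hubverse convention afterwards
scores <- dplyr::rename(scores, model_id = "model")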

(Like you, I'd be happy to get on a call to discuss further! I thought getting these first thoughts down in a comment on the PR would be good form, though.)

@elray1 (Contributor, Author) commented Sep 5, 2024

Hi @nikosbosse and @zkamvar -- checking in; do you all have any more comments or suggestions on this PR? Happy to set up a time to chat on a call if that would be helpful. Thanks!

@zkamvar (Member) left a comment

I realize that I started a review and then never finished it -_-

I was thinking about the character input controversy and I noticed that you are already doing this with the intervals: create a function factory that returns a list of interval coverage partials.

interval_coverage_of <- function(range) {
  res <- lapply(range, function(r) {
    purrr::partial(scoringutils::interval_coverage, interval_range = r)
  })
  names(res) <- sprintf("interval_coverage_%s", range)
  return(res)
}

With this, then the user's call becomes simplified and we don't have to jump through hoops to validate the character inputs.

# metrics specified with a list of functions
scores_v1 <- hubExamples::forecast_outputs |>
  dplyr::filter(.data[["output_type"]] == "quantile") |>
  score_model_out(
    target_observations = hubExamples::forecast_target_observations,
    metrics = c(
      scoringutils::metrics_quantile(select = c("wis", "ae_median")),
      interval_coverage_of(range = c(80, 90))
    ),
    by = "model_id" 
  )

In R/score_model_out.R:
metrics <- get_metrics(metrics, output_type, output_type_id_order)

# assemble data for scoringutils
su_data <- switch(output_type,
@nikosbosse (Collaborator) commented Sep 9, 2024:

small note - long-term it might be a bit more elegant to implement transform_pmf_model() as an S3 method. Then we could omit the switch and just call transform_model_out(model_out_tbl, target_observations)
(don't think it makes much sense to do that now though)
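
A sketch of that S3 idea (the generic, methods, and classes are all hypothetical):

transform_model_out <- function(model_out_tbl, target_observations, ...) {
  UseMethod("transform_model_out")
}

transform_model_out.pmf <- function(model_out_tbl, target_observations, ...) {
  # pmf-specific assembly of scoringutils data would live here
}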

@elray1 (Contributor, Author) replied:

At minimum, this switch will be refactored into a transform_model_out function per issue #11.

Currently, the hubverse tooling does not have separate S3 classes per output type; mostly, our tools accept data frames containing a mix of output types. But it could be worth thinking about whether there are other places (e.g. plotting?) where functionality is specific to the output type and having this kind of class structure would be helpful.

#'
#' @noRd
get_metrics_default <- function(output_type, output_type_id_order) {
metrics <- switch(output_type,
@nikosbosse (Collaborator) commented:

we'll be changing this in scoringutils by replacing functions like metrics_quantile(), metrics_point() etc. with a single get_metrics() function that is an S3 generic. So if you called get_metrics(forecast_quantile_object) you'd get the quantile metrics. This would simplify this code because one could transform the forecasts into a forecast object first and then rely on S3 to get the correct metrics.
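
Under that design, this code might reduce to something like the following (sketched against the planned scoringutils interface):

forecast <- scoringutils::as_forecast_quantile(su_data)
metrics <- scoringutils::get_metrics(forecast)  # S3 dispatch picks the quantile metrics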

@nikosbosse (Collaborator) commented:

@elray1 apologies for the delayed reply. Overall I think the PR is fine, albeit a bit complicated. I imagine that the planned change in scoringutils (epiforecasts/scoringutils#832) might bring some simplifications.

Depending on your desired timeline, I could either try to implement the mentioned change in scoringutils as quickly as possible and then make a second PR targeting this one that simplifies it and incorporates the update. Alternatively, we could merge this first and I'd do my PR afterwards.

I don't think we are in much disagreement regarding the overall workflow. (I would personally ask users to use scoringutils directly and avoid the wrappers, but I definitely see the rationale for that choice and think it's valid).
(One other potential idea: splitting the "get your metrics" functionality out of score_model_out(), i.e. have a separate function that users can use if they want to e.g. specify metrics using a string, and that function would output a list of functions ready for scoringutils. It would feel cleaner code-wise, but not sure 🤷)

Assuming we're not planning any changes to the overall workflow and how users interact with the functions, we probably don't need a call. The remaining issues are mostly implementation details of how to simplify the code and maybe the docs.

@elray1 (Contributor, Author) commented Sep 9, 2024

Thanks, Nikos. My general preference would be to merge this in sooner rather than later to keep PRs lightweight and reviews easier. That said, I plan to bring the question about what interface(s) we want to provide to the next devteam call to get some broader input. That means a merge of this PR won't happen until ~Wednesday or Thursday this week, so if the S3-based get_metrics function in scoringutils could be written on that timeline, I'd be happy to incorporate it here. Otherwise, I think it's perfectly fine to make that update in a separate PR.

I think we're up to 5 (not necessarily mutually exclusive) ideas for what the interface might look like now:

  1. score_model_out(..., metrics = <NULL>, ...), we provide recommended defaults depending on output type
  2. score_model_out(..., metrics = <character vector>, ...), we parse interval coverage rates to assemble a list of functions
  3. score_model_out(..., metrics = <list of functions>, ...)
  4. score_model_out(..., metrics = <list of functions>, ...), we provide an interval_coverage_of function factory to help with interval functions
  5. score_model_out(..., metrics = <list of functions>, ...), we provide a separate get_metrics function that accepts a character vector and returns a list of functions, essentially making part of what's done behind the scenes in item 2 a public method. (Though I guess we should think of a name other than get_metrics to avoid namespace collision with scoringutils::get_metrics... 🤔)

As I said above, I'm planning to bring the question of which of these we should support to the hubverse devteam call to get broader input before merging this.

@nikosbosse (Collaborator) commented:

@elray1 sounds good! Let me know in case you'd like me to join the devteam call.

@elray1 (Contributor, Author) commented Sep 9, 2024

@nikosbosse you're always welcome to join the hubverse devteam call, but I don't think there's any specific need in this instance. Do you have the call info, or would you like me to get you added to the calendar invite?

@nikosbosse (Collaborator) commented:

Sounds good. I don't have the details. Feel free to add me whenever you think it's useful for me to attend; I think it only makes sense for me to join in instances where you feel it's valuable.

@elray1 (Contributor, Author) commented Sep 11, 2024

The decision on the hubverse devteam call was to support options 1 and 4 on the list above, namely:

  • score_model_out(..., metrics = <NULL>, ...), we provide recommended defaults depending on output type
  • score_model_out(..., metrics = <list of functions>, ...), we provide an interval_coverage_of function factory to help with interval functions

I will update this PR to reflect that decision.

@elray1 (Contributor, Author) commented Sep 11, 2024

I just realized that this plan is slightly awkward for users in light of the changes that @nikosbosse is implementing over in epiforecasts/scoringutils#903, which will create S3 methods for each forecast class. For users of hubEvals::score_model_out, an object of class forecast_* won't be available yet at the time they call hubEvals::score_model_out, so they won't be able to just call scoringutils::get_metrics(). Instead, they would need to call the S3 method for the forecast class that will eventually be created, e.g.:

quantile_scores <- score_model_out(
  model_out_tbl = hubExamples::forecast_outputs |>
    dplyr::filter(.data[["output_type"]] == "quantile"),
  target_observations = hubExamples::forecast_target_observations,
  metrics = c(
    scoringutils::get_metrics.forecast_quantile(select = c("wis", "ae_median")),
    interval_coverage_of(range = c(80, 90))
  ),
  by = c("model_id")
)

Making this even worse, the names of these things don't always exactly align with the names they have in the hubverse. Two examples:

  1. In the hubverse we have a single pmf output type that may be used to represent either nominal or ordinal forecasts, while scoringutils is planned to have separate forecast_nominal and forecast_ordinal classes.
  2. The hubverse distinguishes between mean and median forecasts, while scoringutils just has forecast_point.

So the right scoringutils::get_metrics.forecast_* would require more detailed knowledge of how hubverse and scoringutils concepts align than I'd ideally like to ask of our users.

@nikosbosse (Collaborator) commented:

We also made a change that the example data in scoringutils is now pre-validated. So you can call get_metrics(example_quantile) instead.

We could also just have a function with the default metrics here that simply wraps get_metrics().

@elray1 (Contributor, Author) commented Sep 11, 2024

OK, if I'm understanding the second part of the suggestion correctly, this sounds more like item 5 on the list above. So, we'd provide something like:

#' Get metrics functions to use with `scoringutils::score`
#'
#' @param metrics Character vector: names of metrics. See documentation for scoringutils::get_metrics for options.
#' @param interval_coverage Numeric vector: interval coverage levels. Each entry must be between 0 and 100
#' @param output_type Character string: the `output_type` of the model outputs that will be scored
#' @param is_ordinal Boolean: indicator of whether the target is nominal (`is_ordinal = FALSE`) or ordinal (`is_ordinal = TRUE`).  Relevant only if the `output_type` is `"pmf"`.
#'
#' @return named list of metric functions
get_metric_functions <- function(metrics = NULL, interval_coverage = NULL, output_type, is_ordinal) {
  # validate that requested metric names and interval_coverage are consistent with output_type and is_ordinal

  # get metric functions based on metrics, using the appropriate forecast class

  # if applicable, add on interval coverage metrics based on interval_coverage, no string parsing required
}

Then, a hubEvals user might write:

quantile_scores <- score_model_out(
  model_out_tbl = hubExamples::forecast_outputs |>
    dplyr::filter(.data[["output_type"]] == "quantile"),
  target_observations = hubExamples::forecast_target_observations,
  metrics = get_metric_functions(
    metrics = c("wis", "ae_median"),
    interval_coverage = c(80, 90),
    output_type = "quantile"
  ),
  by = "model_id"
)
  • I think I'm on board with that?
  • Is that what you were proposing?

@nikosbosse (Collaborator) commented Sep 11, 2024

@elray1 I was thinking of something like this:

metrics_median <- function(select = NULL, exclude = NULL) {
  exclude <- unique(c("se_point", exclude))
  metrics <- scoringutils::get_metrics(scoringutils::example_point, select = select, exclude = exclude)
  return(metrics)
}

Basically you would have one metrics_<type> function for every type that you have, each returning a list of functions. And this function could just wrap the corresponding scoringutils function or alternatively alter the outputs a bit.

(The function arguments of course don't have to be called select and exclude - that's just what we use in scoringutils. You could just have a single metrics arg, or an additional interval_coverage arg for metrics_quantile(), etc.)

You could then also wrap all these metrics_<type> functions in a single big get_metrics_function() that is able to handle all types at once, but I'm not sure that's better.

@elray1 (Contributor, Author) commented Sep 11, 2024

Nick and I just had a discussion about this and made the following decision: For score_model_out, we will support two kinds of arguments:

  1. NULL: we provide our default recommended metrics, specific to the output_type.
  2. a character vector, including metrics that are supported by default by scoringutils as well as "interval_coverage_XY" strings, which will handle interval coverage rates for the most common cases like "interval_coverage_95" (see the parsing sketch below).

This addresses the concern about the complexity of this function supporting 3 kinds of arguments, as lists of functions would no longer be supported. It also maintains support for the vast majority of our users to use this function in a natural way. The downside is that score_model_out will not support users who want to use metrics that are not built into scoringutils, or interval coverage at levels like 99.9. For those users, there are two paths to allow them to do their evaluations:

  1. Short term, they can do the analysis more manually, using the transform_model_out function to get data in the format needed by scoringutils and then calling scoringutils functionality directly from there.
  2. Longer term, we can work to get additional metrics added to scoringutils so that they can be specified with strings.
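
As a rough illustration, parsing an "interval_coverage_XY" string might look like this (the helper name is hypothetical):

parse_interval_coverage <- function(metric) {
  level <- as.numeric(sub("^interval_coverage_", "", metric))
  purrr::partial(scoringutils::interval_coverage, interval_range = level)
}

# e.g., "interval_coverage_95" becomes a coverage function at the 95% level
parse_interval_coverage("interval_coverage_95")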

@elray1 (Contributor, Author) commented Sep 11, 2024

I have removed support for providing a list of functions to metrics in 587c922, and am ready for a re-review. Thanks!

@zkamvar (Member) left a comment

LGTM. It's a shame we couldn't implement the function methods, but I think this is still a good compromise.

@elray1 elray1 merged commit a6cb8a8 into main Sep 12, 2024
8 checks passed
@elray1 elray1 deleted the score_model_out branch September 12, 2024 19:38

Successfully merging this pull request may close these issues.

write wrapper function like hubEvals::score_model_out()