-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Score model out #46
Score model out #46
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MY GLOB this is a herculean effort! I appreciate you spelling out what the scoringutils
evaluations should be doing. I think your approach for doing the table joins and comparing the resulting columns is a good idea.
I (as always) have some suggestions that would reduce duplication, especially around error messages.
- put the abort statements into separate functions. They don't have to be
validate_x
. They could becheck_x
orerror_if_not_x
. This will make testing easier and reduce code complexity. - for one-line
if else
blocks, change them toswitch()
statements to reduce complexity (and length).
Additionally, given that we are using examples from hubExamples
, I think it might be prudent to add it as a suggested package and create examples using @examplesIf requireNamespace("hubExamples", quiet = TRUE)
and replace the load()
calls with the hubExamples data. This will make these test examples a bit more transparent.
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a first look at everything. As far as I can tell this should work - I'll check this out locally later and play around with the functions a bit more.
My main comment is related to the design of the functions. Maybe there is a way to simplify this a bit?
Some thoughts:
- maybe we can offload more of the function documentation to
scoringutils
/reuse parts that are already documented there. Happy to work on this. - The metrics selection process seems very complicated, as users can provide either
NULL
, a character vector, or a list of functions. Not allowing character vectors would simplify things substantially
I'm lacking a bit of the overall context, i.e. I don't really know where and how the functions will be used which makes it a bit hard for me to provide helpful feedback. Overall, I think if I were to design the scoring functionality, I would probably do the following:
- provide wrapper functions that transform from the hubverse format to the scoringutils format, leaving them with a validated
forecast
object that they can just pipe intoscore
- let users score their forecasts in scoringutils land completely independently
- provide functions to transform back to hubverse formats
- within hubverse functions, do the same thing.
This would simplify the existing code a lot. I can see the appeal of having a single score_model_out
function though.
In that case I would probably try to merge get_metrics()
and get_metrics_default()
(and get rid of get_metrics_character()
) or even just have an if clause in score_model_out()
to get rid of the se/ae depending on whether you're scoring the median or mean.
Happy to jump on a call and discuss this more if you think it would be helpful.
Co-authored-by: Nikos Bosse <[email protected]>
Co-authored-by: Nikos Bosse <[email protected]>
Thanks for the review, @nikosbosse! It's helpful to hear your ideas and comments. Quick response to two easy points:
I think most of your remaining comments come down to two points (let me know if there are other ideas I'm missing):
For point 1, my main reason for supporting character vectors is that I'd like to make it so users don't have to deal with assembling lists of functions themselves. This is particularly awkward for the interval coverage piece. Compare the following code to compute WIS, AE, and 80% and 95% interval coverage levels using a list of functions vs character strings. I would personally always choose to use the interface based on character strings here. # metrics specified with a list of functions
scores_v1 <- score_model_out(
model_out_tbl = hubExamples::forecast_outputs |>
dplyr::filter(.data[["output_type"]] == "quantile"),
target_observations = hubExamples::forecast_target_observations,
metrics = c(
scoringutils::metrics_quantile(select = c("wis", "ae_median", "interval_coverage_90")),
list(
interval_coverage_80 = purrr::partial(scoringutils::interval_coverage, interval_range = 80)
)
),
by = "model_id"
)
# metrics specified with character strings
scores_v2 <- score_model_out(
model_out_tbl = hubExamples::forecast_outputs |>
dplyr::filter(.data[["output_type"]] == "quantile"),
target_observations = hubExamples::forecast_target_observations,
metrics = c("wis", "ae_median", "interval_coverage_80", "interval_coverage_90"),
by = "model_id"
) For point 2, your comment makes sense. There are two things about this that are making me hesitate:
If we switch the order of steps 3 and 4, I guess we could change any entries of (Like you, I'd be happy to get on a call to discuss further! Thought getting these first thoughts down in a comment on the PR would be good form though.) |
Hi @nikosbosse and @zkamvar -- checking in; do you all have any more comments or suggestions on this PR? Happy to set up a time to chat on a call if that would be helpful. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize that I started a review and then never finished it -_-
I was thinking about the character input controversy and I noticed that you are already doing this with the intervals: create a function factory that returns a list of interval coverage partials.
interval_coverage_of <- function(range) {
res <- lapply(range, function(r) {
purrr::partial(scoringutils::interval_coverage, interval_range = range)
})
names(res) <- sprintf("interval_coverage_%", range)
return(res)
}
With this, then the user's call becomes simplified and we don't have to jump through hoops to validate the character inputs.
# metrics specified with a list of functions
scores_v1 <- hubExamples::forecast_outputs |>
dplyr::filter(.data[["output_type"]] == "quantile") |>
score_model_out(
target_observations = hubExamples::forecast_target_observations,
metrics = c(
scoringutils::metrics_quantile(select = c("wis", "ae_median")),
interval_coverage_of(range = c(80, 90))
),
by = "model_id"
)
metrics <- get_metrics(metrics, output_type, output_type_id_order) | ||
|
||
# assemble data for scoringutils | ||
su_data <- switch(output_type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small note - long-term it might be a bit more elegant to implement transform_pmf_model()
as an S3 method. Then we could omit the switch
and just call transform_model_out(model_out_tbl, target_observations)
(don't think it makes much sense to do that now though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At minimum, this switch
will be refactored into a transform_model_out
function per issue #11.
Currently, the hubverse tooling does not have separate S3 classes per output type; mostly, our tools accept data frames containing a mix of output types. But it could be worth thinking about whether there are other places (e.g. plotting?) where functionality is specific to the output type and having this kind of class structure would be helpful.
#' | ||
#' @noRd | ||
get_metrics_default <- function(output_type, output_type_id_order) { | ||
metrics <- switch(output_type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll be changing this in scoringutils
by replacing functions like metrics_quantile()
, metrics_point()
etc. with a single get_metrics()
function that is an S3 generic. So if you called get_metrics(forecast_quantile_object)
you'd get the quantile metrics. This would simplify this code because one could transform the forecasts into a forecast object first and then rely on S3 to get the correct metrics.
@elray1 apologies for the delayed reply. Overall I think the PR is fine, albeit a bit complicated. I imagine that this planned change in Depending on your desired timeline, I could either try and implement the mentioned change in I don't think we are in much disagreement regarding the overall workflow. (I would personally ask users to use Assuming we're not planning any changes for the overall workflow and how users interact with the functions we probably don't need a call. Most issues are mostly implementation details of how to simplify the code and maybe the docs. |
Thanks, Nikos. My general preference would be to merge this in sooner than later to keep PRs lightweight and reviews easier. That said, I plan to bring the question about what interface(s) we want to provide to the next devteam call to get some broader input. That means a merge of this PR won't happen until ~Wednesday or Thursday this week, so if writing the S3-based I think we're up to 5 (not necessarily mutually exclusive) ideas for what the interface might look like now:
As I said above, I'm planning to bring the question of which of these we should support to the hubverse devteam call to get broader input before merging this. |
@elray1 sounds good! Let me know in case you'd like me to join the devteam call. |
@nikosbosse you're always welcome to join the hubverse devteam call, but I don't think there's any specific need in this instance. Do you have the call info, or would you like me to get you added to the calendar invite? |
sounds good. I don't have the details. Feel free to add me whenever you think it's useful for me to attend but I think it only makes sense for me to join in instances where you feel it's valuable. |
The decision on the hubverse devteam call was to support options 1 and 4 on the list above, namely:
I will update this PR to reflect that decision. |
I just realized that this plan is slightly awkward for users in light of the changes that @nikosbosse is implementing over in epiforecasts/scoringutils#903, which will create S3 methods for each forecast class. For users of quantile_scores <- score_model_out(
model_out_tbl = hubExamples::forecast_outputs |>
dplyr::filter(.data[["output_type"]] == "quantile"),
target_observations = hubExamples::forecast_target_observations,
metrics = c(
scoringutils::get_metrics.forecast_quantile(select = c("wis", "ae_median")),
interval_coverage_of(range = c(80, 90))
),
by = c("model_id")
) Making this even worse, the names of these things don't always exactly align with the names they have in the hubverse. Two examples:
So the right |
We also made a change that the example data in scoringutils is now pre-validated. So you can call We could also just have a function with the default metrics here that simply wraps |
OK, if I'm understanding the second part of the suggestion correctly, this sounds more like item 5 on the list above. So, we'd provide something like: #' Get metrics functions to use with `scoringutils::score`
#'
#' @param metrics Character vector: names of metrics. See documentation for scoringutils::get_metrics for options.
#' @param interval_coverage Numeric vector: interval coverage levels. Each entry must be between 0 and 100
#' @param output_type Character string: the `output_type` of the model outputs that will be scored
#' @param is_ordinal Boolean: indicator of whether the target is nominal (`is_ordinal = FALSE`) or ordinal (`is_ordinal = TRUE`). Relevant only if the `output_type` is `"pmf"`.
#'
#' @return named list of metric functions
get_metric_functions(metrics = NULL, interval_coverage = NULL, output_type, is_ordinal) {
# validate that requested metric names and interval_coverage are consistent with output_type and is_ordinal
# get metric functions based on metrics, using the appropriate forecast class
# if applicable, add on interval coverage metrics based on coverage_levels, no string parsing required
} Then, a hubEvals user might write: quantile_scores <- score_model_out(
model_out_tbl = hubExamples::forecast_outputs |>
dplyr::filter(.data[["output_type"]] == "quantile"),
target_observations = hubExamples::forecast_target_observations,
metrics = get_metric_functions(
metrics = c("wis", "ae_median"),
interval_coverage = c(80, 90),
output_type = "quantile"
),
by = "model_id"
)
|
@elray1 I was thinking of something like this:
Basically you would have one (the function arguments of course don't have to be called You could then also wrap all these |
Nick and I just had a discussion about this and made the following decision: For
This addresses the concern about the complexity of this function supporting 3 kinds of arguments, as lists of functions would no longer be supported. It also maintains support for the vast majority of our users to use this function in a natural way. The downside is that it means that means that
|
I have removed the support for providing a list of functions to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. It's a shame we couldn't implement the function methods, but I think this is still a good compromise.
This would resolve #13
Current status as of Aug 22: