`pilot_tune_grid` function for running a few quick fits #722

kbodwin · 2023-09-21T23:31:51Z

Hi friends,

This is a cute li'l add-on function that I want to propose, based on this fosstodon post.

Basically it takes a the setup of a tune_grid() process, and runs the fit/validate on only the largest and smallest grid values; and only on the first resampling split. The idea is to quick-test the workflow spec(s) on the grid values where they are most likely to break, in order to spot mistakes and poor choices in the recipe or grid, before running a massive or costly tuning process.

It only required some very trivial tweaks to tune_grid(), and you might argue that it's not necessary; certainly a savvy user could easily chop down their grid and their resamples manually.

To that I say:
(1) if they are relying on the automatic gridding then it's a few extra fiddly steps to make and then reduce a grid manually; and,
(2) there's something really elegant to me about running workflow_map("pilot_tune_grid") with all of your workflowsets and chosen options, and then once you've troubleshooted all the issues that come up, you only need to change "pilot_tune_grid" to "tune_grid" to run the whole shebang.

It also might make sense to add some options, like to use 2 resample splits instead of 1, or to use all extreme combinations of grid parameters instead of just "all minimum values" and "all maximum values". But at some point, yeah, the user should just make the reduced grid themselves.

Anyhoo. I y'all decide you like it, I'm happy to do the first pass on unit tests and maybe making a pilot_fit_resamples() (mainly for use with workflowsets, not much value to it on one model).

simonpcouch

Heard. I'm into this idea.🙂

Here's an initial pass on review! I'd say thumbs up on moving on to unit tests, waiting on pilot_fit_resamples() pending on further discussion about how agnostic to the tuning approach this function should be oriented.

simonpcouch · 2023-09-22T16:41:14Z

R/pilot_tune_grid.R

+#' parameters that correspond to a model or recipe and fits the model on only
+#' one of the resample splits, for purposes of quick error checking.
+#'
+#' @param object A `parsnip` model specification or a [workflows::workflow()].


Let's @inheritParams these.

simonpcouch · 2023-09-22T16:45:05Z

R/pilot_tune_grid.R

+#' See [tune_grid()] for details of defaults and computation.
+#'
+#' @export
+pilot_tune_grid <- function(object, ...) {


Some hesitation from me on the name. Ostensibly this could be useful in racing and simulated annealing as well, and in that case I think this name would suggest we ought to also have methods for those other tuning approaches, so maybe a name that's agnostic to the tuning approach?

I had thought tune_pilot() for a moment but I like your convention of breaking from the tune_*() pattern to be clear this ought not be used in place of others matching the pattern. Max suggested pilot_candidates(). Not sure if I have strong suggestions on names yet.

simonpcouch · 2023-09-22T16:47:55Z

R/pilot_tune_grid.R

+  )
+
+  # Cut grid down to extreme values only for piloting
+  grid <- grid |>


Suggested change

grid <- grid |>

grid <- grid %>%

Let's use the magrittr pipe—we don't yet require R 4.1.0 from our users!

tune/DESCRIPTION

Line 18 in 74854a5

R (>= 4.0)

And here I thought I was being hip and modern in this PR... because I haven't actually converted from base pipe in my own work yet! I'll swap it back.

simonpcouch · 2023-09-22T16:48:44Z

R/pilot_tune_grid.R

+
+# ------------------------------------------------------------------------------
+
+pilot_tune_grid_workflow <- function(workflow,


I appreciate your eye for the way other tuning functions do this in your architecting.🙂

copy-paste-tweak ftw

simonpcouch · 2023-09-22T16:51:20Z

R/pilot_tune_grid.R

+
+  # Cut grid down to extreme values only for piloting
+  grid <- grid |>
+    arrange(across(matches("."))) |>


I think we might want to look into a more performant variant of this line if we imagine this helper being iterated over, as in workflow_map().

Also, we try to call almost everything by namespace in case it is ever used in PSOCK clusters for parallel processing.

I'll pull it into a helper function (choose_grid_candidates or something) and then that helper function can be changed in the future.

simonpcouch · 2023-09-22T16:56:33Z

R/pilot_tune_grid.R

+    wflow <- add_formula(wflow, preprocessor)
+  }
+
+  tune_grid(


I think we'd like this to call pilot_tune_grid.workflow(), via pilot_tune_grid()?

Ah yes, that was the intent, thanks.

simonpcouch · 2023-09-22T16:57:57Z

R/pilot_tune_grid.R

+  resamples <- new_bare_tibble(resamples)
+
+  # Use only the first resampling split
+  resamples <- resamples |> slice(1)


Suggested change

resamples <- resamples |> slice(1)

resamples <- resamples |> vctrs::vec_slice(1)

With an eye for performance.

There are rules for modifying rset objects. Almost always, slicing them will revert the results to basic tibbles. For example, if we have a 10-fold CV rset and get rid of some rows it isn't the same thing and we drop it's rsample classes (and perhaps some attributes too).

We do this in racing and this function is there to restructure/reclass the object.

I believe the classes are handled correctly here via the extraction of pull_rset_attributes() and passing a bare tibble directly to tune_grid_loop().

This was my understanding too, that resamples gets stripped down to plain tibble first and so I could slice it as such. But I only understand the architecture here enough to be dangerous so please do let me know if I should approach this differently!

simonpcouch · 2023-09-22T16:58:06Z

R/pilot_tune_grid.R

+  # Cut grid down to extreme values only for piloting
+  grid <- grid |>
+    arrange(across(matches("."))) |>
+    slice(1, nrow(grid))


Suggested change

slice(1, nrow(grid))

vctrs::vec_slice(1, nrow(grid))

topepo · 2023-09-22T17:11:14Z

I'm positive about doing this but we need to consider the use case and goals. Would it be:

"A way to test that our models would work when all grid points are used with the resamples before doing all of the computing"

"All of the grid points", as you said, may not be the goal. It wouldn't be hard to figure out which candidates are on the outskirts of the hyper-parameter space and test some of those.

Also, should we use the resamples? Would the training set (or some random fraction) be just as good?

I was thinking of pilot_candidates() to move it away from the formal concept of grid search.

EmilHvitfeldt · 2023-09-22T17:39:02Z

I was thinking of pilot_candidates() to move it away from the formal concept of grid search.

I like the idea that this new function will have a similar interface to functions like tune_grid() in the sense that it should be minimal work to go between tune_grid() and New_function() when you write

kbodwin · 2023-09-26T21:37:31Z

"All of the grid points", as you said, may not be the goal. It wouldn't be hard to figure out which candidates are on the outskirts of the hyper-parameter space and test some of those.

Yes, I don't think we'd want to try all grid points - just looking for some breaking on the extremes before running the full tuning and workflow map.

Right now it's written to try one grid set with all hyperparameters minimized, and one with all maximized. I agree that there's probably a smarter way to select which ones to try, but it may be beyond my skills. :)

Also, should we use the resamples? Would the training set (or some random fraction) be just as good?

It seems really important to me that we at least use the same splitting approach as will be used in the full tune. For example, if the user is using 10 folds with a stratification, that's the type and size of resample we should "pilot" on.

I was thinking of pilot_candidates() to move it away from the formal concept of grid search.

Sure - but there's a different structure to piloting a planned grid search vs piloting a workflowset that doesn't include any tuning, so presumably those would need to be two different named functions regardless.

kbodwin · 2023-09-26T21:54:40Z

Given the cautiously optimistic replies, I'll try to steal some time this weekend to make the review changes and write a few unit tests and stuff.

Regarding the names, I initially had test_drive() or try_models() but "test" and "try" feel like very reserved words. Totally open to creative ideas that convey "this is just a trial run to stress test the design, not an actual analysis".

topepo · 2024-01-10T16:26:02Z

Sorry for the inactivity. We're focused on myriad survival analysis changes right now.

simonpcouch · 2024-03-26T18:23:53Z

Coming after to this after a good while: I'm going to go ahead and close in favor of #872. I think the interface proposed there is more likely to be the approach that we'd go for, but glad that we have this PR to come back to to see some discussion points.

Thank you, @kbodwin!

github-actions · 2024-04-10T00:34:16Z

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

kbodwin added 4 commits September 21, 2023 18:13

add pilot_tune_grid function

1bf5ef8

update docs

6dc4d37

update imports

e456194

update docs

2d65b9a

simonpcouch reviewed Sep 22, 2023

View reviewed changes

simonpcouch mentioned this pull request Mar 18, 2024

FR: Add option to control_* functions to do a test run on only one resample to check for warnings/errors #872

Open

simonpcouch closed this Mar 26, 2024

github-actions bot locked and limited conversation to collaborators Apr 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`pilot_tune_grid` function for running a few quick fits #722

`pilot_tune_grid` function for running a few quick fits #722

kbodwin commented Sep 21, 2023

simonpcouch left a comment

simonpcouch Sep 22, 2023

simonpcouch Sep 22, 2023

simonpcouch Sep 22, 2023

kbodwin Sep 26, 2023

simonpcouch Sep 22, 2023

kbodwin Sep 26, 2023

simonpcouch Sep 22, 2023

topepo Sep 22, 2023

kbodwin Sep 26, 2023

simonpcouch Sep 22, 2023

kbodwin Sep 26, 2023

simonpcouch Sep 22, 2023

topepo Sep 22, 2023

simonpcouch Sep 22, 2023

kbodwin Sep 26, 2023

simonpcouch Sep 22, 2023

topepo commented Sep 22, 2023

EmilHvitfeldt commented Sep 22, 2023

kbodwin commented Sep 26, 2023

kbodwin commented Sep 26, 2023

topepo commented Jan 10, 2024

simonpcouch commented Mar 26, 2024

github-actions bot commented Apr 10, 2024


		# ------------------------------------------------------------------------------

		pilot_tune_grid_workflow <- function(workflow,

	resamples <- resamples \|> slice(1)
	resamples <- resamples \|> vctrs::vec_slice(1)

pilot_tune_grid function for running a few quick fits #722

pilot_tune_grid function for running a few quick fits #722

Conversation

kbodwin commented Sep 21, 2023

simonpcouch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topepo commented Sep 22, 2023

EmilHvitfeldt commented Sep 22, 2023

kbodwin commented Sep 26, 2023

kbodwin commented Sep 26, 2023

topepo commented Jan 10, 2024

simonpcouch commented Mar 26, 2024

github-actions bot commented Apr 10, 2024

`pilot_tune_grid` function for running a few quick fits #722

`pilot_tune_grid` function for running a few quick fits #722