-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalize metadata merging with data #104
Conversation
@LDSamson & @aclark02-arcus, what are your opinions on this approach to study specific modifications in It obviously will need to be extended out further before merging and should probably not have a default value. But this approach will allow us to do more data checking on the process. Much better than just piping additional changes onto the end of |
Yes, I thought of going into this direction. Do you think we need to implement multiple functions, or can just use one function? In the latter case, we can maybe provide the function name as a string, and just do something like do.call("custom_function_study_x", list(merged_data)). In this case, we could define the custom function name in either the config.yml or in the metadata. What do you think? |
I like the idea of being able to pass multiple functions. This would allow an org to segment that work up. (e.g. Some modifications might be needed for only one study, but it shares modifications with other studies. Then the additional mods could just be piped onto the general function). Since If part of the motivation is to utilize base R over |
I do see some benefit in adding it to the configuration file. Since the pre-processing is probably happening in the same active configuration that the app is deployed in, it could allow an organization to have a uniform pre-processing script where the difference is the active configuration. |
Thanks for describing the use case. I think that makes sense indeed, to be able to use multiple functions. For our internal use, it would be great if we can pass the name of the custom functions directly as something like a character vector. Regarding adding it to the config file: currently for our deployment, we are keeping the config.yml the same between studies (although this might change in the future). If we add the pre-processing functions, wouldn't it be easier to add them in the metadata since the functions will change with each study? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what you guys think about this change. Basic idea is that there is a sheet in the metadata file where settings can be stored. Since settings could have more than one values, the default data frame is simplified to a list with NA's removed.
R/fct_appdata.R
Outdated
@@ -112,7 +112,9 @@ merge_meta_with_data <- function( | |||
"item_value" = VAL, | |||
"reason_notdone" = LBREASND | |||
) |> | |||
apply_study_specific_fixes() | |||
Reduce(\(x1, x2) do.call(x2, list(x1)), # Apply next function to output of previous | |||
meta$settings$add_fns %||% "identity", # Return merged data if no additional functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be okay to set the default to apply_study_specific_fixes
instead of identity until we added more custom settings in the metadata, and we phase out the apply_study_specific_fixes
function? Maybe we find a more fitting name than 'settings' if we implement more functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can, but I want to generalize the merge_meta_with_data()
function further in this PR. I just wanted to get input on how we wanted to allow custom modification. I see a few other things I want to complete in this PR.
- Generalize `add_timevars_to_data().
- Prioritize user input (e.g. If the user has already supplied
event_name
,region
, etc., the app should prioritize that value over imputation). - Add checkers at the end of the function call. Ensure that needed imputed variables are created and valid.
But if you would rather us break this into multiple PRs, I can change the default and clean things up for a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making this PR larger, adding multiple generalizations (and keeping the 'identity' in that case), is also okay for me.
I like this direction. Just brainstorming, not sure if the following would be the right way: I would see that we have a general set of custom functions that we can reuse, that would be nice. One for each variable that we want to adjust. I am just thinking: what if we could also set arguments? Or would that become too difficult? Something like this in the settings Excel tab:
If we do this, we should probably create a helper function to robustly add the correct arguments to the right functions. A more robust version of below, for each function:
Alternative is to be able to set function arguments of the custom functions in environment variables or so (maybe easier)? |
I want to keep it simple for now. It's fairly simple for a user to create wrappers of the base function with different argument defaults. I fear that adding this complexity will make the function hard to maintain and buggy. Plus right now the error produced is pretty straightforward to the end user. Plus I was thinking we could expand the setting tab to allow user modification elsewhere in the application. Such as defaults for form tables. |
Yes that makes sense, now I thought a bit longer about it. Regarding regions: see also #106 |
@LDSamson I think I want to cut off this PR here. I did not really add any additional checks (in part because it's not clear what those checks should be). I think checks and documentation on function usage (maybe a vignette?) can be added in a follow up PR. |
I detected some differences when merging real study data with metadata with the old vs the new method. I will try to find out what was the issue.. |
@LDSamson did you discover where you local difference were coming from? |
R/golem_utils_server.R
Outdated
#' If x does not exist, return y, otherwise return x | ||
#' | ||
#' @param x,y two elements to test, one potentially not existent | ||
#' @param verbose logical, indicating whether warning message should be displayed. | ||
#' | ||
#' @noRd | ||
#' | ||
#' @examples | ||
#' mtcars2 %|_|% mtcars | ||
"%|_|%" <- function(x, y, verbose = TRUE) { | ||
if (exists(deparse1(substitute(x)))) { | ||
if (verbose) warning(paste("Using user supplied", deparse(deparse1(substitute(x))), "instead of deriving."), call. = FALSE) | ||
x | ||
} else { | ||
y | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue of this not working within dplyr::mutate() is that it looks for existence in the incorrect environment.
Maybe this is a solution, can you check if this can be used in all cases?
exists(...., envir = parent.frame() )
Once you get it working, it would be great if you can also add small unit tests where you test it in different environments (e.g. in a global environment and in something a dplyr::mutate() call).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to work. I'm trying to think if there is any situation in which it would not work. Seems like the parent frame is the environment this function would always be evaluated inside of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless we can come up with a situation where using exists(..., envir = parent.frame())
will produce the wrong result, the last commit should resolve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments. Most important is to get the function %|_|%
working within mutate()
@LDSamson I'm going to add a test for |
R/golem_utils_server.R
Outdated
#' mtcars2 %|_|% mtcars | ||
"%|_|%" <- function(x, y, verbose = TRUE) { | ||
if (exists(deparse1(substitute(x)), envir = parent.frame())) { | ||
if (verbose) warning(paste("Using user supplied", deparse(deparse1(substitute(x))), "instead of deriving."), call. = FALSE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one more comment: I am fine with default showing this message, but in our production environment I prefer this not be shown as warnings, since it is intended behavior, and I am logging all warnings and errors when automatically merging data each day on our server.
Can we use something like cat("....\n")
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
No description provided.