Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize metadata merging with data #104

Merged
merged 24 commits into from
Oct 15, 2024
Merged

Conversation

jthompson-arcus
Copy link
Collaborator

No description provided.

@jthompson-arcus
Copy link
Collaborator Author

@LDSamson & @aclark02-arcus, what are your opinions on this approach to study specific modifications in merge_meta_with_data()?

It obviously will need to be extended out further before merging and should probably not have a default value. But this approach will allow us to do more data checking on the process. Much better than just piping additional changes onto the end of merge_meta_with_data(), although we have no real way of stopping a user from doing that.

@LDSamson
Copy link
Collaborator

LDSamson commented Oct 9, 2024

Yes, I thought of going into this direction.

Do you think we need to implement multiple functions, or can just use one function? In the latter case, we can maybe provide the function name as a string, and just do something like do.call("custom_function_study_x", list(merged_data)).

In this case, we could define the custom function name in either the config.yml or in the metadata.

What do you think?

@jthompson-arcus
Copy link
Collaborator Author

I like the idea of being able to pass multiple functions. This would allow an org to segment that work up. (e.g. Some modifications might be needed for only one study, but it shares modifications with other studies. Then the additional mods could just be piped onto the general function).

Since merge_meta_with_data() is a pre-processing step. I think it makes more sense to have the user manually add the functions in the call itself. I'm not entirely opposed to expanding the metadata, but ultimately we would just be capturing the function name.

If part of the motivation is to utilize base R over purrr, I can rewrite the code in base R.

@jthompson-arcus
Copy link
Collaborator Author

I do see some benefit in adding it to the configuration file. Since the pre-processing is probably happening in the same active configuration that the app is deployed in, it could allow an organization to have a uniform pre-processing script where the difference is the active configuration.

@LDSamson
Copy link
Collaborator

LDSamson commented Oct 9, 2024

Thanks for describing the use case. I think that makes sense indeed, to be able to use multiple functions.

For our internal use, it would be great if we can pass the name of the custom functions directly as something like a character vector.
I would further have a preference for using base R, if the implementation in base R is not too difficult.

Regarding adding it to the config file: currently for our deployment, we are keeping the config.yml the same between studies (although this might change in the future). If we add the pre-processing functions, wouldn't it be easier to add them in the metadata since the functions will change with each study?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what you guys think about this change. Basic idea is that there is a sheet in the metadata file where settings can be stored. Since settings could have more than one values, the default data frame is simplified to a list with NA's removed.

R/fct_appdata.R Outdated
@@ -112,7 +112,9 @@ merge_meta_with_data <- function(
"item_value" = VAL,
"reason_notdone" = LBREASND
) |>
apply_study_specific_fixes()
Reduce(\(x1, x2) do.call(x2, list(x1)), # Apply next function to output of previous
meta$settings$add_fns %||% "identity", # Return merged data if no additional functions
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be okay to set the default to apply_study_specific_fixes instead of identity until we added more custom settings in the metadata, and we phase out the apply_study_specific_fixes function? Maybe we find a more fitting name than 'settings' if we implement more functions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, but I want to generalize the merge_meta_with_data() function further in this PR. I just wanted to get input on how we wanted to allow custom modification. I see a few other things I want to complete in this PR.

  1. Generalize `add_timevars_to_data().
  2. Prioritize user input (e.g. If the user has already supplied event_name, region, etc., the app should prioritize that value over imputation).
  3. Add checkers at the end of the function call. Ensure that needed imputed variables are created and valid.

But if you would rather us break this into multiple PRs, I can change the default and clean things up for a review.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making this PR larger, adding multiple generalizations (and keeping the 'identity' in that case), is also okay for me.

@LDSamson
Copy link
Collaborator

I like this direction. Just brainstorming, not sure if the following would be the right way:

I would see that we have a general set of custom functions that we can reuse, that would be nice. One for each variable that we want to adjust.

I am just thinking: what if we could also set arguments? Or would that become too difficult?

Something like this in the settings Excel tab:

add_fns arg_name arg_val
apply_study_specific_fixes form_id_vars subject_id, event_name, item_group
  other_var test1, test2, test3
another_function    
third_function important_argument important_setting

If we do this, we should probably create a helper function to robustly add the correct arguments to the right functions. A more robust version of below, for each function:

setNames(unlist(strsplit(gsub(" ", "", na.omit(meta$settings$arg_val)), split = ", ")), na.omit(meta$settings$arg_name))

Alternative is to be able to set function arguments of the custom functions in environment variables or so (maybe easier)?

@jthompson-arcus
Copy link
Collaborator Author

I want to keep it simple for now. It's fairly simple for a user to create wrappers of the base function with different argument defaults. I fear that adding this complexity will make the function hard to maintain and buggy. Plus right now the error produced is pretty straightforward to the end user.

Plus I was thinking we could expand the setting tab to allow user modification elsewhere in the application. Such as defaults for form tables.

@LDSamson
Copy link
Collaborator

Yes that makes sense, now I thought a bit longer about it.

Regarding regions: see also #106

R/fct_data_helpers.R Outdated Show resolved Hide resolved
@jthompson-arcus
Copy link
Collaborator Author

@LDSamson I think I want to cut off this PR here. I did not really add any additional checks (in part because it's not clear what those checks should be). I think checks and documentation on function usage (maybe a vignette?) can be added in a follow up PR.

@LDSamson
Copy link
Collaborator

I detected some differences when merging real study data with metadata with the old vs the new method. I will try to find out what was the issue..

R/fct_appdata.R Outdated Show resolved Hide resolved
R/fct_appdata.R Outdated Show resolved Hide resolved
R/fct_data_helpers.R Outdated Show resolved Hide resolved
@jthompson-arcus
Copy link
Collaborator Author

@LDSamson did you discover where you local difference were coming from?

Comment on lines 56 to 72
#' If x does not exist, return y, otherwise return x
#'
#' @param x,y two elements to test, one potentially not existent
#' @param verbose logical, indicating whether warning message should be displayed.
#'
#' @noRd
#'
#' @examples
#' mtcars2 %|_|% mtcars
"%|_|%" <- function(x, y, verbose = TRUE) {
if (exists(deparse1(substitute(x)))) {
if (verbose) warning(paste("Using user supplied", deparse(deparse1(substitute(x))), "instead of deriving."), call. = FALSE)
x
} else {
y
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue of this not working within dplyr::mutate() is that it looks for existence in the incorrect environment.

Maybe this is a solution, can you check if this can be used in all cases?

exists(...., envir = parent.frame() )

Once you get it working, it would be great if you can also add small unit tests where you test it in different environments (e.g. in a global environment and in something a dplyr::mutate() call).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to work. I'm trying to think if there is any situation in which it would not work. Seems like the parent frame is the environment this function would always be evaluated inside of.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we can come up with a situation where using exists(..., envir = parent.frame()) will produce the wrong result, the last commit should resolve this.

Copy link
Collaborator

@LDSamson LDSamson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments. Most important is to get the function %|_|% working within mutate()

@jthompson-arcus
Copy link
Collaborator Author

@LDSamson I'm going to add a test for %|_|% in the test-golem_utils_server.R file.

#' mtcars2 %|_|% mtcars
"%|_|%" <- function(x, y, verbose = TRUE) {
if (exists(deparse1(substitute(x)), envir = parent.frame())) {
if (verbose) warning(paste("Using user supplied", deparse(deparse1(substitute(x))), "instead of deriving."), call. = FALSE)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one more comment: I am fine with default showing this message, but in our production environment I prefer this not be shown as warnings, since it is intended behavior, and I am logging all warnings and errors when automatically merging data each day on our server.

Can we use something like cat("....\n") instead?

Copy link
Collaborator

@LDSamson LDSamson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@LDSamson LDSamson merged commit a4d88ba into dev Oct 15, 2024
2 checks passed
@LDSamson LDSamson deleted the jt-generalize_meta_merging branch October 15, 2024 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants