General multiple replicate support for pointwise statistic distributions #86
Comments
How does option 1 make parallelization easier and avoid memory issues? I worry about doubling the number of functions. What if we:
Alternatively, many of the functions can probably use the same code for iterating the metrics and then reducing the results. We could probably create common functions ( Elyse worked on grouping for |
I like this idea - I guess we're still doubling the number of total functions, but the public functions stay the same!
Yep, definitely some shared utilities here.
Will take a look - are there plans to incorporate similar group-by logic in other places? Also not sure about the current state / plans for that PR.
I think that PR is close but it would need to be resurrected, which is something I haven't considered.
Be careful: tibbles are lists and could lead to some confusion in code you write. I think we should create a |
Background
`syntheval` currently uses one replicate for each evaluation, which obscures the critical effect of randomness in assessing synthetic data disclosure risk and utility. This issue would update `syntheval` to work with multiple replicates, enabling empirical assessment of this randomness independent of what form it might take. Here, we focus on updating existing metrics for collections of pointwise statistics, although working with multiple replicates introduces new possibilities for other metrics.
Design changes
Currently, functions in `syntheval` accept either `postsynth` or `tibble`/`data.frame`. There are two approaches we could take here:
1. Add new functions with a `_multirep` suffix (ex: `util_ci_overlap_multirep()`) that explicitly handle multiple replicate logic.
2. Have the existing functions also accept `list[postsynth]` or `list[tibble]`/`list[data.frame]`.
I'm personally in favor of option 1 for the following reasons:
Open to suggestions / feedback here!
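To make the discussion concrete, here is a minimal base R sketch of how a `_multirep` entry point could sit on top of a shared utility, so that the single-replicate logic is written only once. Everything in it is hypothetical and not part of `syntheval`: `is_replicate_list()`, `map_replicates()`, `toy_metric()`, and `toy_metric_multirep()` are made-up names, and the sketch only handles data frames, not `postsynth` objects. It also shows why input checks need care: a data frame is itself a list.

```r
# Hypothetical helper (not part of syntheval): decide whether an input is a
# single dataset or a list of replicates. Data frames are themselves lists,
# so is.data.frame() must be checked before is.list().
is_replicate_list <- function(x) {
  if (is.data.frame(x)) {
    return(FALSE)
  }
  is.list(x) && length(x) > 0 &&
    all(vapply(x, is.data.frame, logical(1)))
}

# Hypothetical shared utility: apply a single-replicate metric to each
# replicate and return the per-replicate results as a list.
map_replicates <- function(replicates, metric_fn, ...) {
  stopifnot(is_replicate_list(replicates))
  lapply(replicates, metric_fn, ...)
}

# Toy single-replicate metric standing in for a real util_*() function.
toy_metric <- function(data) {
  data.frame(
    variable = "x",
    statistic = "mean",
    value = mean(data$x),
    stringsAsFactors = FALSE
  )
}

# Option 1 style: a separate *_multirep() entry point that reuses the
# single-replicate metric unchanged.
toy_metric_multirep <- function(replicates) {
  map_replicates(replicates, toy_metric)
}

single <- data.frame(x = c(1, 2, 3))
reps <- list(data.frame(x = c(1, 2, 3)), data.frame(x = c(2, 4, 6)))

is.list(single)            # TRUE, which is why the data frame check comes first
is_replicate_list(single)  # FALSE
is_replicate_list(reps)    # TRUE
results <- toy_metric_multirep(reps)
```

The same `map_replicates()` helper could equally be called from inside an existing function under option 2, so the choice between the two options mostly affects the public interface rather than the shared code.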
Pointwise statistic distributions
The following methods admit straightforward analogues using multiple replicates by producing distributions of a collection of pointwise statistics:
util_ci_overlap.R
util_co_occurrence.R
util_ks_distance.R
util_moments.R
util_percentiles.R
util_proportions.R
util_tails.R
util_totals.R
For each pointwise statistic (eventually a row) in the one-replicate case, we replace it with distributional summary statistics in the multiple replicate case. Here's an example for `util_moments()` output:
We can also include an optional argument (akin to `simplify=FALSE`) that simply returns the evaluation metric applied to each replicate.
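As a rough illustration of the reduction step, the base R sketch below stacks per-replicate results and collapses each pointwise statistic into one row of distributional summaries. It is a sketch under assumptions, not `syntheval`'s implementation: the function name `summarise_replicates()`, the `variable`/`statistic`/`value` column layout, and the particular summaries (mean, sd, min, median, max) are all hypothetical choices. The `simplify` argument follows the behavior described above, where `simplify = FALSE` returns the metric applied to each replicate.

```r
# Hypothetical sketch (not syntheval's API): summarise a pointwise metric
# across replicates. `results` is a list of data frames, one per replicate,
# each with assumed columns `variable`, `statistic`, and `value`.
summarise_replicates <- function(results, simplify = TRUE) {
  # Tag each per-replicate result with its replicate index and stack them.
  stacked <- do.call(
    rbind,
    Map(function(res, i) cbind(replicate = i, res), results, seq_along(results))
  )
  if (!simplify) {
    # Return the metric as computed on each replicate, unsummarised.
    return(stacked)
  }
  # One output row per pointwise statistic, with distributional summaries.
  keys <- unique(stacked[c("variable", "statistic")])
  rows <- lapply(seq_len(nrow(keys)), function(k) {
    vals <- stacked$value[
      stacked$variable == keys$variable[k] &
        stacked$statistic == keys$statistic[k]
    ]
    data.frame(
      variable = keys$variable[k],
      statistic = keys$statistic[k],
      n_replicates = length(vals),
      mean = mean(vals),
      sd = sd(vals),
      min = min(vals),
      median = median(vals),
      max = max(vals),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy per-replicate output for two replicates (made-up numbers).
per_replicate <- list(
  data.frame(variable = "x", statistic = c("mean", "sd"),
             value = c(1, 2), stringsAsFactors = FALSE),
  data.frame(variable = "x", statistic = c("mean", "sd"),
             value = c(3, 4), stringsAsFactors = FALSE)
)

summary_tbl <- summarise_replicates(per_replicate)
raw_tbl <- summarise_replicates(per_replicate, simplify = FALSE)
```

Here `summary_tbl` has one row per (`variable`, `statistic`) pair, while `raw_tbl` keeps one row per replicate and statistic, which is the form a user would want for custom plots or summaries.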
Metric-specific considerations:
would need to be converted to/from the format above.