
General multiple replicate support for pointwise statistic distributions #86

Open · 8 tasks

jhseeman opened this issue Jul 11, 2024 · 4 comments

@jhseeman (Collaborator)

Background

syntheval currently uses one replicate for each evaluation, which obscures the critical effect of randomness in assessing synthetic data disclosure risk and utility. This issue would update syntheval to work with multiple replicates, enabling empirical assessment of this randomness independent of what form it takes. Here, we focus on updating existing metrics for collections of pointwise statistics, although working with multiple replicates introduces new possibilities for other metrics.

Design changes

Currently, functions in syntheval accept either a postsynth object or a tibble / data.frame. There are two approaches we could take here:

  1. Create new functions using the _multirep suffix (ex: util_ci_overlap_multirep()) that explicitly handle the multiple-replicate logic.
  2. Modify the existing functions to additionally accept list[postsynth] or list[tibble] / list[data.frame].

I'm personally in favor of option 1 for the following reasons:

  • Option 1 allows for easier parallelization and avoids potential memory issues from recursion in Option 2.
  • Option 2 could produce long functions that aren't modular, especially if the logic for multiple replicates differs significantly from single replicates.

Open to suggestions / feedback here!
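As a rough sketch of what option 1 could look like (the wrapper name, `util_ci_overlap()` signature, and `replicates` argument are all hypothetical here, not the actual syntheval API):

```r
# Hypothetical shape of a _multirep wrapper (option 1). The single-replicate
# metric util_ci_overlap() is assumed to exist with a (synth, data, ...) form.
util_ci_overlap_multirep <- function(replicates, data, ...) {
  # apply the single-replicate metric independently to each replicate;
  # swapping lapply() for a parallel map (e.g. furrr/parallel) is then trivial
  results <- lapply(replicates, util_ci_overlap, data = data, ...)

  # stack per-replicate results with an identifying replicate index
  dplyr::bind_rows(results, .id = "replicate")
}
```

Keeping the per-replicate loop in a thin wrapper like this is what makes the parallelization claim above concrete: each replicate's evaluation is an independent call, so no shared state or recursion is involved.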

Pointwise statistic distributions

The following methods admit straightforward multiple-replicate analogues: each produces a distribution over a collection of pointwise statistics:

  • util_ci_overlap.R
  • util_co_occurrence.R
  • util_ks_distance.R
  • util_moments.R
  • util_percentiles.R
  • util_proportions.R
  • util_tails.R
  • util_totals.R

For each pointwise statistic (eventually a row) in the one-replicate case, we replace it with distributional summary statistics in the multiple-replicate case. Here's an example for util_moments() output:

# A tibble: ? × 8
  variable statistic original synth_min synth_q1 synth_med synth_q3 synth_max
  <fct>    <fct>        <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
1 x1       mean           0.1      -0.5     -0.3       0.1      0.4       1.2
2 x1       mean_diff      0.0      -0.6     -0.2       0.0      0.3       1.1
# etc ...

We can also include an optional argument (akin to simplify = FALSE) that simply returns the evaluation metric applied to each replicate.
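The reduction from per-replicate tables to the distributional summary above might look something like this (a sketch only, assuming each replicate's output is a tibble with variable, statistic, original, and synthetic columns; these names are illustrative):

```r
library(dplyr)

# Hypothetical reduction step: given a list with one util_moments()-style
# tibble per replicate, collapse the synthetic values into the
# distributional summary columns shown above.
summarize_replicates <- function(per_replicate_results) {
  bind_rows(per_replicate_results, .id = "replicate") %>%
    group_by(variable, statistic, original) %>%
    summarize(
      synth_min = min(synthetic),
      synth_q1  = quantile(synthetic, 0.25),
      synth_med = median(synthetic),
      synth_q3  = quantile(synthetic, 0.75),
      synth_max = max(synthetic),
      .groups = "drop"
    )
}
```

The simplify = FALSE behavior would then just skip this reduction and return the stacked (or raw list of) per-replicate tibbles.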

Metric-specific considerations:

  • Wide vs. long format: some outputs are currently in a wider format (ex: statistic names listed as columns instead of rows, like the mean differences above) and would need to pivot to a longer format.
  • Non-tabular format: some outputs are currently in a non-tabular format (ex: correlation matrices) that would need to be converted to/from the format above.
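For the wide-to-long case, the reshaping step is a standard tidyr pivot; the column names below are illustrative, not the actual syntheval output:

```r
library(tidyr)

# Illustrative wide output: one column per statistic (mean, mean_diff)
wide_output <- tibble::tibble(
  variable  = c("x1", "x2"),
  mean      = c(0.1, 0.2),
  mean_diff = c(0.0, -0.1)
)

# Pivot to the long layout used in the example above: one row per
# (variable, statistic) pair
long_output <- pivot_longer(
  wide_output,
  cols      = c(mean, mean_diff),
  names_to  = "statistic",
  values_to = "synthetic"
)
```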
@awunderground (Contributor)

How does option 1 make parallelization easier and avoid memory issues?

I worry about doubling the number of functions. What if we:

  1. Create _backend functions like util_moments_backend() that we don't export.
  2. Simplify the existing functions like util_moments() to include methods for individual syntheses and multiple replicates that call the _backend functions.

Alternatively, many of the functions can probably use the same code for iterating the metrics and then reducing the results. We could probably create common functions (iterate_tabular_metrics(), reduce_tabular_metrics()) for these actions and then add them into the existing metrics.
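A rough sketch of what those shared helpers could look like (the function names come from the suggestion above, but the bodies are purely illustrative, not syntheval code):

```r
# Sketch only: iterate a single-replicate metric over replicates...
iterate_tabular_metrics <- function(replicates, metric_fn, ...) {
  lapply(replicates, metric_fn, ...)
}

# ...then stack and summarize the results per group. group_cols and
# value_col are hypothetical parameters for reuse across metrics.
reduce_tabular_metrics <- function(results, group_cols,
                                   value_col = "synthetic") {
  dplyr::bind_rows(results, .id = "replicate") %>%
    dplyr::group_by(dplyr::across(dplyr::all_of(group_cols))) %>%
    dplyr::summarize(
      synth_min = min(.data[[value_col]]),
      synth_med = stats::median(.data[[value_col]]),
      synth_max = max(.data[[value_col]]),
      .groups = "drop"
    )
}
```

Because every tabular metric would share the same iterate/reduce pair, only the metric-specific grouping columns would differ between call sites.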

Elyse worked on grouping for util_corr_fit(), which may help with syntax for non-tabular format.

@jhseeman (Collaborator, Author)

> I worry about doubling the number of functions. What if we:
>
> 1. Create `_backend` functions like `util_moments_backend()` that we don't export.
> 2. Simplify the existing functions like `util_moments()` to include methods for individual syntheses and multiple replicates that call the `_backend` functions.

I like this idea - I guess we're still doubling the total number of functions, but the public functions stay the same!

> Alternatively, many of the functions can probably use the same code for iterating the metrics and then reducing the results. We could probably create common functions (`iterate_tabular_metrics()`, `reduce_tabular_metrics()`) for these actions and then add them into the existing metrics.

Yep, definitely some shared utilities here.

> Elyse worked on grouping for `util_corr_fit()`, which may help with syntax for non-tabular format.

Will take a look - are there plans to incorporate similar group-by logic in other places? Also, I'm not sure about the current state of / plans for that PR.

@awunderground (Contributor)

I think that PR is close but it would need to be resurrected, which is something I haven't considered.

@awunderground (Contributor)

Be careful: tibbles are lists and could lead to some confusion in code you write.

I think we should create a multipostsynth class in tidysynthesis.
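A minimal S3 sketch of what such a class could look like (purely hypothetical; tidysynthesis may define the constructor and fields differently):

```r
# Hypothetical constructor: wrap a list of postsynth objects in a distinct
# class so S3 dispatch can't confuse it with a bare list / tibble
multipostsynth <- function(replicates) {
  stopifnot(is.list(replicates), length(replicates) > 0)
  structure(list(replicates = replicates), class = "multipostsynth")
}

is_multipostsynth <- function(x) inherits(x, "multipostsynth")
```

Since `inherits()` checks the class attribute rather than the underlying type, this sidesteps the tibbles-are-lists ambiguity noted above.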
