
General multiple replicate support for pointwise statistic distributions #86

Open · 8 tasks

jhseeman opened this issue Jul 11, 2024 · 4 comments

@jhseeman (Collaborator)

Background

syntheval currently uses one replicate for each evaluation, which obscures the critical effect of randomness in assessing synthetic data disclosure risk and utility. This issue would update syntheval to work with multiple replicates, enabling empirical assessment of this randomness independent of what form it takes. Here, we focus on updating existing metrics for collections of pointwise statistics, although working with multiple replicates introduces new possibilities for other metrics.

Design changes

Currently, functions in syntheval accept either a postsynth object or a tibble / data.frame. There are two approaches we could take here:

  1. Create new functions using the _multirep suffix (ex: util_ci_overlap_multirep()) that explicitly handle the multiple-replicate logic.
  2. Modify the existing functions to additionally accept list[postsynth] or list[tibble] / list[data.frame].

I'm personally in favor of option 1 for the following reasons:

  • Option 1 allows for easier parallelization and avoids potential memory issues from recursion in Option 2.
  • Option 2 could produce long functions that aren't modular, especially if the logic for multiple replicates differs significantly from single replicates.

Open to suggestions / feedback here!
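As a rough sketch of what option 1 could look like (the wrapper name, `util_ci_overlap()` signature, and `replicates` argument are all hypothetical here, not the actual syntheval API):

```r
# Hypothetical shape of a _multirep wrapper (option 1). The single-replicate
# metric util_ci_overlap() is assumed to exist with a (synth, data, ...) form.
util_ci_overlap_multirep <- function(replicates, data, ...) {
  # apply the single-replicate metric independently to each replicate;
  # swapping lapply() for a parallel map (e.g. furrr/parallel) is then trivial
  results <- lapply(replicates, util_ci_overlap, data = data, ...)

  # stack per-replicate results with an identifying replicate index
  dplyr::bind_rows(results, .id = "replicate")
}
```

Keeping the per-replicate loop in a thin wrapper like this is what makes the parallelization claim above concrete: each replicate's evaluation is an independent call, so no shared state or recursion is involved.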

Pointwise statistic distributions

The following methods admit straightforward multiple-replicate analogues: each produces a distribution over a collection of pointwise statistics:

  • util_ci_overlap.R
  • util_co_occurrence.R
  • util_ks_distance.R
  • util_moments.R
  • util_percentiles.R
  • util_proportions.R
  • util_tails.R
  • util_totals.R

For each pointwise statistic (eventually a row) in the one-replicate case, we replace it with distributional summary statistics in the multiple-replicate case. Here's an example for util_moments() output:

# A tibble: ? × 8
  variable statistic original synth_min synth_q1 synth_med synth_q3 synth_max
  <fct>    <fct>        <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
1 x1       mean           0.1      -0.5     -0.3       0.1      0.4       1.2
2 x1       mean_diff      0.0      -0.6     -0.2       0.0      0.3       1.1
# etc ...

We can also include an optional argument (akin to simplify = FALSE) that simply returns the evaluation metric applied to each replicate.
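The reduction from per-replicate tables to the distributional summary above might look something like this (a sketch only, assuming each replicate's output is a tibble with variable, statistic, original, and synthetic columns; these names are illustrative):

```r
library(dplyr)

# Hypothetical reduction step: given a list with one util_moments()-style
# tibble per replicate, collapse the synthetic values into the
# distributional summary columns shown above.
summarize_replicates <- function(per_replicate_results) {
  bind_rows(per_replicate_results, .id = "replicate") %>%
    group_by(variable, statistic, original) %>%
    summarize(
      synth_min = min(synthetic),
      synth_q1  = quantile(synthetic, 0.25),
      synth_med = median(synthetic),
      synth_q3  = quantile(synthetic, 0.75),
      synth_max = max(synthetic),
      .groups = "drop"
    )
}
```

The simplify = FALSE behavior would then just skip this reduction and return the stacked (or raw list of) per-replicate tibbles.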

Metric-specific considerations:

  • Wide vs. long format: some outputs are currently in a wider format (ex: statistic names listed as columns instead of rows, like the mean differences above) and would need to pivot to a longer format.
  • Non-tabular format: some outputs are currently in a non-tabular format (ex: correlation matrices) that would need to be converted to/from the format above.
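For the wide-to-long case, the reshaping step is a standard tidyr pivot; the column names below are illustrative, not the actual syntheval output:

```r
library(tidyr)

# Illustrative wide output: one column per statistic (mean, mean_diff)
wide_output <- tibble::tibble(
  variable  = c("x1", "x2"),
  mean      = c(0.1, 0.2),
  mean_diff = c(0.0, -0.1)
)

# Pivot to the long layout used in the example above: one row per
# (variable, statistic) pair
long_output <- pivot_longer(
  wide_output,
  cols      = c(mean, mean_diff),
  names_to  = "statistic",
  values_to = "synthetic"
)
```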
@awunderground (Contributor)

How does option 1 make parallelization easier and avoid memory issues?

I worry about doubling the number of functions. What if we:

  1. Create _backend functions like util_moments_backend() that we don't export.
  2. Simplify the existing functions like util_moments() to include methods for individual syntheses and multiple replicates that call the _backend functions.

Alternatively, many of the functions can probably use the same code for iterating the metrics and then reducing the results. We could probably create common functions (iterate_tabular_metrics(), reduce_tabular_metrics()) for these actions and then add them into the existing metrics.
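A rough sketch of what those shared helpers could look like (the function names come from the suggestion above, but the bodies are purely illustrative, not syntheval code):

```r
# Sketch only: iterate a single-replicate metric over replicates...
iterate_tabular_metrics <- function(replicates, metric_fn, ...) {
  lapply(replicates, metric_fn, ...)
}

# ...then stack and summarize the results per group. group_cols and
# value_col are hypothetical parameters for reuse across metrics.
reduce_tabular_metrics <- function(results, group_cols,
                                   value_col = "synthetic") {
  dplyr::bind_rows(results, .id = "replicate") %>%
    dplyr::group_by(dplyr::across(dplyr::all_of(group_cols))) %>%
    dplyr::summarize(
      synth_min = min(.data[[value_col]]),
      synth_med = stats::median(.data[[value_col]]),
      synth_max = max(.data[[value_col]]),
      .groups = "drop"
    )
}
```

Because every tabular metric would share the same iterate/reduce pair, only the metric-specific grouping columns would differ between call sites.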

Elyse worked on grouping for util_corr_fit(), which may help with syntax for non-tabular format.

@jhseeman (Collaborator, Author)

> I worry about doubling the number of functions. What if we:
>
> 1. Create `_backend` functions like `util_moments_backend()` that we don't export.
> 2. Simplify the existing functions like `util_moments()` to include methods for individual syntheses and multiple replicates that call the `_backend` functions.

I like this idea - I guess we're still doubling the total number of functions, but the public functions stay the same!

> Alternatively, many of the functions can probably use the same code for iterating the metrics and then reducing the results. We could probably create common functions (`iterate_tabular_metrics()`, `reduce_tabular_metrics()`) for these actions and then add them into the existing metrics.

Yep, definitely some shared utilities here.

> Elyse worked on grouping for `util_corr_fit()`, which may help with syntax for non-tabular format.

Will take a look - are there plans to incorporate similar group-by logic in other places? Also, I'm not sure about the current state of / plans for that PR.

@awunderground (Contributor)

I think that PR is close but it would need to be resurrected, which is something I haven't considered.

@awunderground (Contributor)

Be careful: tibbles are lists and could lead to some confusion in code you write.

I think we should create a multipostsynth class in tidysynthesis.
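A minimal S3 sketch of what such a class could look like (purely hypothetical; tidysynthesis may define the constructor and fields differently):

```r
# Hypothetical constructor: wrap a list of postsynth objects in a distinct
# class so S3 dispatch can't confuse it with a bare list / tibble
multipostsynth <- function(replicates) {
  stopifnot(is.list(replicates), length(replicates) > 0)
  structure(list(replicates = replicates), class = "multipostsynth")
}

is_multipostsynth <- function(x) inherits(x, "multipostsynth")
```

Since `inherits()` checks the class attribute rather than the underlying type, this sidesteps the tibbles-are-lists ambiguity noted above.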
