Add sample code for clustered and nonclustered bootstrapping #168
base: main
Conversation
jmgirard commented on Apr 20, 2021 (edited)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #168       +/-   ##
===========================================
+ Coverage   70.00%   86.68%   +16.67%
===========================================
  Files          44       39        -5
  Lines        2054     1367      -687
===========================================
- Hits         1438     1185      -253
+ Misses        616      182      -434
Cool! For consistency, could we name the
I was having trouble finding the base R versions of nest() and unnest(), but I'll do a bit more searching. And yes, we can change the names of whatever you want.
Your primary aim is not to nest the data but rather to split it at factor levels, I think. If so, you could use: x <- split(bs_data, bs_data$cluster)
Here's a (way faster) base R solution.
library(tictoc)
tic("tidyr")
R <- 1000
method <- "pearson"  # correlation method used throughout the comparison
bs_data <- iris[, c("Sepal.Length", "Petal.Length", "Species")]
bs_data <- tidyr::nest(bs_data, data = -c(Species))
## Cluster Bootstrap
clusterboot_tidyr <- function(bs_data, method, R, ...) {
  boot::boot(
    data = bs_data,
    statistic = clusterboot_stat_tidyr,
    R = R,
    method = method,  # forwarded to the statistic through boot's ...
    ...
  )
}
clusterboot_stat_tidyr <- function(data, index, method) {
  resample <- data[index, ]                         # resample whole clusters (rows of the nested data)
  resample <- tidyr::unnest(resample, cols = data)  # expand back to one row per observation
  resample <- resample[-1]                          # drop the Species (cluster) column
  stats::cor(resample, method = method)
}
set.seed(123)
bs_results_tidyr <- clusterboot_tidyr(
  bs_data = bs_data,
  method = method,
  R = R
)
bs_results_tidyr$t0[2]
#> [1] 0.8717538
toc()
#> tidyr: 3.92 sec elapsed
tic("split")
R <- 1000
bs_data <- iris[, c("Sepal.Length", "Petal.Length", "Species")]
bs_data <- split(bs_data, bs_data$Species)
## Cluster Bootstrap
clusterboot_split <- function(bs_data, method, R, ...) {
  boot::boot(
    data = bs_data,
    statistic = clusterboot_stat_split,
    R = R,
    method = method,  # forwarded to the statistic through boot's ...
    ...
  )
}
clusterboot_stat_split <- function(data, index, method) {
  resample <- do.call(rbind, data[index])       # resample whole clusters (list elements) and stack them
  stats::cor(resample[, 1:2], method = method)  # correlate the two numeric columns
}
set.seed(123)
bs_results_split <- clusterboot_split(
  bs_data = bs_data,
  method = method,
  R = R
)
bs_results_split$t0[2]
#> [1] 0.8717538
toc()
#> split: 0.55 sec elapsed
Created on 2021-04-20 by the reprex package (v2.0.0)
The idea is to bootstrap (sample with replacement, at the same size) entire clusters, including all the observations within each. So it is not really a bootstrap within clusters but across them. Then you calculate the correlation for each resample.
Field, C. A., & Welsh, A. H. (2007). Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B, 69(3), 369–390. https://doi.org/10/cqwx5p
Ren, S., Lai, H., Tong, W., Aminzadeh, M., Hou, X., & Lai, S. (2010). Nonparametric bootstrapping for hierarchical data. Journal of Applied Statistics, 37(9), 1487–1498. https://doi.org/10/dvfzcn
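As a minimal sketch of that idea (assuming the list-of-data-frames layout produced by split() above; this is illustration, not code from this PR), a single cluster resample looks like this:
clusters <- split(iris[, c("Sepal.Length", "Petal.Length")], iris$Species)
set.seed(123)
idx <- sample(seq_along(clusters), replace = TRUE)  # draw whole clusters, with replacement
resample <- do.call(rbind, clusters[idx])           # keep every observation inside each drawn cluster
cor(resample$Sepal.Length, resample$Petal.Length)   # the statistic for this single resample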
Yes, resampling clusters tends to work remarkably well according to the econometrics literature.
sd(bs_results_tidyr$t[, 2])  # bootstrap SE of the correlation, tidyr version
#> [1] 0.1203166
sd(bs_results_split$t[, 2])  # bootstrap SE of the correlation, split version
#> [1] 0.1203166
Does that mean that this approach is virtually compatible with any other correlation method? As in, one could do
Yes, exactly. I'm not sure about the Bayesian methods, but it should be compatible with any frequentist approach. We will just have to extend this to call the appropriate correlation-calculating function. Ideally, we would also calculate all requested correlations in a single bootstrapping run. This would just require pulling more than one result from the bootstrap results object.
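For example (a sketch that reuses clusterboot_split() from the reprex above and assumes its method argument is forwarded to stats::cor()), a Spearman version only needs a different method value:
set.seed(123)
bs_spearman <- clusterboot_split(
  bs_data = bs_data,    # the list of per-Species data frames created with split()
  method = "spearman",  # anything stats::cor() accepts: "pearson", "kendall", "spearman"
  R = R
)
bs_spearman$t0[2]       # clustered Spearman correlation point estimate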
Nice. API-wise, would it make sense having
I'll defer to you all on the API. But note that you will likely need more than one argument: whether to use bootstrapping, the number of bootstrap resamples (typically 1000–2000), whether to use the cluster or single bootstrap (which can be determined by whether a cluster variable is given), and optionally the type of bootstrap CI to calculate (e.g., percentile or BCa). That is, I recommend offering cluster bootstrapping as an option, but also plain single bootstrapping as an alternative to the normal approximation.
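Purely as an illustration of that argument set (every name below is invented for this sketch and is not the package's actual API):
cor_boot_sketch <- function(data, method = "pearson",
                            bootstrap = FALSE,  # FALSE = normal approximation, TRUE = bootstrap
                            R = 2000,           # number of bootstrap resamples
                            cluster = NULL,     # if supplied, resample whole clusters
                            ci_type = "perc") { # e.g., "perc" or "bca"
  # dispatch to the single or cluster bootstrap depending on whether `cluster` is given
}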
There are also ways of getting a p-value from bootstrapping that I could add if desired. But I'm not as confident in my understanding of that, so I'd want some review of the code.
BCa intervals tend to work poorly for the cluster bootstrap. Bootstrapping with a pivot like the Fisher z seems to work better with clustered data (e.g., see here): resample clusters, then compute the Fisher z-transformed correlation to construct a distribution and compute quantiles/p-values, then back-transform to a correlation. For the clustered bootstrap, we may want to not permit BCa, or at least give a message alerting users, along with defaulting to bootstrap-z. How are bootstrap statistics currently implemented with the iter parameter?
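A sketch of that recipe against the cluster-bootstrap object from the reprex above; the p-value line is one common choice (a normal approximation on the z scale using the bootstrap SE), not something this PR has committed to:
z  <- atanh(bs_results_split$t[, 2])  # Fisher z of every cluster-bootstrap replicate
z0 <- atanh(bs_results_split$t0[2])   # observed correlation on the z scale
tanh(quantile(z, c(0.025, 0.975)))    # quantile interval built on z, back-transformed to r
2 * stats::pnorm(-abs(z0 / sd(z)))    # two-sided bootstrap-z style p-value for rho = 0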
… is ignored) in correlation
…into cluster-boot
# Conflicts:
#   R/cor_boot.R
@bwiernik Can you confirm that in bootstrap-z, you calculate the bootstrap SE by (1) taking the standard deviation of the z-scores and then transforming this value back to r-scores, or (2) transforming all z-scores back to r-scores and then taking the standard deviation of the r-scores?
Construct intervals on z, then transform points back to r.
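To make the two options in the question above concrete (a sketch against bs_results_split; no claim here about which one is preferable):
z <- atanh(bs_results_split$t[, 2])  # bootstrap replicates on the Fisher z scale
tanh(sd(z))                          # option (1): SD taken on the z scale, then transformed back to r
sd(tanh(z))                          # option (2): transform each replicate back to r, then take the SD
Note that option (2) reduces to the SD of the raw r replicates.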
I got the intervals correct, but I'm still unsure about the standard error. Can you weigh in?
I've been doing more reading on this topic. I think the bootstrap SE doesn't really matter so much if our goal is to provide BCa intervals for non-clustered bootstraps and quantile intervals for clustered bootstraps. For p-values, it seems that univariate sampling may be preferable to the typical bivariate sampling, especially for smaller samples (Lee & Rodgers, 1998), although this disconnects the p-value from the intervals, which is somewhat awkward. My understanding of the wild bootstrap is that it is helpful when bootstrapping linear models but is not applicable when there aren't residuals, as in our use case of correlation coefficients. There's also the idea of iterated bootstrapping (Hall et al., 1989), which seemed to improve the intervals a lot in non-clustered cases, although maybe BCa achieves a similar goal.
Hall, P., Martin, M. A., & Schucany, W. R. (1989). Better nonparametric bootstrap confidence intervals for the correlation coefficient. Journal of Statistical Computation and Simulation, 33(3), 161–172. https://doi.org/10/bx23n5
Lee, W.-C., & Rodgers, J. L. (1998). Bootstrapping correlation coefficients using univariate and bivariate sampling. Psychological Methods, 3(1), 91–103. https://doi.org/10/bh2qvj
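For reference, a sketch of how a univariate-sampling p-value could look in the non-clustered case; this is my reading of the Lee & Rodgers idea (resampling x and y independently breaks their pairing, so the resulting distribution of r approximates the null of independence), not code from this PR:
x <- iris$Sepal.Length
y <- iris$Petal.Length
r_obs <- cor(x, y)
set.seed(123)
r_null <- replicate(2000, cor(sample(x, replace = TRUE), sample(y, replace = TRUE)))
mean(abs(r_null) >= abs(r_obs))  # two-sided bootstrap p-value against independence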
What's the status of this?
What do we think about reporting standard errors when transformations are applied? If we bootstrap in the z metric and then transform the bounds of the CI back to r, should we report the SE in z or in r? The argument for z is that it is the SE used to compute the intervals and that the SE isn't as meaningful on the skewed r scale. The argument for r is that it is in the same metric as the other results. Stats packages differ in their behavior: Stata delta-method transforms the SE, some R packages (and SAS, I think) leave it untransformed, and other R packages drop it from the output entirely. What do we do when we exponentiate coefficients for a logistic regression?
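For what it's worth, the delta-method option mentioned above is a one-liner, since dr/dz = 1 - r^2 when r = tanh(z); a sketch against the cluster-bootstrap object from earlier:
z    <- atanh(bs_results_split$t[, 2])  # replicates on the z scale
se_z <- sd(z)                           # SE computed in the z metric
r0   <- bs_results_split$t0[2]          # observed correlation on the r scale
se_z * (1 - r0^2)                       # delta-method SE expressed in the r metric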
I was leaning towards omitting the SE but could be persuaded otherwise. A larger issue seems to be that bootstrapping in the package would work best if there were smaller functions that just take in a matrix (or whatever) of "ready to analyze" data and output only the point estimate. My understanding is that the package is not currently written this way.
I would like to make a consistent decision about transformed variables and SEs. The bootstrap question isn't that big a concern to me; cf. we report the SD or MAD of an MCMC distribution even though we don't use that for intervals.
Would it be too much work to allow the user to toggle between the transformed and untransformed versions? And do you agree that adding bootstrapping to the package would require, or at least work best with, a highly minimal point-estimate function for each type of correlation (and therefore putting the estimation of uncertainty in separate functions)?
I see no value in providing output in the z metric, for example.
I meant a toggle for the metric of the SE. Didn't you provide some examples of why the z-metric SE had value?
Z is never a useful metric for interpretation. Some programs output the SE of beta coefficients for logistic regression; that at least is arguably useful.
I can't really move forward without an answer from a package maintainer on this.
Yes, that would be good.
@IndrajeetPatil @DominiqueMakowski Are you able to respond to @jmgirard (see #168 (comment))?
@jmgirard Yes, a highly minimal point-estimate internal function for each type of correlation would be good.
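A sketch of what that could look like (the names .cor_estimate_* and boot_estimate below are hypothetical, not the package's current internals): each helper takes ready-to-analyze vectors and returns only the point estimate, so a single bootstrap wrapper can reuse any of them.
# hypothetical minimal point-estimate helpers, one per correlation type
.cor_estimate_pearson  <- function(x, y) stats::cor(x, y, method = "pearson")
.cor_estimate_spearman <- function(x, y) stats::cor(x, y, method = "spearman")
# a generic bootstrap wrapper that works with any such helper
boot_estimate <- function(x, y, estimator, R = 2000) {
  boot::boot(
    data = data.frame(x = x, y = y),
    statistic = function(d, i) estimator(d$x[i], d$y[i]),
    R = R
  )
}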
@bwiernik Ok, if this is approved, I can help to refactor a few of the most common correlation functions to permit bootstrapping. I would, of course, appreciate feedback on the approach.
That would be great. Let me know what sort of feedback you'd like.
@jmgirard Any update on this? Do you still wish to continue working on this? Let us know if you are waiting for something from our end that is blocking your progress.
Sorry, I just got really busy. I am still interested, but my progress may be slow.
No worries! Just wanted to make sure that we weren't the roadblock 🙃 Take your time!