Add sample code for clustered and nonclustered bootstrapping #168

Open

wants to merge 6 commits into main

Conversation

@jmgirard commented Apr 20, 2021

cor_boot(iris, "Sepal.Length", "Petal.Length")
#>    Parameter1   Parameter2         r         se    CI_low   CI_high  Method
#> 1 Sepal.Length Petal.Length 0.8717538 0.01713286 0.8292761 0.8988949 pearson
cor_boot(iris, "Sepal.Length", "Petal.Length", cluster = "Species")
#>    Parameter1   Parameter2         r        se   CI_low   CI_high  Method
#> 1 Sepal.Length Petal.Length 0.8717538 0.1158396 0.754049 0.9007654 pearson

@codecov-commenter commented Apr 20, 2021

Codecov Report

Attention: Patch coverage is 0%, with 42 lines in your changes missing coverage. Please review.

Project coverage is 86.68%. Comparing base (d641db4) to head (81c3ee4).

Current head 81c3ee4 differs from the pull request's most recent head 8615ecc

Please upload reports for the commit 8615ecc to get more accurate results.

Files          Patch %   Lines
R/cor_boot.R   0.00%     42 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #168       +/-   ##
===========================================
+ Coverage   70.00%   86.68%   +16.67%     
===========================================
  Files          44       39        -5     
  Lines        2054     1367      -687     
===========================================
- Hits         1438     1185      -253     
+ Misses        616      182      -434     


@DominiqueMakowski
Member

Cool! For consistency, could we name the se column SE? Also, can we replace the call to tidyr? (We try to keep dependencies to a minimum throughout easystats.) And feel free to add yourself as a contributor :)

@jmgirard
Author

I was having trouble finding the base R equivalent of nest() and unnest(), but I'll do a bit more searching. And yes, we can change the names to whatever you want.

@strengejacke
Member

Your primary aim is not to nest data, but rather to split at factor levels, I think. If so, you could use:

x <- split(bs_data, bs_data$cluster)


@strengejacke
Member

Here's a (way faster) base R solution.

library(tictoc)

tic("tidyr")
R <- 1000
bs_data <- iris[, c("Sepal.Length", "Petal.Length", "Species")]
bs_data <- tidyr::nest(bs_data, data = -c(Species))


## Cluster Bootstrap (tidyr version)
clusterboot_tidyr <- function(bs_data, method, R, ...) {
  boot::boot(
    data = bs_data,
    statistic = clusterboot_stat_tidyr,
    R = R,
    method = method,  # passed through to the statistic function
    ...
  )
}

clusterboot_stat_tidyr <- function(data, index, method) {
  # Resample whole clusters (rows of the nested data), then unnest and correlate
  resample <- data[index, ]
  resample <- tidyr::unnest(resample, cols = data)
  resample <- resample[-1]  # drop the Species column
  stats::cor(resample, method = method)
}


set.seed(123)
bs_results <- clusterboot_tidyr(
  bs_data = bs_data,
  method = method,
  R = R
)
bs_results$t0[2]
#> [1] 0.8717538
toc()
#> tidyr: 3.92 sec elapsed



tic("split")
R <- 1000
bs_data <- iris[, c("Sepal.Length", "Petal.Length", "Species")]
bs_data <- split(bs_data, bs_data$Species)

## Cluster Bootstrap (split version)
clusterboot_split <- function(bs_data, method, R, ...) {
  boot::boot(
    data = bs_data,
    statistic = clusterboot_stat_split,
    R = R,
    method = method,  # passed through to the statistic function
    ...
  )
}

clusterboot_stat_split <- function(data, index, method) {
  # Resample whole clusters (list elements), stack them, and correlate
  resample <- do.call(rbind, data[index])
  stats::cor(resample[, 1:2], method = method)
}

set.seed(123)
bs_results <- clusterboot_split(
  bs_data = bs_data,
  method = method,
  R = R
)
bs_results$t0[2]
#> [1] 0.8717538
toc()
#> split: 0.55 sec elapsed

Created on 2021-04-20 by the reprex package (v2.0.0)

@jmgirard
Author

jmgirard commented Apr 20, 2021

I'm not sure what clustered bootstrap correlations should do: bootstrap within clusters, and then correlate the complete data? What if we have unequal cluster sizes?

The idea is to bootstrap entire clusters (sample clusters with replacement, drawing the same number of clusters as observed), including all the observations within each. So it's not really a bootstrap within clusters but across them. Then you calculate the correlation for each resample.
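
For illustration, a minimal base R sketch of a single such resample (a sketch only, not the PR's implementation; the cluster variable here is just iris$Species):

dat <- iris[, c("Sepal.Length", "Petal.Length", "Species")]
cl_list <- split(dat, dat$Species)

set.seed(1)
# Sample whole clusters with replacement (same number of clusters as observed),
# keeping every observation inside each sampled cluster. Because cluster sizes
# can differ, resamples can have different numbers of rows.
ids <- sample(seq_along(cl_list), size = length(cl_list), replace = TRUE)
resample <- do.call(rbind, cl_list[ids])

# Correlation for this one resample; repeat R times to build the bootstrap distribution.
stats::cor(resample$Sepal.Length, resample$Petal.Length)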

Field, C. A., & Welsh, A. H. (2007). Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B, 69(3), 369–390. https://doi.org/10/cqwx5p

Ren, S., Lai, H., Tong, W., Aminzadeh, M., Hou, X., & Lai, S. (2010). Nonparametric bootstrapping for hierarchical data. Journal of Applied Statistics, 37(9), 1487–1498. https://doi.org/10/dvfzcn

@jmgirard closed this Apr 20, 2021
@jmgirard reopened this Apr 20, 2021
@bwiernik
Contributor

Yes, resampling clusters tends to work remarkably well according to the econometrics literature.

@jmgirard
Author

Using do.call(rbind, data[index]) on the split data was clever. And I confirmed that the bootstrap SE is the same for both versions:

# bootstrap objects from the tidyr and split versions above, saved under separate names
sd(bs_results_tidyr$t[, 2])
#> [1] 0.1203166
sd(bs_results_split$t[, 2])
#> [1] 0.1203166

@DominiqueMakowski
Member

Does that mean that this approach is compatible with virtually any other correlation method? As in, one could do correlation(iris, boot_cluster = "Species", method = "gamma")?

@jmgirard
Author

jmgirard commented Apr 20, 2021

Does that mean that this approach is compatible with virtually any other correlation method? As in, one could do correlation(iris, boot_cluster = "Species", method = "gamma")?

Yes, exactly. I'm not sure about the Bayesian methods, but it should be compatible with any frequentist approach. We will just have to extend this to call the appropriate correlation-calculating function. Ideally, we would also calculate all requested correlations in a single bootstrapping run; this would just require pulling more than one result from bs_results.
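
Not the PR's code, but a rough sketch of pulling more than one result from a single bootstrap run: the statistic passed to boot::boot() can return a vector (here the three pairwise correlations among three iris columns), so one set of cluster resamples yields a bootstrap distribution for every coefficient.

bs_data <- split(
  iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Species")],
  iris$Species
)

multi_cor_stat <- function(data, index, method = "pearson") {
  resample <- do.call(rbind, data[index])  # stack the resampled clusters
  cors <- stats::cor(resample[, 1:3], method = method)
  cors[lower.tri(cors)]                    # one entry per variable pair
}

set.seed(123)
bs_results <- boot::boot(data = bs_data, statistic = multi_cor_stat, R = 500)
bs_results$t0                      # point estimates for all pairs
apply(bs_results$t, 2, stats::sd)  # one bootstrap SE per pair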

@DominiqueMakowski
Member

DominiqueMakowski commented Apr 20, 2021

Nice. API-wise, would it make sense to have .cor_boot() or .boot_correlation() as an internal function, which would be run by cor_test() if the argument boot_cluster is not NULL or FALSE? Essentially, we would only add a single argument (boot_cluster = FALSE) to cor_test() (and the upstream correlation()) to toggle the bootstrapping and specify which column defines the clusters.

@jmgirard
Author

jmgirard commented Apr 20, 2021

I'll defer to you all on the API. But note that you will likely need more than one argument: whether to use bootstrapping, the number of bootstrap resamples (typically 1000-2000), whether to use the cluster or single bootstrap (which can be determined by whether a cluster variable is given), and optionally the type of bootstrap CI to calculate (e.g., percentile or BCa). That is, I recommend offering not only cluster bootstrapping but also plain single bootstrapping as an alternative to the normal approximation.
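
To make that argument list concrete, a purely hypothetical sketch (placeholder names, not the package's actual API):

cor_boot_api_sketch <- function(data, x, y,
                                method = "pearson",
                                bootstrap = FALSE,          # toggle bootstrapping at all
                                iterations = 1000,          # resamples, typically 1000-2000
                                cluster = NULL,             # cluster variable given => cluster bootstrap
                                ci_type = c("perc", "bca")) {
  ci_type <- match.arg(ci_type)
  # ... dispatch to normal-approximation, single-bootstrap, or cluster-bootstrap code
  invisible(NULL)
}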

@jmgirard
Author

There are also ways of getting a p-value from bootstrapping that I could add if desired. But I'm not as confident in my understanding of that, so I'd want some review of the code.

@bwiernik
Contributor

BCa intervals tend to work poorly for the cluster bootstrap. Bootstrapping with a pivot like the Fisher z seems to work better with clustered data: e.g., resample clusters, compute the Fisher z-transformed correlation for each resample to construct a distribution, compute quantiles/p-values, then back-transform to the correlation metric. For the clustered bootstrap, we may want to not permit BCa, or at least give a message alerting users, along with defaulting to bootstrap-z.

How are bootstrap statistics currently implemented with the iter parameter?
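
As a rough illustration of the bootstrap-z recipe described above (a sketch only, not the PR's code; clusters are again iris$Species):

cl_list <- split(iris[, c("Sepal.Length", "Petal.Length")], iris$Species)

boot_fisher_z <- function(cl_list, R = 2000) {
  replicate(R, {
    ids <- sample(seq_along(cl_list), replace = TRUE)   # resample whole clusters
    resample <- do.call(rbind, cl_list[ids])
    atanh(stats::cor(resample[[1]], resample[[2]]))     # Fisher z for this resample
  })
}

set.seed(123)
z_boot <- boot_fisher_z(cl_list)
ci_z <- stats::quantile(z_boot, c(0.025, 0.975))  # percentile interval on the z scale
tanh(ci_z)                                        # back-transform the bounds to r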

@jmgirard
Author

@bwiernik Can you confirm whether, in bootstrap-z, you calculate the bootstrap SE by (1) taking the standard deviation of the z-scores and then transforming this value back to r-scores, or (2) transforming all z-scores back to r-scores and then taking the standard deviation of the r-scores?
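
For concreteness, the two options amount to something like this (assuming a vector z_boot of bootstrap Fisher z values, e.g. from a sketch like the one above):

se_option1 <- tanh(stats::sd(z_boot))   # (1) SD on the z scale, back-transform that single value
se_option2 <- stats::sd(tanh(z_boot))   # (2) back-transform each z to r first, then take the SD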

@bwiernik
Contributor

Construct intervals on z, then transform points back to r

@jmgirard
Author

jmgirard commented Apr 21, 2021

Construct intervals on z, then transform points back to r

I got the intervals correct, but I'm still unsure about the standard error. Can you weigh in?
Two other questions:

  1. Do you have a preferred method of calculating p-values from bootstrap distributions?
  2. Do you think it would be worthwhile to add some variant of the wild bootstrap here?

@jmgirard
Author

jmgirard commented Apr 28, 2021

I've been doing more reading on this topic.

I think the bootstrap SE doesn't really matter so much if our goal is to provide BCa intervals for non-clustered bootstraps and quantile intervals for clustered bootstraps.

For p-values, it seems that univariate sampling may be preferable to the typical bivariate sampling, especially for smaller samples (Lee & Rodgers, 1998), although this disconnects the p-value from the intervals, which is somewhat awkward.
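
For what it's worth, one way to read the univariate-sampling idea (my reading of Lee & Rodgers, 1998; a sketch, not the PR's code) is to resample x and y independently, so the resamples satisfy the null of no association, and to use that distribution for the p-value:

x <- iris$Sepal.Length
y <- iris$Petal.Length
r_obs <- stats::cor(x, y)

set.seed(123)
r_null <- replicate(2000, {
  # Resampling x and y separately breaks any dependence between them
  stats::cor(sample(x, replace = TRUE), sample(y, replace = TRUE))
})

# Two-sided p-value: proportion of null resamples at least as extreme as r_obs
mean(abs(r_null) >= abs(r_obs))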

My understanding of the wild bootstrap is that it is helpful when bootstrapping linear models, but it is not applicable when there are no residuals, as in our use case of correlation coefficients.

There's also the idea of iterated bootstrapping (Hall et al., 1989), which seemed to improve the intervals a lot in non-clustered cases, although maybe BCa achieves a similar goal.


Hall, P., Martin, M. A., & Schucany, W. R. (1989). Better nonparametric bootstrap confidence intervals for the correlation coefficient. Journal of Statistical Computation and Simulation, 33(3), 161–172. https://doi.org/10/bx23n5

Lee, W.-C., & Rodgers, J. L. (1998). Bootstrapping correlation coefficients using univariate and bivariate sampling. Psychological Methods, 3(1), 91–103. https://doi.org/10/bh2qvj

@DominiqueMakowski
Member

What's the status of this?

@bwiernik
Contributor

What do we think for reporting the standard errors when transformations are applied?

If we bootstrap in the z metric, then transform the bounds of the CI back to r, should we report the SE in z or r?

The argument for z is that it is the SE used to compute the intervals, and that the SE isn't as meaningful on the skewed r scale.

The argument for r is that it is the same metric as the other results.

Stats packages differ in their behavior. Stata's delta method transforms the SE. Some R packages (and SAS, I think) leave it untransformed. Other R packages drop it entirely from the output.

What do we do when we exponentiate coefficients for a logistic regression?
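
For reference, the delta-method version of "transform the SE" just scales it by the derivative of the transformation (a quick numeric sketch with made-up numbers, not package code):

z <- 1.34; se_z <- 0.12           # hypothetical Fisher z estimate and its SE
r <- tanh(z)
se_r <- (1 - r^2) * se_z          # d/dz tanh(z) = 1 - tanh(z)^2 = 1 - r^2

b <- 0.80; se_b <- 0.25           # hypothetical logistic coefficient and its SE
or <- exp(b)
se_or <- exp(b) * se_b            # d/db exp(b) = exp(b)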

@jmgirard
Author

I was leaning towards omitting the SE but could be persuaded otherwise. A larger issue seems to be that bootstrapping in the package would work best if there were smaller functions that just take in a matrix (or similar) of "ready to analyze" data and output only the point estimate. My understanding is that the package is not currently written this way.

@bwiernik
Contributor

I would like to make a consistent decision about transformed variables and SEs. The bootstrap question isn't that big a concern to me; cf. how we report the SD or MAD of an MCMC distribution even though we don't use that for intervals.

@jmgirard
Author

Would it be too much work to allow the user to toggle between transformed and untransformed versions?

Do you agree that adding bootstrapping to the package would require/be best if we made a highly minimal point estimate function for each type of correlation (and therefore put the estimation of uncertainty in separate functions)?
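
To make the proposal concrete, a sketch with made-up names (not the package's current structure): each correlation type gets a tiny estimator that returns only the point estimate, and the uncertainty machinery (analytic SE, bootstrap, cluster bootstrap) wraps around whichever estimator it is given.

.estimate_pearson  <- function(dat) stats::cor(dat[[1]], dat[[2]], method = "pearson")
.estimate_spearman <- function(dat) stats::cor(dat[[1]], dat[[2]], method = "spearman")

# Generic (non-clustered) bootstrap wrapper: works with any estimator that maps data -> number
boot_estimate <- function(dat, estimator, R = 1000) {
  replicate(R, estimator(dat[sample(nrow(dat), replace = TRUE), , drop = FALSE]))
}

set.seed(123)
dat  <- iris[, c("Sepal.Length", "Petal.Length")]
est  <- .estimate_pearson(dat)
dist <- boot_estimate(dat, .estimate_pearson)
c(estimate = est, se = stats::sd(dist))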

@bwiernik
Contributor

I see no value in providing output in the z metric for example

@jmgirard
Author

I see no value in providing output in the z metric for example

I meant a toggle for the metric of the SE. Didn't you provide some arguments for why the z-metric SE had value?

@bwiernik
Contributor

Z is never a useful metric for interpretation. Some programs output the SE of the beta coefficients for logistic regression; that at least is arguably useful.

@jmgirard
Author

jmgirard commented Jul 2, 2021

Do you agree that adding bootstrapping to the package would require/be best if we made a highly minimal point estimate function for each type of correlation (and therefore put the estimation of uncertainty in separate functions)?

I can't really move forward without an answer from a package maintainer on this.

@bwiernik
Contributor

bwiernik commented Jul 3, 2021

Yes, that would be good.

@strengejacke
Member

@IndrajeetPatil @DominiqueMakowski Are you able to respond to @jmgirard (see #168 (comment))?

@bwiernik
Contributor

@jmgirard Yes, a highly minimal point-estimate internal function for each type of correlation would be good.

@jmgirard
Author

@bwiernik Ok, if this is approved, I can help to refactor a few of the most common correlation functions to permit bootstrapping. I would, of course, appreciate feedback on the approach.

@bwiernik
Contributor

That would be great. Let me know what sort of feedback you'd like

@IndrajeetPatil
Member

@jmgirard Any update on this? Do you still wish to continue working on this?

Let us know if you are waiting for something from our end that is blocking your progress.

@jmgirard
Author

Sorry, I just got really busy. I am still interested but my progress may be slow.

@IndrajeetPatil
Member

No worries!

Just wanted to make sure that we weren't the roadblock 🙃

Take your time!

@IndrajeetPatil removed their request for review on October 24, 2024.