Add sample code for clustered and nonclustered bootstrapping #168
base: main
Conversation
jmgirard commented on Apr 20, 2021 (edited)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #168       +/-   ##
===========================================
+ Coverage   70.00%   86.68%   +16.67%
===========================================
  Files          44       39        -5
  Lines        2054     1367      -687
===========================================
- Hits         1438     1185      -253
+ Misses        616      182      -434
Cool! For consistency, could we name the
I was having trouble finding the base R versions of nest() and unnest(), but I'll do a bit more searching. And yes, we can change the names of whatever you want.
Your primary aim is not to nest the data but rather to split it at factor levels, I think. If so, you could use: x <- split(bs_data, bs_data$cluster)
Here's a (way faster) base R solution.
library(tictoc)
tic("tidyr")
R <- 1000
method <- "pearson"  # correlation method used throughout the comparison
bs_data <- iris[, c("Sepal.Length", "Petal.Length", "Species")]
bs_data <- tidyr::nest(bs_data, data = -c(Species))
## Cluster Bootstrap
clusterboot_tidyr <- function(bs_data, method, R, ...) {
  boot::boot(
    data = bs_data,
    statistic = clusterboot_stat_tidyr,
    R = R,
    method = method,  # forwarded to the statistic through boot's ...
    ...
  )
}
clusterboot_stat_tidyr <- function(data, index, method) {
  resample <- data[index, ]                         # resample whole clusters (rows of the nested data)
  resample <- tidyr::unnest(resample, cols = data)  # expand back to one row per observation
  resample <- resample[-1]                          # drop the Species (cluster) column
  stats::cor(resample, method = method)
}
set.seed(123)
bs_results_tidyr <- clusterboot_tidyr(
  bs_data = bs_data,
  method = method,
  R = R
)
bs_results_tidyr$t0[2]
#> [1] 0.8717538
toc()
#> tidyr: 3.92 sec elapsed
tic("split")
R <- 1000
bs_data <- iris[, c("Sepal.Length", "Petal.Length", "Species")]
bs_data <- split(bs_data, bs_data$Species)
## Cluster Bootstrap
clusterboot_split <- function(bs_data, method, R, ...) {
  boot::boot(
    data = bs_data,
    statistic = clusterboot_stat_split,
    R = R,
    method = method,  # forwarded to the statistic through boot's ...
    ...
  )
}
clusterboot_stat_split <- function(data, index, method) {
  resample <- do.call(rbind, data[index])       # resample whole clusters (list elements) and stack them
  stats::cor(resample[, 1:2], method = method)  # correlate the two numeric columns
}
set.seed(123)
bs_results_split <- clusterboot_split(
  bs_data = bs_data,
  method = method,
  R = R
)
bs_results_split$t0[2]
#> [1] 0.8717538
toc()
#> split: 0.55 sec elapsed
Created on 2021-04-20 by the reprex package (v2.0.0)
The idea is to bootstrap (sample with replacement, at the same size) entire clusters, including all the observations within each. So it is not really a bootstrap within clusters but across them. Then you calculate the correlation for each resample.
Field, C. A., & Welsh, A. H. (2007). Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B, 69(3), 369–390. https://doi.org/10/cqwx5p
Ren, S., Lai, H., Tong, W., Aminzadeh, M., Hou, X., & Lai, S. (2010). Nonparametric bootstrapping for hierarchical data. Journal of Applied Statistics, 37(9), 1487–1498. https://doi.org/10/dvfzcn
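As a minimal sketch of that idea (assuming the list-of-data-frames layout produced by split() above; this is illustration, not code from this PR), a single cluster resample looks like this:
clusters <- split(iris[, c("Sepal.Length", "Petal.Length")], iris$Species)
set.seed(123)
idx <- sample(seq_along(clusters), replace = TRUE)  # draw whole clusters, with replacement
resample <- do.call(rbind, clusters[idx])           # keep every observation inside each drawn cluster
cor(resample$Sepal.Length, resample$Petal.Length)   # the statistic for this single resample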
Yes, resampling clusters tends to work remarkably well according to the econometrics literature.
sd(bs_results_tidyr$t[, 2])  # bootstrap SE of the correlation, tidyr version
#> [1] 0.1203166
sd(bs_results_split$t[, 2])  # bootstrap SE of the correlation, split version
#> [1] 0.1203166
Does that mean that this approach is virtually compatible with any other correlation method? As in, one could do
Yes, exactly. I'm not sure about the Bayesian methods, but it should be compatible with any frequentist approach. We will just have to extend this to call the appropriate correlation-calculating function. Ideally, we would also calculate all requested correlations in a single bootstrapping run. This would just require pulling more than one result from the bootstrap results object.
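For example (a sketch that reuses clusterboot_split() from the reprex above and assumes its method argument is forwarded to stats::cor()), a Spearman version only needs a different method value:
set.seed(123)
bs_spearman <- clusterboot_split(
  bs_data = bs_data,    # the list of per-Species data frames created with split()
  method = "spearman",  # anything stats::cor() accepts: "pearson", "kendall", "spearman"
  R = R
)
bs_spearman$t0[2]       # clustered Spearman correlation point estimate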
Nice. API-wise, would it make sense having
I'll defer to you all on the API. But note that you will likely need more than one argument: whether to use bootstrapping, the number of bootstrap resamples (typically 1000–2000), whether to use the cluster or single bootstrap (which can be determined by whether a cluster variable is given), and optionally the type of bootstrap CI to calculate (e.g., percentile or BCa). That is, I recommend offering cluster bootstrapping as an option, but also plain single bootstrapping as an alternative to the normal approximation.
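Purely as an illustration of that argument set (every name below is invented for this sketch and is not the package's actual API):
cor_boot_sketch <- function(data, method = "pearson",
                            bootstrap = FALSE,  # FALSE = normal approximation, TRUE = bootstrap
                            R = 2000,           # number of bootstrap resamples
                            cluster = NULL,     # if supplied, resample whole clusters
                            ci_type = "perc") { # e.g., "perc" or "bca"
  # dispatch to the single or cluster bootstrap depending on whether `cluster` is given
}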
There are also ways of getting a p-value from bootstrapping that I could add if desired. But I'm not as confident in my understanding of that, so I'd want some review of the code.
BCa intervals tend to work poorly for the cluster bootstrap. Bootstrapping with a pivot like the Fisher z seems to work better with clustered data (e.g., see here): resample clusters, then compute the Fisher z-transformed correlation to construct a distribution and compute quantiles/p-values, then back-transform to a correlation. For the clustered bootstrap, we may want to not permit BCa, or at least give a message alerting users, along with defaulting to bootstrap-z. How are bootstrap statistics currently implemented with the iter parameter?
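A sketch of that recipe against the cluster-bootstrap object from the reprex above; the p-value line is one common choice (a normal approximation on the z scale using the bootstrap SE), not something this PR has committed to:
z  <- atanh(bs_results_split$t[, 2])  # Fisher z of every cluster-bootstrap replicate
z0 <- atanh(bs_results_split$t0[2])   # observed correlation on the z scale
tanh(quantile(z, c(0.025, 0.975)))    # quantile interval built on z, back-transformed to r
2 * stats::pnorm(-abs(z0 / sd(z)))    # two-sided bootstrap-z style p-value for rho = 0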
… is ignored) in correlation
…into cluster-boot
# Conflicts:
#   R/cor_boot.R
@bwiernik Can you confirm that in bootstrap-z, you calculate the bootstrap SE by (1) taking the standard deviation of the z-scores and then transforming this value back to r-scores, or (2) transforming all z-scores back to r-scores and then taking the standard deviation of the r-scores?
Construct intervals on z, then transform points back to r.
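To make the two options in the question above concrete (a sketch against bs_results_split; no claim here about which one is preferable):
z <- atanh(bs_results_split$t[, 2])  # bootstrap replicates on the Fisher z scale
tanh(sd(z))                          # option (1): SD taken on the z scale, then transformed back to r
sd(tanh(z))                          # option (2): transform each replicate back to r, then take the SD
Note that option (2) reduces to the SD of the raw r replicates.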
I got the intervals correct, but I'm still unsure about the standard error. Can you weigh in?
I've been doing more reading on this topic. I think the bootstrap SE doesn't really matter so much if our goal is to provide BCa intervals for non-clustered bootstraps and quantile intervals for clustered bootstraps. For p-values, it seems that univariate sampling may be preferable to the typical bivariate sampling, especially for smaller samples (Lee & Rodgers, 1998), although this disconnects the p-value from the intervals, which is somewhat awkward. My understanding of the wild bootstrap is that it is helpful when bootstrapping linear models but is not applicable when there aren't residuals, as in our use case of correlation coefficients. There's also the idea of iterated bootstrapping (Hall et al., 1989), which seemed to improve the intervals a lot in non-clustered cases, although maybe BCa achieves a similar goal.
Hall, P., Martin, M. A., & Schucany, W. R. (1989). Better nonparametric bootstrap confidence intervals for the correlation coefficient. Journal of Statistical Computation and Simulation, 33(3), 161–172. https://doi.org/10/bx23n5
Lee, W.-C., & Rodgers, J. L. (1998). Bootstrapping correlation coefficients using univariate and bivariate sampling. Psychological Methods, 3(1), 91–103. https://doi.org/10/bh2qvj
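For reference, a sketch of how a univariate-sampling p-value could look in the non-clustered case; this is my reading of the Lee & Rodgers idea (resampling x and y independently breaks their pairing, so the resulting distribution of r approximates the null of independence), not code from this PR:
x <- iris$Sepal.Length
y <- iris$Petal.Length
r_obs <- cor(x, y)
set.seed(123)
r_null <- replicate(2000, cor(sample(x, replace = TRUE), sample(y, replace = TRUE)))
mean(abs(r_null) >= abs(r_obs))  # two-sided bootstrap p-value against independence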
What's the status of this?
What do we think about reporting standard errors when transformations are applied? If we bootstrap in the z metric and then transform the bounds of the CI back to r, should we report the SE in z or in r? The argument for z is that it is the SE used to compute the intervals and that the SE isn't as meaningful on the skewed r scale. The argument for r is that it is in the same metric as the other results. Stats packages differ in their behavior: Stata delta-method transforms the SE, some R packages (and SAS, I think) leave it untransformed, and other R packages drop it from the output entirely. What do we do when we exponentiate coefficients for a logistic regression?
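For what it's worth, the delta-method option mentioned above is a one-liner, since dr/dz = 1 - r^2 when r = tanh(z); a sketch against the cluster-bootstrap object from earlier:
z    <- atanh(bs_results_split$t[, 2])  # replicates on the z scale
se_z <- sd(z)                           # SE computed in the z metric
r0   <- bs_results_split$t0[2]          # observed correlation on the r scale
se_z * (1 - r0^2)                       # delta-method SE expressed in the r metric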
I was leaning towards omitting the SE but could be persuaded otherwise. A larger issue seems to be that bootstrapping in the package would work best if there were smaller functions that just take in a matrix (or whatever) of "ready to analyze" data and output only the point estimate. My understanding is that the package is not currently written this way.
I would like to make a consistent decision about transformed variables and SEs. The bootstrap question isn't that big a concern to me; cf. we report the SD or MAD of an MCMC distribution even though we don't use that for intervals.
Would it be too much work to allow the user to toggle between the transformed and untransformed versions? And do you agree that adding bootstrapping to the package would require, or at least work best with, a highly minimal point-estimate function for each type of correlation (and therefore putting the estimation of uncertainty in separate functions)?
I see no value in providing output in the z metric, for example.
I meant a toggle for the metric of the SE. Didn't you provide some examples of why the z-metric SE had value?
Z is never a useful metric for interpretation. Some programs output the SE of beta coefficients for logistic regression; that at least is arguably useful.
I can't really move forward without an answer from a package maintainer on this.
Yes, that would be good.
@IndrajeetPatil @DominiqueMakowski Are you able to respond to @jmgirard (see #168 (comment))?
@jmgirard Yes, a highly minimal point-estimate internal function for each type of correlation would be good.
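A sketch of what that could look like (the names .cor_estimate_* and boot_estimate below are hypothetical, not the package's current internals): each helper takes ready-to-analyze vectors and returns only the point estimate, so a single bootstrap wrapper can reuse any of them.
# hypothetical minimal point-estimate helpers, one per correlation type
.cor_estimate_pearson  <- function(x, y) stats::cor(x, y, method = "pearson")
.cor_estimate_spearman <- function(x, y) stats::cor(x, y, method = "spearman")
# a generic bootstrap wrapper that works with any such helper
boot_estimate <- function(x, y, estimator, R = 2000) {
  boot::boot(
    data = data.frame(x = x, y = y),
    statistic = function(d, i) estimator(d$x[i], d$y[i]),
    R = R
  )
}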
@bwiernik Ok, if this is approved, I can help to refactor a few of the most common correlation functions to permit bootstrapping. I would, of course, appreciate feedback on the approach.
That would be great. Let me know what sort of feedback you'd like.
@jmgirard Any update on this? Do you still wish to continue working on this? Let us know if you are waiting for something from our end that is blocking your progress.
Sorry, I just got really busy. I am still interested, but my progress may be slow.
No worries! Just wanted to make sure that we weren't the roadblock 🙃 Take your time!