Description of the bug
I am revisiting the analysis of Athey and Wager (2019). I am interested in running a falsification analysis where the causal forests are trained not on the student and school covariates but on randomly generated vectors. My prior is that the heterogeneity tests should fail to reject the null of no heterogeneity. However, when comparing subsets with high and low estimated CATEs, the estimated average treatment effect on the high subset is close to zero while the estimated average treatment effect on the low subset is orders of magnitude larger. I can't find an explanation for this behavior. Is it a bug?
The other tests seem fine: the global ATE is close to the original results, and the calibration test fails to reject the null of no heterogeneity.
Steps to reproduce
library(grf)
# Load the synthetic data and replace the student/school covariates
# with randomly generated ones for the falsification check.
df = read.csv("experiments/acic18/synthetic_data.csv")
X = matrix(runif(n = nrow(df) * 10), nrow = nrow(df))
colnames(X) = c("RF1", "RF2", "RF3", "RF4", "RF5", "RF6", "RF7", "RF8", "RF9", "RF0")
Z = df$Z
Y = df$Y

# Orthogonalization: estimate Y.hat and Z.hat with regression forests.
Y.forest = regression_forest(X, Y)
Y.hat = predict(Y.forest)$predictions
Z.forest = regression_forest(X, Z)
Z.hat = predict(Z.forest)$predictions

# First-pass causal forest, used only for variable selection.
cf.raw = causal_forest(X, Y, Z, Y.hat = Y.hat, W.hat = Z.hat)
varimp = variable_importance(cf.raw)
selected.idx = which(varimp > mean(varimp))

# Final causal forest on the selected covariates.
cf = causal_forest(
  X[, selected.idx],
  Y,
  Z,
  Y.hat = Y.hat,
  W.hat = Z.hat,
  tune.parameters = "all"
)

tau.df = predict(cf, estimate.variance = TRUE)[, c(1, 2)]
tau.hat = tau.df$predictions

# Distribution of predicted effects
hist(tau.hat)

# Average treatment effect
ATE = average_treatment_effect(cf)
paste(
  "95% CI for the ATE:",
  round(ATE[1], 3),
  "+/-",
  round(qnorm(0.975) * ATE[2], 3)
)
Outputs: '95% CI for the ATE: 0.303 +/- 0.026'
# Compare regions with high and low estimated CATE
high_effect = tau.hat > median(tau.hat)
ate.high = average_treatment_effect(cf, subset = high_effect)
ate.low = average_treatment_effect(cf, subset = !high_effect)
paste(
  "95% CI for the difference in ATE:",
  round(ate.high[1] - ate.low[1], 3),
  "+/-",
  round(qnorm(0.975) * sqrt(ate.high[2]^2 + ate.low[2]^2), 3)
)
Outputs: '95% CI for the difference in ATE: -0.56 +/- 0.051'
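The calibration output below is presumably from grf's test_calibration(); the call itself is not shown in the snippet above, so this is an assumed sketch:

test_calibration(cf)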
Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:
Estimate Std. Error t value Pr(>t)
mean.forest.prediction 1.001729 0.041462 24.160 <2e-16 ***
differential.forest.prediction -682.911383 24.255158 -28.155 1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
GRF version
grf_2.2.1
Hi @RamirezAmayaS, what you are observing is unfortunately a known artifact of doing these kinds of evaluations with out-of-bag (OOB) estimates. The suggested modern approach is the RATE (rank-weighted average treatment effect) with separate training and evaluation samples. If you repeat your example above with a train/test split, you should see a flat TOC curve and a RATE close to zero.
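Concretely, that train/evaluation RATE workflow looks something like this, reusing X, Y, and Z from the reproduction above (the split and variable names here are just illustrative):

# Split the data: one forest produces prioritization scores, a separate forest evaluates them.
train = sample(nrow(X), floor(nrow(X) / 2))
cf.priority = causal_forest(X[train, ], Y[train], Z[train])
cf.eval = causal_forest(X[-train, ], Y[-train], Z[-train])

# Prioritize the held-out sample by the training forest's estimated CATEs.
priorities = predict(cf.priority, X[-train, ])$predictions

# RATE estimated on the evaluation forest; with random covariates it should be near zero.
rate = rank_average_treatment_effect(cf.eval, priorities)
rate
plot(rate)  # TOC curve, expected to be approximately flat here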
Hi @erikcs, thanks for the suggestion. I'll try the RATE approach. Do you happen to know of any reference explaining why the OOB evaluation fails?
Outputs: estimate: -0.00124768810374905 std.err: 0.0182608951524164
Outputs: estimate: 0.608046759001875 std.err: 0.0182508648601049