diff --git a/.DS_Store b/.DS_Store new file mode 100644 index 0000000..fc1fe52 Binary files /dev/null and b/.DS_Store differ diff --git a/.Rbuildignore b/.Rbuildignore index f469846..c7eb07f 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -5,3 +5,8 @@ ^development ^\.github$ +^cran-comments\.md$ +^LICENSE\.md$ +^_pkgdown\.yml$ +^docs$ +^pkgdown$ diff --git a/.github/workflows/pkgdown.yaml b/.github/workflows/pkgdown.yaml new file mode 100644 index 0000000..4bbce75 --- /dev/null +++ b/.github/workflows/pkgdown.yaml @@ -0,0 +1,50 @@ +# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples +# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help +on: + push: + branches: [main, master] + pull_request: + branches: [main, master] + release: + types: [published] + workflow_dispatch: + +name: pkgdown.yaml + +permissions: read-all + +jobs: + pkgdown: + runs-on: ubuntu-latest + # Only restrict concurrency for non-PR jobs + concurrency: + group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }} + env: + GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} + permissions: + contents: write + steps: + - uses: actions/checkout@v4 + + - uses: r-lib/actions/setup-pandoc@v2 + + - uses: r-lib/actions/setup-r@v2 + with: + use-public-rspm: true + + - uses: r-lib/actions/setup-r-dependencies@v2 + with: + extra-packages: any::pkgdown, local::. + needs: website + + - name: Build site + run: pkgdown::build_site_github_pages(new_process = FALSE, install = FALSE) + shell: Rscript {0} + + - name: Deploy to GitHub pages 🚀 + if: github.event_name != 'pull_request' + uses: JamesIves/github-pages-deploy-action@v4.5.0 + with: + clean: false + branch: gh-pages + folder: docs diff --git a/.github/workflows/test-coverage.yaml b/.github/workflows/test-coverage.yaml new file mode 100644 index 0000000..0c3a164 --- /dev/null +++ b/.github/workflows/test-coverage.yaml @@ -0,0 +1,31 @@ +name: Run tests and upload coverage + +on: + push: + branches: [main, master] + pull_request: + branches: [main, master] + +jobs: + test: + name: Run tests and collect coverage + runs-on: ubuntu-latest + steps: + - name: Checkout + uses: actions/checkout@v4 + with: + fetch-depth: 0 + + - name: Set up Node + uses: actions/setup-node@v4 + + - name: Install dependencies + run: npm install + + - name: Run tests + run: npx jest --coverage + + - name: Upload results to Codecov + uses: codecov/codecov-action@v4 + with: + token: ${{ secrets.CODECOV_TOKEN }} \ No newline at end of file diff --git a/.gitignore b/.gitignore index 093b1e7..698c87f 100644 --- a/.gitignore +++ b/.gitignore @@ -37,4 +37,6 @@ vignettes/*.pdf #interactive stuff *development/old -*development/output \ No newline at end of file +*development/output +docs +inst/doc diff --git a/DESCRIPTION b/DESCRIPTION index 2e6ffd1..14c74f2 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,28 +1,37 @@ Package: fastdid Type: Package -Title: lightning-fast staggered Difference-in-Difference estimators -Version: 0.9.4 -Date: 2024-08-05 +Title: Fast Staggered Difference-in-Difference Estimators +Version: 0.9.9 +Date: 2024-09-08 Authors@R: c( person("Lin-Tung","Tsai", - role = c("aut", "cre"), + role = c("aut", "cre", "cph"), email = "tsaidondon@gmail.com"), person(given = "Maxwell", family = "Kellogg", , - role = "aut"), + role = "ctb"), person(given = "Kuan-Ju", family = "Tseng", , - role = "aut")) + role = "ctb")) Maintainer: Lin-Tung Tsai -Description: **fastdid** is a lightning-fast implementation of Callaway and Sant'Anna's (2021) staggered Difference-in-Differences (DiD) estimators. DiD setup for millions of units used to take hours to run. With **fastdid**, it takes seconds. -License: GPL (>= 2) +Description: A fast and flexible implementation of Callaway and Sant'Anna's (2021) staggered Difference-in-Differences (DiD) estimators and its extensions, fastdid reduces the computation time from hours to seconds, and incorporates extensions such as time-varying covariates and multiple events. +License: MIT + file LICENSE Imports: data.table, stringr, BMisc, collapse, dreamerr (>= 1.4.0), - parglm -Suggests: ggplot2, did + parglm, + ggplot2 +Suggests: + did, + knitr, + parallel, + rmarkdown, + tinytest Encoding: UTF-8 RoxygenNote: 7.3.2 +URL: https://github.com/TsaiLintung/fastdid, https://tsailintung.github.io/fastdid/ +BugReports: https://github.com/TsaiLintung/fastdid/issues +VignetteBuilder: knitr diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..e1e30d3 --- /dev/null +++ b/LICENSE @@ -0,0 +1,2 @@ +YEAR: 2024 +COPYRIGHT HOLDER: fastdid authors diff --git a/LICENSE.md b/LICENSE.md new file mode 100644 index 0000000..c838e64 --- /dev/null +++ b/LICENSE.md @@ -0,0 +1,21 @@ +# MIT License + +Copyright (c) 2024 fastdid authors + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/NAMESPACE b/NAMESPACE index b902a7c..758cbdf 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -3,17 +3,20 @@ export(fastdid) export(plot_did_dynamics) export(sim_did) -import(BMisc) import(data.table) import(dreamerr) -import(parglm) +import(ggplot2) import(stringr) +importFrom(BMisc,multiplier_bootstrap) importFrom(collapse,allNA) importFrom(collapse,fnrow) importFrom(collapse,fnunique) importFrom(collapse,fsum) importFrom(collapse,na_insert) importFrom(collapse,whichNA) +importFrom(parallel,mclapply) +importFrom(parglm,parglm.control) +importFrom(parglm,parglm.fit) importFrom(stats,as.formula) importFrom(stats,binomial) importFrom(stats,fitted) @@ -22,3 +25,4 @@ importFrom(stats,quantile) importFrom(stats,rnorm) importFrom(stats,sd) importFrom(stats,vcov) +importFrom(stats,weighted.mean) diff --git a/NEWS.md b/NEWS.md new file mode 100644 index 0000000..6d11556 --- /dev/null +++ b/NEWS.md @@ -0,0 +1,40 @@ +# 0.9.9 (2024/9/1) + +- add double did (see the vignette for the introduction) +- add `parallel`, parallization for unix systems, useful if the number of g-t is large. +- add `full`, return full result such as influence function, aggregate scheme, and such +- add `min`/`max_dynamic`, `custom_scheme` to experimental features + +# 0.9.4 (2024/8/2) + +> [!WARNING] +> Some BREAKING change is introduced in this update. + +- add uniform confidence interval option with `cband` and significance level `alpha`, confidence interval are now provided in result as column `att_ciub` and `att_cilb` +- BREAKING: `filtervar`, `max_control_cohort_diff`, `min_control_cohort_diff` are moved into the experimental features. See the above section for the explanation. +- add `max_dynamic` and `min_dynamic` as experimental features. +- more informative error message when estimation fails for a specific `gt`, some internal interface overhaul + +# 0.9.3 (2024/5/7) + +- add anticipation and varying base period option +- add min and max control cohort difference +- add time-varying control ([reference](https://arxiv.org/abs/2202.02903)) +- add filtervar + +0.9.3.1 (2024/5/24): fix the bug with `univar == clustervar` (TODO: address problems with name-changing and collision). +0.9.3.2 (2024/7/17): fix group_time result when using `control_type = "notyet"` and make the base period in plots adapt to anticipation. +0.9.3.3 (2024/7/22): fix anticipation out of bound problem, more permanent solution for group_time target problem + +# 0.9.2 (2023/12/20) + +- add support to doubly robust and outcome regression estimators +- add support to unbalanced panels (simple and ipw only) +- add support to balanced composition option in dynamics aggregation +- fixed argument checking that was not working properly +- set the default to copying the entire dataset to avoid unexpected modification of the original data (thanks @grantmcdermott for the suggestion.) + +# 0.9.1 (2023/10/20) + +- now supprts estimation for multiple outcomes in one go! +- data validation: no longer check missing values for columns not used. \ No newline at end of file diff --git a/R/aggregate_gt.R b/R/aggregate_gt.R index 33db8ee..1603ae5 100644 --- a/R/aggregate_gt.R +++ b/R/aggregate_gt.R @@ -1,67 +1,126 @@ +# high level ------------------------------------------------------------------- + aggregate_gt <- function(all_gt_result, aux, p){ - rbindlist(lapply(all_gt_result, aggregate_gt_outcome, aux, p)) + results <- lapply(all_gt_result, aggregate_gt_outcome, aux, p) + return(list( + est = rbindlist(lapply(results, function(x) {x$result})), + inf_func = lapply(results, function(x) {x$inf_func}), + agg_weight_matrix = lapply(results, function(x) {x$weight_matrix}) + )) } aggregate_gt_outcome <- function(gt_result, aux, p){ - + + #get aggregation scheme from g-t to target parameters agg_sch <- get_agg_sch(gt_result, aux, p) + + if(!p$event_specific | is.na(p$cohortvar2)){ + att <- gt_result$att + inf_func <- gt_result$inf_func %*% t(agg_sch$agg_weights) + } else { + att <- agg_sch$es_weight %*% gt_result$att + inf_func <- gt_result$inf_func %*% t(agg_sch$agg_weights %*% agg_sch$es_weight) + } + + #get att + agg_att <- agg_sch$agg_weights %*% att + + #get influence function matrix - #get att and se - agg_att <- get_agg_att(gt_result, agg_sch, p) - inf_matrix <- get_agg_inf(gt_result, agg_sch, aux, p) - agg_se_result <- get_se(inf_matrix, aux, p) + inf_weights <- get_weight_influence(att, agg_sch, aux, p) + inf_matrix <- inf_func + inf_weights + + #get se + agg_se <- get_se(inf_matrix, aux, p) # post process - result <- data.table(agg_sch$targets, agg_att, agg_se_result$se) + result <- data.table(agg_sch$targets, agg_att, agg_se$se) names(result) <- c("target", "att", "se") result[,`:=`(outcome = gt_result$outname, - att_ciub = att+se*agg_se_result$crit_val, - att_cilb = att-se*agg_se_result$crit_val)] + att_ciub = att+se*agg_se$crit_val, + att_cilb = att-se*agg_se$crit_val)] - return(result) + return(list(result = result, + inf_func = inf_matrix, + weight_matrix = agg_sch$agg_weights)) } +# scheme ------------------------------------------------------------------------ + #scheme for aggregation get_agg_sch <- function(gt_result, aux, p){ - - #setup stuff - weights <- aux$weights - id_cohorts <- aux$dt_inv[, G] - result_type <- p$result_type - agg_weights <- data.table() - + #create group_time - id_dt <- data.table(weight = weights/sum(weights), G = id_cohorts) + id_dt <- data.table(weight = aux$weights/sum(aux$weights), G = aux$dt_inv[, G]) pg_dt <- id_dt[, .(pg = sum(weight)), by = "G"] - group_time <- gt_result$gt |> merge(pg_dt, by = "G") - setorder(group_time, time, G) #change the order to match the order in gtatt - gt_count <- group_time[, .N] + group_time <- gt_result$gt |> merge(pg_dt, by = "G", sort = FALSE) + group_time[, mg := ming(G)] + group_time[, G1 := g1(G)] + group_time[, G2 := g2(G)] + setorder(group_time, time, mg, G1, G2) #change the order to match the order in gtatt + if(!all(names(gt_result$att) == group_time[, paste0(G, ".", time)])){stop("some bug makes gt misaligned, please report this to the maintainer. Thanks.")} + + #get the event-specific matrix, and available ggts + if(p$event_specific & !is.na(p$cohortvar2)){ + es <- get_es_scheme(group_time, aux, p) + group_time <- es$group_time #some gt may not have availble effect (ex: g1 == g2) + es_weight <- as.matrix(es$es_weight) + } else { + es_weight <- NULL + } + + #choose the target based on aggregation type + tg <- get_agg_targets(group_time, p) + group_time <- tg$group_time + targets <- tg$targets - #nothing to do - if(result_type == "group_time"){ - return(list(targets = group_time[, paste0(G, ".", time)])) + + #get aggregation weights + agg_weights <- data.table() + for(tar in targets){ #the order matters + + group_time[, weight := 0] #weight is 0 if not a target + group_time[target == tar & used, weight := pg/sum(pg)] + target_weights <- group_time[, .(weight)] |> transpose() + + agg_weights <- rbind(agg_weights, target_weights) } + group_time[, pg := NULL] - #choose the target based on aggregation type - group_time[, post := as.numeric(ifelse(time >= G, 1, -1))] - if (result_type == "dynamic") { - group_time[, target := time-G] - } else if (result_type == "group") { - group_time[, target := G*post] # group * treated - } else if (result_type == "time") { - group_time[, target := time*post] #calendar time * treated - } else if (result_type == "simple") { - group_time[, target := post] #treated / not treated - } - - targets <- sort(group_time[, unique(target)]) + agg_weights <- as.matrix(agg_weights) + + return(list(agg_weights = agg_weights, #a matrix of each target and gt's weight in it + targets = targets, + group_time = group_time, + es_weight = es_weight)) +} + +#get the target parameters +get_agg_targets <- function(group_time, p){ + group_time[, post := as.numeric(ifelse(time >= g1(G), 1, -1))] + switch(p$result_type, + dynamic = group_time[, target := time-g1(G)], + group = group_time[, target := g1(G)*post], # group * treated + time = group_time[, target := time*post], + simple = group_time[, target := post], + group_time = group_time[, target := paste0(g1(G), ".", time)], + group_group_time = group_time[, target := paste0(G, ".", time)] , + dynamic_stagger = group_time[, target := paste0(time-g1(G), ".", g1(G)-g2(G))] + ) + + #allow custom aggregation scheme, this overides other stuff + if(!is.na(p$exper$aggregate_scheme)){ + group_time[, target := eval(str2lang(p$exper$aggregate_scheme))] + } + + targets <- group_time[, unique(target)] #for balanced cohort composition in dynamic setting #a cohort us only used if it is seen for all dynamic time - if(result_type == "dynamic" & !is.na(p$balanced_event_time)){ + if(p$result_type == "dynamic" & !is.na(p$balanced_event_time)){ - cohorts <- group_time[, .(max_et = max(time-G), - min_et = min(time-G)), by = "G"] + cohorts <- group_time[, .(max_et = max(target), #event time is target if in dynamic + min_et = min(target)), by = "G"] cohorts[, used := max_et >= p$balanced_event_time] #the max if(!cohorts[, any(used)]){stop("balanced_comp_range outside avalible range")} group_time[, used := G %in% cohorts[used == TRUE, G]] @@ -70,75 +129,67 @@ get_agg_sch <- function(gt_result, aux, p){ } else{group_time[, used := TRUE]} - for(tar in targets){ #the order matters - - group_time[, targeted := target == tar & used] - - total_pg <- group_time[targeted == TRUE, sum(pg)] #all gt that fits in the target - group_time[, weight := ifelse(targeted, pg/total_pg, 0)] #weight is 0 if not a target - target_weights <- group_time[, .(weight)] |> transpose() - - group_time[, targeted := NULL] - - agg_weights <- rbind(agg_weights, target_weights) - } + return(list(group_time = group_time, targets = targets)) - return(list(agg_weights = as.matrix(agg_weights), #a matrix of each target and gt's weight in it - targets = targets, - group_time = group_time)) } -#aggregated influence function -get_agg_inf <- function(gt_result, agg_sch, aux, p){ +# influence function ------------------------------------------------------------ + +get_weight_influence <- function(att, agg_sch, aux, p){ + + group <- agg_sch$group_time + + id_dt <- data.table(weight = aux$weights/sum(aux$weights), G = aux$dt_inv[, G]) + pg_dt <- id_dt[, .(pg = sum(weight)), by = "G"] + group <- group |> merge(pg_dt, by = "G", sort = FALSE) + + group[, time := as.integer(time)] + + if(is.na(p$cohortvar2)){ + group[, G := as.integer(G)] + setorder(group, time, G) + } else { + group[, mg := ming(G)] + group[, G1 := g1(G)] + group[, G2 := g2(G)] + setorder(group, time, mg, G1, G2) #sort + } + + if(!p$parallel){ + inf_weights <- sapply(asplit(agg_sch$agg_weights, 1), function (x){ + get_weight_influence_param(x, group, att, aux, p) + }) + } else { + inf_weights <- matrix(unlist(mclapply(asplit(agg_sch$agg_weights, 1), function (x){ + get_weight_influence_param(x, group, att, aux, p) + })), ncol = length(agg_sch$targets)) + } - if(p$result_type == "group_time"){return(gt_result$inf_func)} - inf_weights <- sapply(asplit(agg_sch$agg_weights, 1), function (x){ - get_weight_influence(x, gt_result$att, aux$weights, aux$dt_inv[, G], agg_sch$group_time[, .(G, time)]) - }) + return(inf_weights) - #aggregated influence function - inf_matrix <- (gt_result$inf_func %*% t(agg_sch$agg_weights)) + inf_weights - return(inf_matrix) } #influence from weight calculation -get_weight_influence <- function(agg_weights, gt_att, weights, id_cohorts, group) { - - keepers <- which(agg_weights > 0) +get_weight_influence_param <- function(agg_weights, group, gt_att, aux, p) { + + keepers <- which(agg_weights != 0) + group <- group[keepers,] + #moving this outside will create a g*t*id matrix, not really worth the memory + keepers_matrix <- as.matrix(aux$weights*sapply(1:nrow(group), function(g){as.integer(aux$dt_inv[, G] == group[g,G]) - group[g,pg]})) - id_dt <- data.table(weight = weights/sum(weights), G = id_cohorts) - pg_dt <- id_dt[, .(pg = sum(weight)), by = "G"] - group <- group |> merge(pg_dt, by = "G") + # gt weight = pgi / sum(pgi) + if1 <- keepers_matrix/sum(group[,pg]) #numerator + if2 <- rowSums(keepers_matrix) %*% t(group[,pg])/(sum(group[,pg])^2) #denominator - group[, time := as.integer(time)] - group[, G := as.integer(G)] - setorder(group, time, G) - - # effect of estimating weights in the numerator - if1 <- sapply(keepers, function(k) { - (weights*BMisc::TorF(id_cohorts == group[k,G]) - group[k,pg]) / - sum(group[keepers,pg]) - }) - # effect of estimating weights in the denominator - if2 <- base::rowSums(sapply(keepers, function(k) { - weights*BMisc::TorF(id_cohorts == group[k,G]) - group[k,pg] - })) %*% - t(group[keepers,pg]/(sum(group[keepers,pg])^2)) # return the influence function for the weights inf_weight <- (if1 - if2) %*% as.vector(gt_att[keepers]) inf_weight[abs(inf_weight) < sqrt(.Machine$double.eps)*10] <- 0 #fill zero + return(inf_weight) } -#aggregated att -get_agg_att <- function(gt_result, agg_sch, p){ - if(p$result_type == "group_time"){ - return(as.vector(gt_result$att)) - } else { - return(agg_sch$agg_weights %*% gt_result$att) - } -} +# se ------------------------------------------------------------------- #aggregated standard error get_se <- function(inf_matrix, aux, p) { @@ -166,9 +217,6 @@ get_se <- function(inf_matrix, aux, p) { #get sigma se <- dt_se[,(boot_top-boot_bot)/(qnorm(top_quant) - qnorm(bot_quant))] se[se < sqrt(.Machine$double.eps)*10] <- NA - - - } else { diff --git a/R/aux_funcs.R b/R/aux_funcs.R new file mode 100644 index 0000000..1278ecf --- /dev/null +++ b/R/aux_funcs.R @@ -0,0 +1,248 @@ +# auxilary steps in the main fastdid function + +get_exper_default <- function(exper, exper_args){ + for(arg in exper_args){ + if(is.null(exper[[arg]])){ + exper[[arg]] <- NA + } + } + + if(!is.na(exper$only_balance_2by2) & exper$only_balance_2by2){ #will create this col in the get_aux part + exper$filtervar <- "no_na" + exper$filtervar_post <- "no_na" + } + + return(exper) +} + +coerce_dt <- function(dt, p){ + + if(!is.na(p$cohortvar2)){return(coerce_dt_doub(dt, p))} #in doubledid.R + + #chcek if there is availble never-treated group + if(!is.infinite(dt[, max(G)]) & p$control_option != "notyet"){ + warning("no never-treated availble, effectively using not-yet-treated control") + } + + if(p$allow_unbalance_panel){ + dt_inv_raw <- dt[dt[, .I[1], by = unit]$V1] + setorder(dt_inv_raw, G) + dt_inv_raw[, new_unit := 1:.N] #let unit start from 1 .... N, useful for knowing which unit is missing + dt <- dt |> merge(dt_inv_raw[,.(unit, new_unit)], by = "unit", sort = FALSE) + dt[, unit := new_unit] + } + + setorder(dt, time, G, unit) #sort the dataset essential for the sort-once-quick-access + + #deal with time, coerice time to 1,2,3,4,5....... + time_periods <- dt[, unique(time)] + time_size <- length(time_periods) + + #TODO: this part is kinda ugly + time_offset <- min(time_periods) - 1 #assume time starts at 1, first is min after sort :) + gcol <- str_subset(names(dt), ifelse(is.na(p$cohortvar2), "G", "G1|G2")) + if(time_offset != 0){ + dt[, G := G-time_offset] + + dt[, time := time-time_offset] + time_periods <- time_periods - time_offset + } + + time_step <- 1 #time may not jump at 1 + if(any(time_periods[2:length(time_periods)] - time_periods[1:length(time_periods)-1] != 1)){ + time_step <- time_periods[2]-time_periods[1] + time_periods <- (time_periods-1)/time_step+1 + if(any(time_periods[2:length(time_periods)] - time_periods[1:length(time_periods)-1] != 1)){stop("time step is not uniform")} + dt[G != 1, G := (G-1)/time_step+1] + + dt[time != 1, time := (time-1)/time_step+1] + } + + #add the information to t + t <- list() + t$time_step <- time_step + t$time_offset <- time_offset + + if(nrow(dt) == 0){ + stop("no data after coercing the dataset") + } + + return(list(dt = dt, p = p, t = t)) + +} + +get_auxdata <- function(dt, p){ + + time_periods <- dt[, unique(time)] + id_size <- dt[, uniqueN(unit)] + + #construct the outcomes list for fast access later + #loop for multiple outcome + outcomes_list <- list() + for(outcol in p$outcomevar){ + outcomes <- list() + + if(!p$allow_unbalance_panel){ + for(i in time_periods){ + start <- (i-1)*id_size+1 + end <- i*id_size + outcomes[[i]] <- dt[seq(start,end), get(outcol)] + } + } else { + + for(i in time_periods){ + #populate a outcome vector of length N with outcome data in the right place + #NA is data gone or missing, will be addressed in estimate_did_rc + outcome_period <- rep(NA, id_size) + data_pos <- dt[time == i, unit] #units observed in i + outcome_period[data_pos] <- dt[time == i, get(outcol)] + outcomes[[i]] <- outcome_period + } + + } + + outcomes_list[[outcol]] <- outcomes + } + + #the time-invariant parts + if(!p$allow_unbalance_panel){ + dt_inv <- dt[1:id_size] + } else { + dt_inv <- dt[dt[, .I[1], by = unit]$V1] #the first observation + setorder(dt_inv, unit) #can't move this outside + } + + cohorts <- dt_inv[, unique(G)] + cohort_sizes <- dt_inv[, .(cohort_size = .N) , by = G] + + # the optional columns + varycovariates <- list() + if(!allNA(p$varycovariatesvar)){ + for(i in time_periods){ + start <- (i-1)*id_size+1 + end <- i*id_size + varycovariates[[i]] <- dt[seq(start,end), .SD, .SDcols = p$varycovariatesvar] + } + } else { + varycovariates <- NA + } + + #create na indicator for filtering + if(!is.na(p$exper$only_balance_2by2) & p$exper$only_balance_2by2){ + if("no_na" %in% names(dt)){stop("no_na is already in dt, consider using another column name")} + varnames <- unlist(p[str_ends(names(p), "var")], recursive = TRUE) #get all the argument that ends with "var" + varnames <- varnames[!varnames %in% c(p$timevar, p$unitvar, p$cohortvar) & !is.na(varnames) & !is.null(varnames)] + dt[, no_na := TRUE] + for(col in varnames){ #nothing can be + if(is.na(col)){next} + dt[is.na(get(col)), no_na := TRUE] + } + } + + # filters + filters <- list() + if(!is.na(p$exper$filtervar)){ + for(t in time_periods){ + filters[[t]] <- rep(NA, id_size) + data_pos <- dt[time == t, unit] #units observed in i + filters[[t]][data_pos] <- unlist(dt[time == t, .SD, .SDcols = p$exper$filtervar]) + } + } else { + filters <- NA + } + + if(!allNA(p$covariatesvar)){ + covariates <- dt_inv[,.SD, .SDcols = p$covariatesvar] + } else { + covariates <- NA + } + + if(!is.na(p$clustervar)){ + cluster <- dt_inv[, .SD, .SDcols = p$clustervar] |> unlist() + } else { + cluster <- NA + } + + if(!is.na(p$weightvar)){ + weights <- dt_inv[, .SD, .SDcols = p$weightvar] |> unlist() + weights <- weights/mean(weights) #normalize + } else { + weights <- rep(1, id_size) + } + + aux <- as.list(environment()) + aux$dt <- NULL + aux$p <- NULL + class(aux) <- "locked" + + return(aux) + +} + +convert_targets <- function(results, p, t){ + + if(!is.na(p$exper$aggregate_scheme)){return(results)} #no conversion back if use custom + + switch(p$result_type, + dynamic = { + results[, event_time := target] + setcolorder(results, "event_time", before = 1) + }, + group = { + results[, type := ifelse(target >= 0, "post", "pre")] + results[, cohort := recover_time(abs(target), t)] + setcolorder(results, "cohort", before = 1) + }, + time = { + results[, type := ifelse(target >= 0, "post", "pre")] + results[, time := recover_time(abs(target), t)] + setcolorder(results, "time", before = 1) + }, + simple = { + results[, type := ifelse(target >= 0, "post", "pre")] + }, + group_time = { + results[, cohort := as.numeric(str_split_i(target, "\\.", 1))] + results[, time := as.numeric(str_split_i(target, "\\.", 2))] + + #recover the time + results[, cohort := recover_time(cohort, t)] + results[, time := recover_time(time, t)] + + }, + group_group_time = { + results[, cohort := str_split_i(target, "\\.", 1)] + results[, time := as.numeric(str_split_i(target, "\\.", 2))] + + results[, cohort1 := g1(cohort)] + results[, cohort2 := g2(cohort)] + + results[, cohort1 := recover_time(cohort1, t)] + results[, cohort2 := recover_time(cohort2, t)] + results[, time := recover_time(time, t)] + results[, `:=`(cohort = NULL)] + }, + dynamic_sq = { + results[, event_time_1 := as.numeric(str_split_i(target, "\\.", 1))] + results[, event_stagger := as.numeric(str_split_i(target, "\\.", 2))] + } + ) + + results[, target := NULL] + + return(results) +} + +# small stuff --------- + +recover_time <- function(time, t){ + return(((time-1)*t$time_step)+1+t$time_offset) +} + +#locked list +#from: https://stackoverflow.com/questions/58776481/make-r-function-return-a-locked-immutable-list +.S3method("[[<-", "locked", function(value) {stop("Can't assign into locked object")}) +.S3method("[<-", "locked", function(value) {stop("Can't assign into locked object")}) +.S3method("$<-", "locked", function(value) {stop("Can't assign into locked object")}) +# a <- list(b = 1, c = 2) +# class(a) <- c("locked") \ No newline at end of file diff --git a/R/double_did.R b/R/double_did.R new file mode 100644 index 0000000..19b7cf0 --- /dev/null +++ b/R/double_did.R @@ -0,0 +1,183 @@ + + +# utils ----------------------------------- + +#g11 <- c(1,1,2,3,4) +#g22 <- c(2,1,3,2,4) +#GG <- as.factor(paste0(g11, ".", g22)) + +g1 <- function(GG){ + if(is.numeric(GG)){return(GG)} + return(as.numeric(str_split_i(GG, "-", 1))) +} + +g2 <- function(GG){ + return(as.numeric(str_split_i(GG, "-", 2))) +} + +ming <- function(GG){ + if(is.numeric(GG)){return(GG)} + else {pmin(g1(GG), g2(GG))} +} + +# overiden function ------------------------------------------------- + +coerce_dt_doub <- function(dt, p){ + + #chcek if there is availble never-treated group + if(!is.infinite(dt[, max(ming(G))]) & p$control_option != "notyet"){ + warning("no never-treated availble, effectively using not-yet-treated control") + } + + setnames(dt, "G", "G1") + dt[, mg := pmin(G1, G2)] + setorder(dt, time, mg, G1, G2, unit) #for sort one quick access + + if(p$allow_unbalance_panel){ #let unit start from 1 .... N, useful for knowing which unit is missing + dt_inv_raw <- dt[dt[, .I[1], by = unit]$V1] + setorder(dt_inv_raw, mg, G1, G2) + dt_inv_raw[, new_unit := 1:.N] + dt <- dt |> merge(dt_inv_raw[,.(unit, new_unit)], by = "unit", sort = FALSE) + dt[, unit := new_unit] + } + + #deal with time, coerice time to 1,2,3,4,5....... + time_periods <- dt[, unique(time)] + time_size <- length(time_periods) + + time_offset <- min(time_periods) - 1 #assume time starts at 1, first is min after sort :) + gcol <- c("G1", "G2") + if(time_offset != 0){ + dt[, c(gcol) := .SD-time_offset, .SDcols = gcol] + dt[, time := time-time_offset] + time_periods <- time_periods - time_offset + } + + time_step <- 1 #time may not jump at 1 + if(any(time_periods[2:length(time_periods)] - time_periods[1:length(time_periods)-1] != 1)){ + time_step <- time_periods[2]-time_periods[1] + time_periods <- (time_periods-1)/time_step+1 + if(any(time_periods[2:length(time_periods)] - time_periods[1:length(time_periods)-1] != 1)){stop("time step is not uniform")} + + for(g in gcol){ + dt[get(g) != 1, c(g) := (get(g)-1)/time_step+1] + } + + dt[time != 1, time := (time-1)/time_step+1] + } + dt[, mg := pmin(G1, G2)] + dt[, G := paste0(G1, "-", G2)] #create G once its finalized + + #add the information to t + t <- list() + t$time_step <- time_step + t$time_offset <- time_offset + + if(nrow(dt) == 0){ + stop("no data after coercing the dataset") + } + + return(list(dt = dt, p = p, t = t)) + +} + +# aggregation scheme ----------------------------------------------------------- + +#the scheme for getting event-specific effect +get_es_scheme <- function(group_time, aux, p){ + + es_group_time <- copy(group_time) #group_time with available es effect + #create lookup + es_group_time[, mg := ming(G)] + es_group_time[, G1 := g1(G)] + es_group_time[, G2 := g2(G)] + es_weight_list <- list() + + ggt <- as.list(seq_len(nrow(group_time))) + if(!p$parallel){ + es_weight_list <- lapply(ggt, get_es_ggt_weight, group_time, aux, p) + } else { + es_weight_list <- mclapply(ggt, get_es_ggt_weight, group_time, aux, p, mc.cores = getDTthreads()) + } + + valid_ggt <- which(!sapply(es_weight_list, is.null)) + es_group_time <- es_group_time[valid_ggt] #remove the ones without + es_weight_list <- es_weight_list[valid_ggt] + es_weight <- do.call(rbind, es_weight_list) #not sure if the dim is right + + return(list(group_time = es_group_time, es_weight = es_weight)) + +} + +#get the scheme for retriving group-group-time estimates +get_es_ggt_weight <- function(ggt, group_time, aux, p){ + + group_time <- copy(group_time) #avoid accidental modification + + group_time[, weight := 0] #reset + t <- group_time[ggt, time] + g1 <- group_time[ggt, G1] + g2 <- group_time[ggt, G2] + gg <- group_time[ggt, G] + + if(is.infinite(g1)){return(NULL)} + + if(t < g2){ #direct pure effect + + group_time[ggt, weight := 1] #just use the observed effect + + } else if(g1 < g2) { #imputation = treat-pre + (control-post - control-pre) + + base_period <- g2 - 1 + if(base_period == t){return(NULL)} + min_control_cohort <- ifelse(p$double_control_option == "never", Inf, max(t,base_period)+1) + + #get the cohorts + tb <- group_time[,G == gg & time == base_period] + c <- group_time[, G1 == g1 & G2 >= min_control_cohort & G2 != g2] + if(p$control_option == "notyet"){ + c[group_time[, is.infinite(G2)]] <- FALSE + } + cp <- group_time[, c & time == t] + cb <- group_time[, c & time == base_period] + + #if any group have no available cohort, skip + if(sum(tb) == 0 | sum(cp) == 0 | sum(cb) == 0){return(NULL)} + + #assign the weights + group_time[tb, weight := pg/sum(pg)] + group_time[cp, weight := pg/sum(pg)] + group_time[cb, weight := -pg/sum(pg)] + + + } else if (g1 > g2) { #double did = (treat-post - treat-base) - (control-post - control-pre) + + base_period <- g1 - 1 + if(base_period == t){return(NULL)} + min_control_cohort <- ifelse(p$double_control_option == "never", Inf, max(t,base_period)+1) + + #get the cohorts + tp <- group_time[,.I == ggt] + tb <- group_time[,G == gg & time == base_period] + c <- group_time[,G2 == g2 & G1 >= min_control_cohort & G1 != g1] + if(p$control_option == "notyet"){ + c[group_time[, is.infinite(G1)]] <- FALSE + } + cp <- group_time[, c & time == t] + cb <- group_time[, c & time == base_period] + + #if any group have no available cohort, skip + if(sum(tp) == 0 | sum(tb) == 0 | sum(cp) == 0 | sum(cb) == 0){return(NULL)} + + #assign the weights + group_time[tp, weight := pg/sum(pg)] + group_time[tb, weight := -pg/sum(pg)] + group_time[cp, weight := -pg/sum(pg)] + group_time[cb, weight := pg/sum(pg)] + + } + + if(all(group_time[, weight] == 0)){return(NULL)} + return(group_time[, weight]) + +} diff --git a/R/estimate_did.R b/R/estimate_did.R index a3ab824..461b8c6 100644 --- a/R/estimate_did.R +++ b/R/estimate_did.R @@ -1,4 +1,4 @@ -estimate_did <- function(dt_did, covvars, p, last_coef, cache){ +estimate_did <- function(dt_did, covvars, p, cache){ #estimate did param <- as.list(environment()) @@ -10,8 +10,7 @@ estimate_did <- function(dt_did, covvars, p, last_coef, cache){ return(result) } -estimate_did_bp <- function(dt_did, covvars, p, - last_coef = NULL, cache){ +estimate_did_bp <- function(dt_did, covvars, p, cache){ # preprocess -------- oldn <- dt_did[, .N] @@ -33,10 +32,10 @@ estimate_did_bp <- function(dt_did, covvars, p, if(ipw){ if(is.null(cache)){ #if no cache, calcuate ipw #estimate the logit - prop_score_est <- suppressWarnings(parglm.fit(covvars, dt_did[, D], - family = stats::binomial(), start = last_coef, + prop_score_est <- suppressWarnings(parglm::parglm.fit(covvars, dt_did[, D], + family = stats::binomial(), weights = dt_did[, weights], - control = parglm.control(nthreads = getDTthreads()), + control = parglm.control(nthreads = ifelse(p$parallel, 1, getDTthreads())), #no parallel if already parallel intercept = FALSE)) class(prop_score_est) <- "glm" #trick the vcov function to think that this is a glm object to dispatch the write method #const is implicitly put into the ipw formula, need to incorporate it manually @@ -49,8 +48,8 @@ estimate_did_bp <- function(dt_did, covvars, p, logit_coef[is.na(logit_coef)|abs(logit_coef) > 1e10] <- 0 #put extreme value and na to 0 prop_score_fit <- fitted(prop_score_est) - if(max(prop_score_fit) >= 1){warning(paste0("support overlap condition violated for some group_time"))} - prop_score_fit <- pmin(1-1e-16, prop_score_fit) #for the ipw + if(max(prop_score_fit) >= 1-1e-10){warning(paste0("extreme propensity score: ", max(prop_score_fit), ", support overlap is likely to be violated"))} #<=0 (only in control) is fine for ATT since it is just not used + prop_score_fit <- pmin(1-1e-10, prop_score_fit) #for the ipw hess <- stats::vcov(prop_score_est) * n #for the influence function hess[is.na(hess)|abs(hess) > 1e10] <- 0 @@ -169,13 +168,11 @@ estimate_did_bp <- function(dt_did, covvars, p, inf_func <- rep(0, oldn) #the default needs to be 0 for the matrix multiplication inf_func_no_na <- inf_func_no_na * oldn / n #adjust the value such that mean over the whole id size give the right result inf_func[data_pos] <- inf_func_no_na - - return(list(att = att, inf_func = inf_func, logit_coef = logit_coef, #for next gt - cache_ps_fit = prop_score_fit, cache_hess = hess)) #for next outcome + + return(list(att = att, inf_func = inf_func, cache = list(ps = prop_score_fit, hess = hess))) #for next outcome } -estimate_did_rc <- function(dt_did, covvars, p, - last_coef = NULL, cache){ +estimate_did_rc <- function(dt_did, covvars, p, cache){ #TODO: skip if not enough valid data @@ -213,9 +210,9 @@ estimate_did_rc <- function(dt_did, covvars, p, #estimate the logit prop_score_est <- suppressWarnings(parglm.fit(covvars, dt_did[, D], - family = stats::binomial(), start = last_coef, + family = stats::binomial(), weights = dt_did[, weights*(inpre+inpost)*n/(n_pre+n_post)], #when seen in both pre and post have double weight - control = parglm.control(nthreads = getDTthreads()), + control = parglm.control(nthreads = ifelse(p$parallel, 1, getDTthreads())), intercept = FALSE)) #*(inpre+inpost) class(prop_score_est) <- "glm" #trick the vcov function to think that this is a glm object to dispatch the write method #const is implicitly put into the ipw formula, need to incorporate it manually @@ -248,20 +245,18 @@ estimate_did_rc <- function(dt_did, covvars, p, if(or){ - stop("or in RC should not be called right now") - control_bool_post <- dt_did[, D==0 & inpost] #control group and have obs in post period control_bool_pre <- dt_did[, D==0 & inpre] reg_coef_post <- stats::coef(stats::lm.wfit(x = covvars[control_bool_post,], y = dt_did[control_bool_post,post.y], w = dt_did[control_bool_post,weights])) - + reg_coef_pre <- stats::coef(stats::lm.wfit(x = covvars[control_bool_pre,], y = dt_did[control_bool_pre,pre.y], w = dt_did[control_bool_pre,weights])) - + if(anyNA(reg_coef_post)|anyNA(reg_coef_pre)){ stop("some outcome regression resulted in NA coefficients, likely cause by perfect colinearity") } - + #the control function from outcome regression dt_did[, or_delta_post := as.vector(tcrossprod(reg_coef_post, covvars))] dt_did[, or_delta_pre := as.vector(tcrossprod(reg_coef_pre, covvars))] @@ -317,27 +312,27 @@ estimate_did_rc <- function(dt_did, covvars, p, if(or){ - stop("or in RC should not be called right now") + M1_post <- colSums(dt_did[, inpost*treat_ipw_weight/n] * covvars, na.rm = TRUE) / mean_wtpo M1_pre <- colSums(dt_did[, inpre*treat_ipw_weight/n] * covvars, na.rm = TRUE) / mean_wtpr M3_post <- colSums(dt_did[, inpost*cont_ipw_weight/n] * covvars, na.rm = TRUE) / mean_wcpo M3_pre <- colSums(dt_did[, inpre*cont_ipw_weight/n] * covvars, na.rm = TRUE) / mean_wcpr - + or_x_post <- dt_did[, inpost*weights*(1-D)] * covvars or_x_pre <- dt_did[, inpre*weights*(1-D)] * covvars or_ex_post <- dt_did[, inpost*weights*(1-D)*(post.y - or_delta_post)] * covvars or_ex_pre <- dt_did[, inpre*weights*(1-D)*(pre.y - or_delta_pre)] * covvars XpX_post <- crossprod(or_x_post, covvars)/n_post XpX_pre <- crossprod(or_x_pre, covvars)/n_pre - + #calculate alrw = eX (XpX)^-1 by solve XpX*alrw = ex, much faster since avoided inv asym_linear_or_post <- t(solve(XpX_post, t(or_ex_post))) asym_linear_or_pre <- t(solve(XpX_pre, t(or_ex_pre))) - + #or for treat inf_treat_or_post <- -asym_linear_or_post %*% M1_post #a negative sign here, since or_delta is subtracted from the att, THE PROBLEM inf_treat_or_pre <- -asym_linear_or_pre %*% M1_pre - + #or for control inf_cont_or_post <- -asym_linear_or_post %*% M3_post inf_cont_or_pre <- -asym_linear_or_pre %*% M3_pre @@ -377,7 +372,12 @@ estimate_did_rc <- function(dt_did, covvars, p, inf_func <- rep(0, oldn) #the default needs to be 0 for the matrix multiplication inf_func[data_pos] <- inf_func_no_na_post - inf_func_no_na_pre - return(list(att = att, inf_func = inf_func, logit_coef = logit_coef, #for next gt - cache_ps_fit = prop_score_fit, cache_hess = hess)) #for next outcome + return(list(att = att, inf_func = inf_func, cache = list(ps = prop_score_fit, hess = hess))) #for next outcome +} + +# utilities ------ + +reverse_col <- function(x){ + return(x[,ncol(x):1]) } diff --git a/R/estimate_gtatt.R b/R/estimate_gtatt.R index 8de32d4..e5d2acb 100644 --- a/R/estimate_gtatt.R +++ b/R/estimate_gtatt.R @@ -11,62 +11,78 @@ estimate_gtatt <- function(aux, p){ return(outcome_results) } +#gtatt for each outcome estimate_gtatt_outcome <- function(y, aux, p, caches) { - - last_coef <- NULL - gt <- data.table() - gt_att <- c() - gt_inf_func <- data.table(placeholder = rep(NA, aux$id_size)) #populate with NA - for(t in aux$time_periods){ - for(g in aux$cohorts){ - - # preparation ----------------------- - - gt_name <- paste0(g,".",t) - - #determine the 2x2 - base_period <- get_base_period(g,t,p) - did_setup <- get_did_setup(g, t, base_period, aux, p) - if(is.null(did_setup)){next} #no gtatt if no did setup - - #covariates matrix - covvars <- get_covvars(base_period, t, aux, p) - - #the 2x2 dataset - cohort_did <- data.table(did_setup, y[[t]], y[[base_period]], aux$weights) - names(cohort_did) <- c("D", "post.y", "pre.y", "weights") - - # estimate -------------------- - result <- tryCatch(estimate_did(dt_did = cohort_did, covvars, p, - last_coef, caches[[gt_name]]), - error = function(e){stop("DiD estimation failed for group-", recover_time(g, p$time_offset, p$time_step) , - " time-", recover_time(t, p$time_offset, p$time_step), ": ", e)}) - - # post process -------------------- - - #collect the result - gt <- rbind(gt, data.table(G = g, time = t)) #the sequence matters for the weights - gt_att <- c(gt_att, att = result$att) - gt_inf_func[[gt_name]] <- result$inf_func - - #caches - last_coef <- result$logit_coef #for faster convergence in next iter - if(is.null(caches[[gt_name]])){caches[[gt_name]] <- result$cache_fit} - - rm(result) - - } + treated_cohort <- aux$cohorts[!is.infinite(ming(aux$cohorts))] #otherwise would try to calculate the pre-period of nevertreated in varying base period lol + gt_all <- expand.grid(g = treated_cohort, t = aux$time_periods, stringsAsFactors = FALSE) |> transpose() |> as.list() #first loop t then g + + #main estimation + if(!p$parallel){ + gt_results <- lapply(gt_all, estimate_gtatt_outcome_gt, y, aux, p, caches) + } else { + gt_results <- mclapply(gt_all, estimate_gtatt_outcome_gt, y, aux, p, caches) } + + #post process + gt_results <- gt_results[which(!sapply(gt_results, is.null))] #remove the ones with no valid didsetup + if(length(gt_results) == 0){stop("no valid group-times att to compute")} + + gt <- lapply(gt_results, function(x) {x$gt}) |> as.data.table() |> transpose() + names(gt) <- c("G", "time") + gt[, time := as.integer(time)] + + gt_att <- lapply(gt_results, function(x) {x$result$att}) + gt_inf_func <- lapply(gt_results, function(x) {x$result$inf_func}) + caches <- lapply(gt_results, function(x) {x$result$cache}) + + gt_names <- gt[,paste0(G,".",time)] + names(gt_att) <- gt_names + names(gt_inf_func) <- gt_names + names(caches) <- gt_names + + gt_inf_func <- do.call(cbind, gt_inf_func) + gt_att <- do.call(cbind, gt_att) |> t() + + return(list(est = list(gt = gt, att = gt_att, inf_func = gt_inf_func), caches = caches)) +} + +#gtatt for each outcome, each gt +estimate_gtatt_outcome_gt <- function(gt, y, aux, p, caches){ - gt_inf_func[,placeholder := NULL] - names(gt_att) <- names(gt_inf_func) - gt_inf_func <- as.matrix(gt_inf_func) + g <- gt[1] + t <- as.numeric(gt[2]) + + #find base time + gt_name <- paste0(g,".",t) + base_period <- get_base_period(g,t,p) + if(t == base_period | #no treatment effect for the base period + !base_period %in% aux$time_period){ #base period out of bounds + return(NULL) + } + #find treatment and control group + did_setup <- get_did_setup(g,t, base_period, aux, p) + valid_tc_groups <- any(did_setup == 1) & any(did_setup == 0) #if takes up too much time, consider use collapse anyv, but right now quite ok + if(!isTRUE(valid_tc_groups)){return(NULL)} #no treatment group or control group #isTRUE for na as false + + #covariates matrix + covvars <- get_covvars(base_period, t, aux, p) + + #the 2x2 dataset + cohort_did <- data.table(did_setup, y[[t]], y[[base_period]], aux$weights) + names(cohort_did) <- c("D", "post.y", "pre.y", "weights") + + # estimate -------------------- + result <- tryCatch(estimate_did(dt_did = cohort_did, covvars, p, caches[[gt_name]]), + error = function(e){stop("2-by-2 DiD failed for internal group-time ",g , + "-", t, ": ", e)}) + return(list(gt = gt, result = result)) - return(list(est = list(gt = gt, att = gt_att, inf_func = gt_inf_func), caches = caches)) } + get_base_period <- function(g,t,p){ + g <- ming(g) #for two period if(p$base_period == "universal"){ base_period <- g-1-p$anticipation } else { @@ -77,54 +93,41 @@ get_base_period <- function(g,t,p){ get_did_setup <- function(g, t, base_period, aux, p){ - treated_cohorts <- aux$cohorts[!is.infinite(aux$cohorts)] - - #get the range of cohorts - if(p$control_option == "never"){ - min_control_cohort <- Inf - } else { - min_control_cohort <- max(t, base_period)+p$anticipation+1 - } - max_control_cohort <- ifelse(p$control_option == "notyet", max(treated_cohorts), Inf) - - #experimental + treated_cohorts <- aux$cohorts[!is.infinite(ming(aux$cohorts))] + + min_control_cohort <- ifelse(p$control_option == "never", Inf, max(t, base_period)+p$anticipation+1) + max_control_cohort <- ifelse(p$control_option == "notyet", max(ming(treated_cohorts)), Inf) + if(!is.na(p$exper$max_control_cohort_diff)){ - max_control_cohort <- min(g+p$exper$max_control_cohort_diff, max(treated_cohorts)) - } - if(!is.na(p$exper$min_control_cohort_diff)){ - min_control_cohort <- max(g+p$exper$min_control_cohort_diff, min(treated_cohorts)) - } - if((!is.na(p$exper$max_dynamic))){ - if(t-g > p$exper$max_dynamic){return(NULL)} - } - if(!is.na(p$exper$min_dynamic)){ - if(t-g < p$exper$min_dynamic){return(NULL)} - } - - # invalid gt - if(t == base_period | #no treatment effect for the base period - base_period < min(aux$time_periods) | #no treatment effect for the first period, since base period is not observed - g >= max_control_cohort | #no treatment effect for never treated or the last treated cohort (for not yet notyet) - t >= max_control_cohort | #no control available if the last cohort is treated too - min_control_cohort > max_control_cohort){ #no control avalilble, most likely due to anticipation - return(NULL) + max_control_cohort <- min(g+p$exper$max_control_cohort_diff, max_control_cohort) } #select the control and treated cohorts did_setup <- rep(NA, aux$id_size) - did_setup[get_cohort_pos(aux$cohort_sizes, min_control_cohort, max_control_cohort)] <- 0 - did_setup[get_cohort_pos(aux$cohort_sizes, g)] <- 1 #treated cannot be controls, assign treated after control to overwrite + did_setup[get_control_pos(aux$cohort_sizes, min_control_cohort, max_control_cohort)] <- 0 + did_setup[get_treat_pos(aux$cohort_sizes, g)] <- 1 #treated cannot be controls, assign treated after control to overwrite if(!is.na(p$exper$filtervar)){ did_setup[!aux$filters[[base_period]]] <- NA #only use units with filter == TRUE at base period + + } + if(!is.na(p$exper$filtervar_post)){ + did_setup[!aux$filters[[t]]] <- NA #only use units with filter == TRUE at target period } return(did_setup) } -get_cohort_pos <- function(cohort_sizes, start_cohort, end_cohort = start_cohort){ - start <- cohort_sizes[G < start_cohort, sum(cohort_size)]+1 - end <- cohort_sizes[G <= end_cohort, sum(cohort_size)] +get_control_pos <- function(cohort_sizes, start_cohort, end_cohort = start_cohort){ + start <- cohort_sizes[ming(G) < start_cohort, sum(cohort_size)]+1 + end <- cohort_sizes[ming(G) <= end_cohort, sum(cohort_size)] + return(start:end) +} + +get_treat_pos <- function(cohort_sizes, treat_cohort){ #need to separate for double did, need to match exact g-g-t + index <- which(cohort_sizes[,G] == treat_cohort) + start <- ifelse(index == 1, 1, cohort_sizes[1:(index-1), sum(cohort_size)]+1) + end <- cohort_sizes[1:index, sum(cohort_size)] return(start:end) } diff --git a/R/fastdid.R b/R/fastdid.R index 04111da..19c91eb 100644 --- a/R/fastdid.R +++ b/R/fastdid.R @@ -1,64 +1,61 @@ #' Fast Staggered DID Estimation #' -#' Performs Difference-in-Differences (DID) estimation fast. +#' Performs Difference-in-Differences (DID) estimation. #' -#' @param data A data.table containing the panel data. -#' @param timevar The name of the time variable. -#' @param cohortvar The name of the cohort (group) variable. -#' @param unitvar The name of the unit (id) variable. -#' @param outcomevar The name of the outcome variable. -#' @param control_option The control units used for the DiD estimates. Default is "both". -#' @param result_type A character string indicating the type of result to be returned. Default is "group_time". -#' @param balanced_event_time A numeric scalar that indicates the max event time to balance the cohort composition, only meaningful when result_type == "dynamic". Default is NA -#' @param control_type The method for controlling for covariates. "ipw" for inverse probability weighting, "reg" for outcome regression, or "dr" for doubly-robust -#' @param allow_unbalance_panel Whether allow unbalance panel as input (if false will coerce the dataset to a balanced panel). Default is FALSE -#' @param boot Logical, indicating whether bootstrapping should be performed. Default is FALSE -#' @param biters The number of bootstrap iterations. Only relevant if boot = TRUE. Default is 1000. -#' @param cband Logical, indicate whether to use uniform confidence band or point-wise, defulat is FALSE (use point-wise) -#' @param alpha The significance level, default is 0.05 -#' @param weightvar The name of the weight variable, if not specified will cluster on unit level (optional). -#' @param clustervar The name of the cluster variable, can only be used when boot == TRUE (optional). -#' @param covariatesvar A character vector containing the names of time-invariant covariate variables (optional). -#' @param varycovariatesvar A character vector containing the names of time-varying covariate variables (optional). -#' @param copy whether to copy the dataset before processing, set to false to speed up the process, but the input data will be altered. -#' @param validate whether to validate the dataset before processing. -#' @param anticipation periods with aniticipation (delta in CS, default is 0, reference period is g - delta - 1). -#' @param exper the list of experimental features, for features that are not in CSDID originally. Generally less tested. -#' @param base_period same as did +#' @param data data.table, the dataset. +#' @param timevar character, name of the time variable. +#' @param cohortvar character, name of the cohort (group) variable. +#' @param unitvar character, name of the unit (id) variable. +#' @param outcomevar character vector, name(s) of the outcome variable(s). +#' @param control_option character, control units used for the DiD estimates, options are "both", "never", or "notyet". +#' @param result_type character, type of result to return, options are "group_time", "time", "group", "simple", "dynamic" (time since event), "group_group_time", or "dynamic_stagger". +#' @param balanced_event_time number, max event time to balance the cohort composition. +#' @param control_type character, estimator for controlling for covariates, options are "ipw" (inverse probability weighting), "reg" (outcome regression), or "dr" (doubly-robust). +#' @param allow_unbalance_panel logical, allow unbalance panel as input or coerce dataset into one. +#' @param boot logical, whether to use bootstrap standard error. +#' @param biters number, bootstrap iterations. Default is 1000. +#' @param cband logical, whether to use uniform confidence band or point-wise. +#' @param alpha number, the significance level. Default is 0.05. +#' @param weightvar character, name of the weight variable. +#' @param clustervar character, name of the cluster variable. +#' @param covariatesvar character vector, names of time-invariant covariate variables. +#' @param varycovariatesvar character vector, names of time-varying covariate variables. +#' @param copy logical, whether to copy the dataset. +#' @param validate logical, whether to validate the dataset. +#' @param anticipation number, periods with anticipation. +#' @param exper list, arguments for experimental features. +#' @param base_period character, type of base period in pre-preiods, options are "universal", or "varying". +#' @param full logical, whether to return the full result (influence function, call, weighting scheme, etc,.). +#' @param parallel logical, whether to use parallization on unix system. +#' @param cohortvar2 character, name of the second cohort (group) variable. +#' @param event_specific logical, whether to recover target treatment effect or use combined effect. +#' @param double_control_option character, control units used for the double DiD, options are "both", "never", or "notyet". #' -#' @import data.table parglm stringr dreamerr BMisc -#' @importFrom stats quantile vcov sd binomial fitted qnorm rnorm as.formula +#' @import data.table stringr dreamerr ggplot2 +#' @importFrom stats quantile vcov sd binomial fitted qnorm rnorm as.formula weighted.mean #' @importFrom collapse allNA fnrow whichNA fnunique fsum na_insert -#' @return A data.table containing the estimated treatment effects and standard errors. +#' @importFrom parallel mclapply +#' @importFrom BMisc multiplier_bootstrap +#' @importFrom parglm parglm.fit parglm.control +#' @return A data.table containing the estimated treatment effects and standard errors or a list of all results when `full == TRUE`. #' @export #' +#' @details +#' `balanced_event_time` is only meaningful when `result_type == "dynamic`. +#' +#' `result_type` as `group-group-time` and `dynamic staggered` is only meaningful when using double did. +#' +#' `biter` and `clustervar` is only used when `boot == TRUE`. +#' #' @examples #' # simulated data -#' simdt <- sim_did(1e+03, 10, cov = "cont", second_cov = TRUE, second_outcome = TRUE) +#' simdt <- sim_did(1e+02, 10, cov = "cont", second_cov = TRUE, second_outcome = TRUE, seed = 1) #' dt <- simdt$dt #' #' #basic call #' result <- fastdid(data = dt, timevar = "time", cohortvar = "G", #' unitvar = "unit", outcomevar = "y", #' result_type = "group_time") -#' -#' #control for covariates -#' result2 <- fastdid(data = dt, timevar = "time", cohortvar = "G", -#' unitvar = "unit", outcomevar = "y", -#' result_type = "group_time", -#' covariatesvar = c("x", "x2")) -#' -#' #bootstrap and clustering -#' result3 <- fastdid(data = dt, timevar = "time", cohortvar = "G", -#' unitvar = "unit", outcomevar = "y", -#' result_type = "group_time", -#' boot = TRUE, clustervar = "x") -#' -#' #estimate for multiple outcomes -#' result4 <- fastdid(data = dt, #the dataset -#' timevar = "time", cohortvar = "G", unitvar = "unit", -#' outcomevar = c("y", "y2"), #name of the outcome columns -#' result_type = "group_time") #' #' @keywords difference-in-differences fast computation panel data estimation did fastdid <- function(data, @@ -68,9 +65,10 @@ fastdid <- function(data, weightvar=NA,clustervar=NA, covariatesvar = NA, varycovariatesvar = NA, copy = TRUE, validate = TRUE, anticipation = 0, base_period = "universal", - exper = NULL){ - - # validation -------------------------------------------------------- + exper = NULL, full = FALSE, parallel = FALSE, + cohortvar2 = NA, event_specific = TRUE, double_control_option="both"){ + + # preprocess -------------------------------------------------------- if(!is.data.table(data)){ warning("coercing input into a data.table.") @@ -82,240 +80,47 @@ fastdid <- function(data, p <- as.list(environment()) #collect everything besides data p$data <- NULL p$dt <- NULL - p$exper <- get_exper_default(p$exper) + + exper_args <- c("filtervar", "filtervar_post", "only_balance_2by2", + "aggregate_scheme", "max_control_cohort_diff") + p$exper <- get_exper_default(p$exper, exper_args) + class(p) <- "locked" #no more changes! validate_argument(dt, p) - - # validate and throw away not legal data + #change name for main columns setnames(dt, c(timevar, cohortvar, unitvar), c("time", "G", "unit")) - dt <- validate_dt(dt, p) + if(!is.na(p$cohortvar2)){setnames(dt, p$cohortvar2, "G2")} - # preprocess ----------------------------------------------------------- + # validate and throw away not legal data + dt <- validate_dt(dt, p) #make dt conform to the WLOG assumptions of fastdid coerce_result <- coerce_dt(dt, p) #also changed dt - dt <- coerce_result$dt - p <- coerce_result$p # get auxiliary data - aux <- get_auxdata(dt, p) - - # main part ------------------------------------------------- + aux <- get_auxdata(coerce_result$dt, p) - gt_result_list <- estimate_gtatt(aux, p) + # main estimation ------------------------------------------------- - all_result <- aggregate_gt(gt_result_list, aux, p) - - #convert "targets" back to meaningful parameter identifiers like cohort 1 post, time 2 post - all_result <- convert_targets(all_result, p) - - return(all_result) - -} - -# small steps ---------------------------------------------------------------------- - -get_exper_default <- function(exper){ - exper_args <- c("filtervar", "min_dynamic", "max_dynamic", "min_control_cohort_diff", "max_control_cohort_diff") - for(arg in exper_args){ - if(is.null(exper[[arg]])){ - exper[[arg]] <- NA - } - } - return(exper) -} - -coerce_dt <- function(dt, p){ - - - #chcek if there is availble never-treated group - if(!is.infinite(dt[, max(G)]) & p$control_option != "notyet"){ - warning("no never-treated availble, switching to not-yet-treated control") - p$control_option <- "notyet" - } - - if(p$allow_unbalance_panel){ - dt_inv_raw <- dt[dt[, .I[1], by = unit]$V1] - setorder(dt_inv_raw, G) - dt_inv_raw[, new_unit := 1:.N] #let unit start from 1 .... N, useful for knowing which unit is missing - dt <- dt |> merge(dt_inv_raw[,.(unit, new_unit)], by = "unit") - dt[, unit := new_unit] - } - - setorder(dt, time, G, unit) #sort the dataset essential for the sort-once-quick-access - #deal with time, coerice time to 1,2,3,4,5....... - time_periods <- dt[, unique(time)] - time_size <- length(time_periods) - - time_offset <- min(time_periods) - 1 #assume time starts at 1, first is min after sort :) - if(time_offset != 0){ - dt[, G := G-time_offset] - dt[, time := time-time_offset] - time_periods <- time_periods - time_offset - } - - time_step <- 1 #time may not jump at 1 - if(any(time_periods[2:length(time_periods)] - time_periods[1:length(time_periods)-1] != 1)){ - time_step <- time_periods[2]-time_periods[1] - time_periods <- (time_periods-1)/time_step+1 - if(any(time_periods[2:length(time_periods)] - time_periods[1:length(time_periods)-1] != 1)){stop("time step is not uniform")} - dt[G != 1, G := (G-1)/time_step+1] - dt[time != 1, time := (time-1)/time_step+1] - } - - #add the information to p - p$time_step = time_step - p$time_offset = time_offset - - if(nrow(dt) == 0){ - stop("no data after coercing the dataset") - } - - return(list(dt = dt, p = p)) - -} - -get_auxdata <- function(dt, p){ - - time_periods <- dt[, unique(time)] - id_size <- dt[, uniqueN(unit)] - - setorder(dt, time, G, unit) #sort the dataset essential for the sort-once-quick-access - - #construct the outcomes list for fast access later - #loop for multiple outcome - outcomes_list <- list() - for(outcol in p$outcomevar){ - outcomes <- list() - - if(!p$allow_unbalance_panel){ - for(i in time_periods){ - start <- (i-1)*id_size+1 - end <- i*id_size - outcomes[[i]] <- dt[seq(start,end), get(outcol)] - } - } else { - - for(i in time_periods){ - #populate a outcome vector of length N with outcome data in the right place - #NA is data gone or missing, will be addressed in estimate_did_rc - outcome_period <- rep(NA, id_size) - data_pos <- dt[time == i, unit] #units observed in i - outcome_period[data_pos] <- dt[time == i, get(outcol)] - outcomes[[i]] <- outcome_period - } - - } - - outcomes_list[[outcol]] <- outcomes - } - - #the time-invariant parts - if(!p$allow_unbalance_panel){ - dt_inv <- dt[1:id_size] - } else { - dt_inv <- dt[dt[, .I[1], by = unit]$V1] #the first observation - setorder(dt_inv, unit) #can't move this outside - } - - cohorts <- dt_inv[, unique(G)] - cohort_sizes <- dt_inv[, .(cohort_size = .N) , by = G] - - # the optional columns - varycovariates <- list() - if(!allNA(p$varycovariatesvar)){ - for(i in time_periods){ - start <- (i-1)*id_size+1 - end <- i*id_size - varycovariates[[i]] <- dt[seq(start,end), .SD, .SDcols = p$varycovariatesvar] - } - } else { - varycovariates <- NA - } - - # filters - filters <- list() - if(!is.na(p$exper$filtervar)){ - for(i in time_periods){ - start <- (i-1)*id_size+1 - end <- i*id_size - filters[[i]] <- dt[seq(start,end), .SD, .SDcols = p$exper$filtervar] - } - } else { - filters <- NA - } + gt_result_list <- estimate_gtatt(aux, p) + agg_result <- aggregate_gt(gt_result_list, aux, p) - if(!allNA(p$covariatesvar)){ - covariates <- dt_inv[,.SD, .SDcols = p$covariatesvar] - } else { - covariates <- NA - } + # post process ------------------------------------------- - if(!is.na(p$clustervar)){ - cluster <- dt_inv[, .SD, .SDcols = p$clustervar] |> unlist() - } else { - cluster <- NA - } + #convert "targets" back to meaningful parameters + est_results <- convert_targets(agg_result$est, p, coerce_result$t) - if(!is.na(p$weightvar)){ - weights <- dt_inv[, .SD, .SDcols = p$weightvar] |> unlist() + if(!p$full){ + return(est_results) } else { - weights <- rep(1, id_size) + full_result <- list( + call = p, + estimate = est_results, + gt_estimate = gt_result_list, + agg_inf_func = agg_result$inf_func, + agg_weight_matrix = agg_result$agg_weight_matrix + ) + class(full_result) <- c("fastdid_result", class(full_result)) + return(full_result) } - - aux <- list(time_periods = time_periods, - id_size = id_size, - outcomes_list = outcomes_list, - dt_inv= dt_inv, - cohorts = cohorts, - cohort_sizes = cohort_sizes, - varycovariates = varycovariates, - covariates = covariates, - cluster = cluster, - weights = weights, - filters = filters) - - return(aux) - -} - -convert_targets <- function(results, p){ - - result_type <- p$result_type - - if(result_type == "dynamic"){ - setnames(results, "target", "event_time") - - } else if (result_type == "cohort"){ - - results[, type := ifelse(target >= 0, "post", "pre")] - results[, target := recover_time(abs(target), p$time_offset, p$time_step)] - setnames(results, "target", "cohort") - - } else if (result_type == "calendar"){ - - results[, type := ifelse(target >= 0, "post", "pre")] - results[, target := recover_time(abs(target), p$time_offset, p$time_step)] - setnames(results, "target", "time") - - } else if (result_type == "group_time"){ - - results[, cohort := as.numeric(str_split_i(target, "\\.", 1))] - results[, time := as.numeric(str_split_i(target, "\\.", 2))] - - #recover the time - results[, cohort := recover_time(cohort, p$time_offset, p$time_step)] - results[, time := recover_time(time, p$time_offset, p$time_step)] - - results[, target := NULL] - - } else if (result_type == "simple") { - results[, type := ifelse(target >= 0, "post", "pre")] - results[, target := NULL] - } - return(results) -} - -recover_time <- function(time, time_offset, time_step){ - return(((time-1)*time_step)+1+time_offset) } \ No newline at end of file diff --git a/R/generics.R b/R/generics.R new file mode 100644 index 0000000..014a663 --- /dev/null +++ b/R/generics.R @@ -0,0 +1,57 @@ +#' Plot event study +#' +#' Plot event study results. +#' +#' @param x A data table generated with [fastdid] with one-dimensional index. +#' @param margin character, the x-axis of the plot +#' +#' @return A ggplot2 object +#' @examples +#' +#' # simulated data +#' simdt <- sim_did(1e+02, 10, seed = 1) +#' dt <- simdt$dt +#' +#' #estimation +#' result <- fastdid(data = dt, timevar = "time", cohortvar = "G", +#' unitvar = "unit", outcomevar = "y", +#' result_type = "dynamic") +#' +#' #plot +#' plot_did_dynamics(result) +#' +#' @export +plot_did_dynamics <-function(x, margin = "event_time"){ + + #find the base_period + if(margin == "event_time"){ + et_range <- min(x[, event_time]):max(x[, event_time]) + base_time <- et_range[!et_range %in% x[, unique(event_time)]] + if(length(base_time)!=1){stop("missing more than one period")} + + #add the base period + if("outcome" %in% names(x)){ + base_row <- data.table(att = 0, se = 0, event_time = base_time, outcome = x[, unique(outcome)], att_ciub = 0, att_cilb = 0) + } else { + base_row <- data.table(att = 0, se = 0, event_time = base_time, att_ciub = 0, att_cilb = 0) + } + x <- x |> rbind(base_row, fill = TRUE) + } else { + x <- x[type == "post"] + } + + plot <- x |> + ggplot() + + geom_hline(yintercept = 0, linetype = 'dashed', col = 'red') + + geom_point( aes(x = eval(str2lang(margin)), y = att), color = "black") + #point est + geom_errorbar( aes(x = eval(str2lang(margin)), ymin = att_cilb, ymax = att_ciub), + width = 0.1, linetype = "dashed") + #CI + labs(x = margin) + + if(margin == "event_time"){ + plot <- plot + geom_line( aes(x = eval(str2lang(margin)), y = att), color = "black") #point est + } + + return(plot) + +} \ No newline at end of file diff --git a/R/global.R b/R/global.R index f54e790..b065fc4 100644 --- a/R/global.R +++ b/R/global.R @@ -9,5 +9,8 @@ utils::globalVariables(c('.','agg_weight','att','att_cont','att_treat','attgt',' 'V1','att_cont_post','att_cont_pre','att_treat_post','att_treat_pre','inpost','inpre','max_et','min_et','new_unit','or_delta','or_delta_post','or_delta_pre','targeted','used', "timevar", "cohortvar", "unitvar", "outcomevar", "control_option", "result_type", "balanced_event_time", "control_type", "allow_unbalance_panel", "boot", "biters", "weightvar", "clustervar", "covariatesvar", "varycovariatesvar", "filtervar", - "copy", "validate", "max_control_cohort_diff", "anticipation", "min_control_cohort_diff", "base_period", "post", "att_ciub", "att_cilb", "cband", "alpha")) + "copy", "validate", "max_control_cohort_diff", "anticipation", "min_control_cohort_diff", "base_period", "post", "att_ciub", "att_cilb", "cband", "alpha", + "G2", "G1", "mg", "cohort1", "cohort2", "event_time_1", "event_time_2", + "D2", "attgt2", "event", "atu2", "y01", "y10", "y11", "tau2", "parallel", + "tp", "cp", "tb", "cb", "no_na", "event_stagger", "double_control_option")) diff --git a/R/plot_event_dynamics.R b/R/plot_event_dynamics.R deleted file mode 100644 index 2a10919..0000000 --- a/R/plot_event_dynamics.R +++ /dev/null @@ -1,56 +0,0 @@ -#' Create an event study plot for Difference-in-Differences (DiD) analysis. -#' -#' This function generates an event study plot based on the results of a DiD analysis. -#' -#' @param dt A data table containing the results of the DiD analysis. It should include columns for 'att' (average treatment effect), 'se' (standard error), and 'event_time' (time points). -#' @param graphname A character string specifying the title of the plot (default is "event study plot"). -#' @param note A character string for adding additional notes or comments to the plot (default is empty). -#' -#' @return A ggplot2 object representing the event study plot. -#' @export - -plot_did_dynamics <-function(dt, - graphname = "event study plot", note = "" -){ - - if (!requireNamespace("ggplot2", quietly = TRUE)) { - warning("The ggplot2 package must be installed to use plotting functions") - #Either exit or do something without rgl - return(NULL) - } - - - #find the base_period - et_range <- min(dt[, event_time]):max(dt[, event_time]) - base_time <- et_range[!et_range %in% dt[, unique(event_time)]] - if(length(base_time)!=1){stop("missing more then one period")} - - #add the base period - if("outcome" %in% names(dt)){ - base_row <- data.table(att = 0, se = 0, event_time = base_time, outcome = dt[, unique(outcome)], att_ciub = 0, att_cilb = 0) - } else { - base_row <- data.table(att = 0, se = 0, event_time = base_time, att_ciub = 0, att_cilb = 0) - } - - #dt <- dt |> add_base_row(base_row) - dt <- dt |> rbind(base_row, fill = TRUE) - - - figure <- dt |> - ggplot2::ggplot() + - ggplot2::geom_hline(yintercept = 0, linetype = 'dashed', col = 'red') - - figure <- figure + ggplot2::geom_line(ggplot2::aes(x = event_time, y = att), color = "black") + - ggplot2::geom_point(ggplot2::aes(x = event_time, y = att), color = "black") + - ggplot2::geom_errorbar(ggplot2::aes(x = event_time, ymin = att_cilb, ymax = att_ciub), - width = 0.1, linetype = "dashed") - - if("outcome" %in% names(dt)){ - figure <- figure + ggplot2::facet_wrap(~outcome, scales = "free") - } - - figure <- figure + ggplot2::theme_classic() - - return(figure) - -} diff --git a/R/sim_did.R b/R/sim_did.R index 8a43e13..85b2a03 100644 --- a/R/sim_did.R +++ b/R/sim_did.R @@ -16,6 +16,9 @@ #' @param stratify Whether to stratify the dataset based on a binary covariate. #' @param treatment_assign The method for treatment assignment ("latent" or "uniform"). #' @param vary_cov include time-varying covariates +#' @param second_cohort include confounding events +#' @param confound_ratio extent of event confoundedness +#' @param second_het heterogeneity of the second event #' #' @return A list containing the simulated dataset (dt) and the treatment effect values (att). #' @@ -23,13 +26,10 @@ #' # Simulate a DiD dataset with default settings #' data <- sim_did(sample_size = 100, time_period = 5) #' -#' # Simulate a DiD dataset with customized settings -#' data <- sim_did(sample_size = 200, time_period = 8, cov = "int", hetero = "dynamic") -#' #' @export sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size = 0.001, cov = "no", hetero = "all", second_outcome = FALSE, second_cov = FALSE, vary_cov = FALSE, na = "none", - balanced = TRUE, seed = NA, stratify = FALSE, treatment_assign = "latent"){ + balanced = TRUE, seed = NA, stratify = FALSE, treatment_assign = "latent", second_cohort = FALSE, confound_ratio = 1, second_het = "all"){ if(!is.na(seed)){set.seed(seed)} @@ -58,7 +58,8 @@ sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size #assign treated group based on a latent related to X if(treatment_assign == "latent"){ - dt_i[, treat_latent := x*0.2 + x2*0.2 + rnorm(sample_size)] #unit with larger X tend to be treated and treated earlier + ep1 <- rnorm(sample_size) + dt_i[, treat_latent := x*0.2 + x2*0.2 + ep1] #unit with larger X tend to be treated and treated earlier untreated_thres <- quantile(dt_i$treat_latent, untreated_prop) dt_i[treat_latent <= untreated_thres, G := Inf] #unit with low latent is never treated @@ -75,6 +76,28 @@ sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size dt_i[,G := floor((unit-1)/(sample_size/time_period))] dt_i[G < 2, G := Inf] } + + if(second_cohort){ + setnames(dt_i, "G", "G2") + if(treatment_assign == "latent"){ + dt_i[, treat_latent := x*0.2 + x2*0.2 + ep1*confound_ratio + rnorm(sample_size)] #unit with larger X tend to be treated and treated earlier + untreated_thres <- quantile(dt_i$treat_latent, untreated_prop) + dt_i[treat_latent <= untreated_thres, G := Inf] #unit with low latent is never treated + + cohort_prop <- (1-untreated_prop)/(time_period-1) + last_treat_thres <- untreated_thres + for(t in time_period:2){ #unit start getting treated in t = 2 + treat_thres <- quantile(dt_i$treat_latent, untreated_prop + cohort_prop*(time_period - t + 1)) + dt_i[treat_latent <= treat_thres & treat_latent > last_treat_thres, G := t] + last_treat_thres <- treat_thres + } + rm(t) + } else if (treatment_assign == "uniform"){ + #when treatment is set to 'uniform', untreated propensity is fixed + dt_i[,G := floor((unit-1)/(sample_size/time_period))] + dt_i[G < 2, G := Inf] + } + } #assign unit FE dt_i[, unit_fe := rnorm(sample_size)] @@ -90,11 +113,9 @@ sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size dt <- dt |> merge(dt_i, by = "unit") dt <- dt |> merge(dt_t, by = "time") - dt[, D := as.integer(time >= G)] - #add time_varying covariates if(vary_cov){ - dt[, xvar := pmin(G, time_period+4)*time+rnorm(sample_size*time_period, 0,30)] #should be confounding....? + dt[, xvar := pmin(G, time_period+4)*time^(1/3)*0.1+rnorm(sample_size*time_period, 0,10)] #should be confounding....? } else { dt[, xvar := 1] } @@ -102,6 +123,7 @@ sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size #untreated potential outcomes dt[, y0 := unit_fe + time_fe + (x+x2)*x_trend + xvar + rnorm(sample_size*time_period, sd = epsilon_size)] + dt[, D := as.integer(time >= G)] #generate gtatt att <- CJ(G = 1:time_period, time = 1:time_period) if(hetero == "all"){ @@ -112,15 +134,51 @@ sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size } } att[time < G, attgt := 0] #no anticipation - dt <- dt |> merge(att, by = c("time", "G"), all.x = TRUE, all.y = FALSE) dt[is.na(attgt), attgt := 0] dt[, tau := attgt*s] - #potential outcome - dt[, y1 := y0 + tau] - dt[, y := y1*D + y0*(1-D)] - dt <- dt[, .SD, .SDcols = c("time", "G", "unit", "x", "x2", "y", "s", "xvar")] + if(second_cohort){ + dt[, D2 := as.integer(time >= G2)] + #generate gtatt + att2 <- CJ(G2 = 1:time_period, time = 1:time_period) + if(second_het == "no"){ + att2[, attgt2 := 10] + } else { + if(hetero == "all"){ + att2[, attgt2 := rnorm(time_period*time_period, mean = 2, sd = 1)] + } else if (hetero == "dynamic"){ + for(event_t in 0:max(att2[,time-G2])){ + att2[time - G2 == event_t, attgt2 := rnorm(1, mean = 2, sd = 1)] + } + } + } + att2[time < G2, attgt2 := 0] #no anticipation + + #add att2 to att + att[, event := 1] + att2[, event := 2] + att <- rbind(att, att2, fill = TRUE) + + dt <- dt |> merge(att2, by = c("time", "G2"), all.x = TRUE, all.y = FALSE) + dt[is.na(attgt2), attgt2 := 0] + dt[, tau2 := attgt2*s] + + #potential outcome + dt[, y10 := y0 + tau] + dt[, y01 := y0 + tau2] + dt[, y11 := y0 + tau + tau2] + dt[, y := y0*(1-D)*(1-D2)+ y10*D*(1-D2) + y01*(1-D)*D2 + y11*D*D2] + cols <- c("time", "G", "G2", "unit", "x", "x2", "y", "s", "xvar") + + } else { + #potential outcome + dt[, y1 := y0 + tau] + dt[, y := y1*D + y0*(1-D)] + cols <- c("time", "G", "unit", "x", "x2", "y", "s", "xvar") + } + + dt <- dt[, .SD, .SDcols = cols] #additional ----------------- diff --git a/R/utils.R b/R/utils.R deleted file mode 100644 index 2a855a5..0000000 --- a/R/utils.R +++ /dev/null @@ -1,8 +0,0 @@ -set_max_thread <- function(){ - data.table::setDTthreads(0) - options(kit.nThread = getDTthreads()) -} - -reverse_col <- function(x){ - return(x[,ncol(x):1]) -} diff --git a/R/validate.R b/R/validate.R index b865db3..862e105 100644 --- a/R/validate.R +++ b/R/validate.R @@ -21,22 +21,22 @@ validate_argument <- function(dt, p){ checkvar_message <- "__ARG__ must be NA or a character scalar if a name of columns from the dataset." check_set_arg(weightvar, clustervar, "NA | match", .choices = dt_names, .message = checkvar_message, .up = 1) - check_set_arg(control_option, "match", .choices = c("both", "never", "notyet"), .up = 1) #kinda bad names since did's notyet include both notyet and never + check_set_arg(control_option, double_control_option, "match", .choices = c("both", "never", "notyet"), .up = 1) #kinda bad names since did's notyet include both notyet and never check_set_arg(control_type, "match", .choices = c("ipw", "reg", "dr"), .up = 1) check_set_arg(base_period, "match", .choices = c("varying", "universal"), .up = 1) - check_arg(copy, validate, boot, allow_unbalance_panel, cband, "scalar logical", .up = 1) + check_arg(copy, validate, boot, allow_unbalance_panel, cband, parallel, "scalar logical", .up = 1) check_arg(anticipation, alpha, "scalar numeric", .up = 1) if(!is.na(balanced_event_time)){ if(result_type != "dynamic"){stop("balanced_event_time is only meaningful with result_type == 'dynamic'")} check_arg(balanced_event_time, "numeric scalar", .up = 1) } - if(allow_unbalance_panel == TRUE & control_type %in% c("dr", "reg")){ - stop("fastdid currently only supprts ipw when allowing for unbalanced panels.") - } - if(allow_unbalance_panel == TRUE & !allNA(varycovariatesvar)){ - stop("fastdid currently only supprts time varying covariates when allowing for unbalanced panels.") + if(allow_unbalance_panel == TRUE & control_type == "dr"){ + stop("fastdid does not support DR when allowing for unbalanced panels.") } + # if(allow_unbalance_panel == TRUE & !allNA(varycovariatesvar)){ + # stop("fastdid currently only supprts time varying covariates when not allowing for unbalanced panels.") + # } if(any(covariatesvar %in% varycovariatesvar) & !allNA(varycovariatesvar) & !allNA(covariatesvar)){ stop("time-varying var and invariant var have overlaps.") } @@ -44,6 +44,11 @@ validate_argument <- function(dt, p){ stop("clustering and uniform confidence interval only available with bootstrap") } + if(parallel){ + if(.Platform$OS.type != "unix"){stop("parallel option only available on unix sysytems")} + if(!requireNamespace("parallel")){stop("parallel requires the parallel package")} + } + # varname collision varnames <- unlist(p[str_subset(names(p), "var")]) varnames <- varnames[!is.na(varnames)] @@ -53,7 +58,7 @@ validate_argument <- function(dt, p){ validate_dt <- function(dt, p){ varnames <- unlist(p[str_ends(names(p), "var")], recursive = TRUE) #get all the argument that ends with "var" - varnames <- varnames[!varnames %in% c(p$timevar, p$unitvar, p$cohortvar)] + varnames <- varnames[!varnames %in% c(p$timevar, p$unitvar, p$cohortvar) & !is.na(varnames) & !is.null(varnames)] #change to int uniquecols <- c("G", "time", "unit") @@ -69,13 +74,14 @@ validate_dt <- function(dt, p){ if(p$balanced_event_time > dt[, max(time-G)]){stop("balanced_event_time is larger than the max event time in the data")} } - #doesn't allow missing value for now - for(col in varnames){ - if(is.na(col)){next} - na_obs <- whichNA(dt[[col]]) - if(length(na_obs) != 0){ - warning("missing values detected in ", col, ", removing ", length(na_obs), " observation.") - dt <- dt[!na_obs] + #doesn't allow missing value + if(is.na(p$exper$only_balance_2by2) | !p$exper$only_balance_2by2){ + for(col in varnames){ + na_obs <- whichNA(dt[,get(col)]) + if(length(na_obs) != 0){ + warning("missing values detected in ", col, ", removing ", length(na_obs), " observation.") + dt <- dt[!na_obs] + } } } @@ -89,6 +95,7 @@ validate_dt <- function(dt, p){ if(is.na(cov)){next} #check covaraites is not constant if(fnunique(dt[, get(cov)[1], by = "unit"][, V1]) == 1)stop(cov, " have no variation") + if(!(is.numeric(dt[, get(cov)])|is.integer(dt[, get(cov)]))){stop(cov, " is not numeric or integer, do not support fixed effects.")} } } @@ -109,6 +116,24 @@ validate_dt <- function(dt, p){ } } + # drop always_treated units + always_treated <- dt[G <= min(time), unique(unit)] + if(length(always_treated) > 0){ + warning(length(always_treated), " units is treated in the first period, dropping them") + dt <- dt[!unit %in% always_treated] + } + + #for double did part + if(!is.na(p$cohortvar2)){ + always_treated <- dt[G2 <= min(time), unique(unit)] + if(length(always_treated) > 0){ + warning(length(always_treated), " units is treated in the first period, dropping them") + dt <- dt[!unit %in% always_treated] + } + } + + + return(dt) } \ No newline at end of file diff --git a/README.md b/README.md index 512ef95..56dc457 100644 --- a/README.md +++ b/README.md @@ -1,197 +1,42 @@ -# fastdid - fast Difference-in-Differences +# fastdid [![R-CMD-check](https://github.com/TsaiLintung/fastdid/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/TsaiLintung/fastdid/actions/workflows/R-CMD-check.yaml) + [![codecov](https://codecov.io/github/TsaiLintung/fastdid/graph/badge.svg?token=0EYF1DIBRF)](https://codecov.io/github/TsaiLintung/fastdid) -**fastdid** is a lightning-fast implementation of [Callaway and Sant'Anna's (2021)](https://www.sciencedirect.com/science/article/pii/S0304407620303948) staggered Difference-in-Differences (DiD) estimators. DiD setup for millions of units used to take hours to run. With **fastdid**, it takes seconds. +fastdid implements the Difference-in-Differences (DiD) estimators in [Callaway and Sant'Anna's (2021)](https://www.sciencedirect.com/science/article/pii/S0304407620303948), it is -To learn more about the staggered Difference-in-differences estimators implemented, visit Callaway and Sant'Anna's [website](https://bcallaway11.github.io/did/articles/did-basics.html). + - fast, reducing the computation time with millions of units from hours to [seconds](articles/misc.html), + - flexible, with extensions such as [time-varying covariates](https://arxiv.org/abs/2406.15288) and [multiple events](not_ready_yet) (coming soon!). -# Installation +# Getting Started -You can install **fastdid** from GitHub. +fastdid can be installed from GitHub. ``` # install.packages("devtools") devtools::install_github("TsaiLintung/fastdid") ``` -# Usage +To use `fastdid`, you need to provide the dataset `data`, the column name of time `timevar`, cohort `cohortvar`, unit `unitvar`, and outcome(s) `outcomevar`. Here is a simple call: -`fastdid` is the main function provided by **fastdid**. When using `fastdid`, you need to provide the dataset (`data`), specify the names of the relevant columns (`-var`), and the type of target (aggregated) parameters (`result_type` such as `"group_time"`, `"time"`, `"dynamic"`, or `"simple"`.) Here is a simple call. - -``` -#loading the package -library(fastdid) - -#generate simulated data -simdt <- sim_did(1e+03, 10, cov = "cont", second_cov = TRUE, second_outcome = TRUE) -dt <- simdt$dt - -#calling fastdid -result <- fastdid(data = dt, #the dataset - timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", #name of the columns - result_type = "group_time") #the result type -``` - -You can control for covariates by providing the name of the data columns, and choose the method among doubly-robust (`"dr"`), inverse probability weight (`"ipw"`), and outcome regression (`"or"`). - -``` -result <- fastdid(data = dt, - timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", - result_type = "group_time", - control_option = "dr", #choose the control method - covaraitesvar = c("x", "x2")) #add covariates ``` - -While the default is to coerce the data into a balanced panel, you can allow for unbalanced panel. Note that currently only "ipw" is available when `allow_unbalance_panel = TRUE`. - -``` -result <- fastdid(data = dt, - timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", - result_type = "group_time", - allow_unbalance_panel = TRUE, #allow for unbalanced panel - covaraitesvar = c("x", "x2")) -``` - - -Clustered standard error can be obtained from multiplier bootstrap. - +library(fastdid) #loading the package +did_sim <- sim_did(1e+03, 10) #simulate some data +did_estimate <- fastdid(data = did_sim$dt, timevar = "time", + cohortvar = "G", unitvar = "unit", outcomevar = "y") ``` -result <- fastdid(data = dt, - timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", - result_type = "group_time", - clustervar = "x", boot = TRUE) #add clustering by using bootstrap -``` - -Estimation for multiple outcomes can be done in one call by providing a vector of outcome column names (saves a lot of time when controlling for covariates since logit estimates can be recycled across outcomes). - -``` -#calling fastdid -result <- fastdid(data = dt, #the dataset - timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = c("y", "y2"), #name of the columns - result_type = "group_time") #the result type -``` - -# Performance - -**fastdid** is magnitudes faster than **did**, and 15x faster than the fastest alternative **DiDforBigData** for large dataset. - -Here is a comparison of run time for **fastdid**, **did**, and **DiDforBigData** (dfbd for short) using a panel of 10 periods and varying samples sizes. - -![time comparison](https://i.imgur.com/s5v32Rw.png) - -Unfortunately, the Author's computer fails to run **did** at 1 million sample. For a rough idea, **DiDforBigData** is about 100x faster than **did** in Bradley Setzler's [benchmark](https://setzler.github.io/DiDforBigData/articles/Background.html). Other staggered DiD implementations are even slower than **did**. - -**fastdid** also uses less memory. - -![RAM comparison](https://i.imgur.com/7emkgOz.png) - -For the benchmark, a baseline group-time ATT is estimated with no covariates control, no bootstrap, and no explicit parallelization. Computing time is measured by `microbenchmark` and peak RAM by `peakRAM`. - -# **fastdid** and **did** - -As the name suggests, **fastdid**'s goal is to be fast **did**. Besides performance, here are some comparisons between the two packages. - -## Estimates - -**fastdid**'s estimators is identical to **did**'s. As the performance gains mostly come from efficient data manipulation, the key estimation implementations are analogous. For example, 2x2 DiD (`estimate_did.R` and `DRDID::std_ipw_did_panel`), influence function from weights (`aggregate_gt.R/get_weight_influence`, `compute.aggte.R/wif`), and multiplier bootstrap (`get_se.R` and `mboot.R`). - -Therefore, the estimates are practically identical. For point estimates, the difference is negligible (smaller than 1e-12), and is most likely the result of [floating-point error](https://en.wikipedia.org/wiki/Floating-point_error_mitigation). For standard errors, the estimates can be slightly different sometimes, but the difference never exceeds 1\% of **did**'s standard error estimates. - -## Interface - -**fastdid** should feel very similar to `att_gt`. But there are a few differences: - -Control group option: -| fastdid | did | control group used | -|-|-|-| -| both | notyettreated | never-treated + not-yet-but-eventually-treated | -| never| nevertreated | never-treated | -| notyet | | not-yet-but-eventually-treated | - -Aggregated parameters: `fastdid` aggregates in the same function. -| fastdid | did | -|-|-| -| group_time | no aggregation | -|dynamic|dynamic| -|time|calendar| -|group|group| -|simple|simple| - -## Other - -1. **fastdid** only offers inverse probability weights estimators for controlling for covariates when allowing for unbalanced panels. -2. **fastdid** use universal base periods as default. -4. **fastdid** only reports the pointwise confidence intervals, instead of the simultaneously valid confidence intervals (check section 4.1 of Callaway and Sant'Anna's (2021) for more detail.) - -# Roadmap - -**fastdid** is still in active development. Many features are planned to be added: - -- Multiple outcomes :white_check_mark: -- Min/max event time and balanced composition :white_check_mark: -- DR and OR estimators :white_check_mark: -- Allowing for unbalanced panels :white_check_mark: (*well, not fully because DR and OR still need to be added*) -- Anticipation :white_check_mark: -- Varying base periods :white_check_mark: -- Simultaneously valid confidence bands :white_check_mark: -- User-provided aggregation scheme -- User-provided control formula -- Further optimization - -# Source version - -Since **fastdid** is not on CRAN yet, it needs to be converted to R scripts to be used in some restricted environments. This can be done with `development/build_source.R`. After changing the working directory, the script will produce `development/fastdid_VERNAME.R`, which can be sourced to mimic the functionalities of the package. - -# Experimental features - -`fastdid` is intended to be a fast implementation of `did` and nothing more. However, as the list of features grows, it has also become a more flexible version of `did`. While such flexibility can be great, these features often lack proof of validity and can easily be misused. - -As an attempt to balance the validity and flexibility of `fastdid`, "experimental features" is introduced in version 0.9.4. These features will be less tested and documented, and it is generally advised to not use them unless the user know what they and the package are doing. These experimental features can be accessed via the `exper` argument. For example, to use the `filtervar` feature, call `fastdid(..., exper = list(filtervar = "FF"))`. - -The current list of experimental features are - -- `max`/`min_control_cohort_diff`: limit the max/min cohort difference between treated and control group -- `filtervar`: limit the units being used as treated and control group with a potentially-time-varying variable -- `max`/`min_dynamic`: limit the group-time combinations `fastdid` estimates by the range of `t-g`. - -# Update - -## 0.9.4 (2024/8/2) - -> [!WARNING] -> Some BREAKING change is introduced in this update. - -- add uniform confidence interval option with `cband` and significance level `alpha`, confidence interval are now provided in result as column `att_ciub` and `att_cilb` -- BREAKING: `filtervar`, `max_control_cohort_diff`, `min_control_cohort_diff` are moved into the experimental features. See the above section for the explanation. -- add `max_dynamic` and `min_dynamic` as experimental features. -- more informative error message when estimation fails for a specific `gt`, some internal interface overhaul - -## 0.9.3 (2024/5/7) - -- add anticipation and varying base period option -- add min and max control cohort difference -- add time-varying control ([reference](https://arxiv.org/abs/2202.02903)) -- add filtervar - -0.9.3.1 (2024/5/24): fix the bug with `univar == clustervar` (TODO: address problems with name-changing and collision). -0.9.3.2 (2024/7/17): fix group_time result when using `control_type = "notyet"` and make the base period in plots adapt to anticipation. -0.9.3.3 (2024/7/22): fix anticipation out of bound problem, more permanent solution for group_time target problem - -## 0.9.2 (2023/12/20) +The function returns a `data.table` that includes the estimates. Column `att` is the point estimate, `se` the standard error of the estimate, `att_ciub` and `att_cilb` the confidence interval. The other columns indexes the estimated parameter. -- add support to doubly robust and outcome regression estimators -- add support to unbalanced panels (simple and ipw only) -- add support to balanced composition option in dynamics aggregation -- fixed argument checking that was not working properly -- set the default to copying the entire dataset to avoid unexpected modification of the original data (thanks @grantmcdermott for the suggestion.) +To create event study plots, use `plot_did_dynamics(did_estimate)`. -## 0.9.1 (2023/10/20) +# More -- now supprts estimation for multiple outcomes in one go! -- data validation: no longer check missing values for columns not used. + - [did](https://bcallaway11.github.io/did/articles/did-basics.html): staggered Difference in Difference by Callaway and Sant'Anna + - [fastdid](../reference/fastdid.html): full list of arguments and features. + - [double](articles/double.html): introduction to DiD with multiple events. + - [misc](articles/misc.html): comparison with [did](https://github.com/bcallaway11/did), benchmark, tests, and experimental features. # Acknowledgments diff --git a/_pkgdown.yml b/_pkgdown.yml new file mode 100644 index 0000000..f8961ba --- /dev/null +++ b/_pkgdown.yml @@ -0,0 +1,4 @@ +url: https://tsailintung.github.io/fastdid/ +template: + bootstrap: 5 + diff --git a/cran-comments.md b/cran-comments.md new file mode 100644 index 0000000..62c010f --- /dev/null +++ b/cran-comments.md @@ -0,0 +1,5 @@ +## R CMD check results + +0 errors | 0 warnings | 0 note + +* This is a new release. diff --git a/development/did.R b/development/did.R index e36303c..19306f6 100644 --- a/development/did.R +++ b/development/did.R @@ -1,55 +1,33 @@ library(did) -library(fastdid) +library(data.table) rm(list = ls());gc() -data(mpdta) -mpdta$x <- rnorm(nrow(mpdta)) +#suspected problem: the unbalanced panel OR influence function calculation (DRDID::drdid_rc.R, 230, should be post - pre - or), since control function is subtracted -#the unbalanced panel OR influence function calculation (DRDID::drdid_rc.R, 230, should be post - pre - or), since control function is subtracted - -# did --------- - -# repeated cross section with -out <- att_gt(yname="lemp", - tname="year", - idname="countyreal", - gname="first.treat", - data=mpdta, - bstrap = FALSE, - allow_unbalanced_panel = TRUE) -agg <- aggte(out, type = "simple") -agg$overall.se - -out <- att_gt(yname="lemp", - tname="year", - idname="countyreal", - gname="first.treat", - data=mpdta, - bstrap = FALSE, - allow_unbalanced_panel = FALSE) -agg <- aggte(out, type = "simple") -agg$overall.se +#data +dt <- data.table( + time = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), + G = c(2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, Inf, Inf, Inf, Inf, Inf, Inf, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, Inf, Inf, Inf, Inf, Inf, Inf, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, Inf, Inf, Inf, Inf, Inf, Inf), + unit = c(6, 9, 11, 13, 14, 16, 17, 1, 3, 4, 8, 10, 15, 19, 2, 5, 7, 12, 18, 20, 6, 9, 11, 13, 14, 16, 17, 1, 3, 4, 8, 10, 15, 19, 2, 5, 7, 12, 18, 20, 6, 9, 11, 13, 14, 16, 17, 1, 3, 4, 8, 10, 15, 19, 2, 5, 7, 12, 18, 20), + x = c(4, 5, 4, 4, 4, 4, 1, 1, 4, 2, 1, 2, 4, 5, 1, 4, 3, 1, 2, 4, 4, 5, 4, 4, 4, 4, 1, 1, 4, 2, 1, 2, 4, 5, 1, 4, 3, 1, 2, 4, 4, 5, 4, 4, 4, 4, 1, 1, 4, 2, 1, 2, 4, 5, 1, 4, 3, 1, 2, 4), + y = c(-3.18649152, -3.25869527, -1.64841563, -2.43408372, -0.09610498, -1.65460665, 1.81631785, -0.53701069, -2.98435867, 0.50109073, 0.20436527, -0.24096772, -1.50765524, -3.44179477, 0.56591236, -3.48124692, -2.36647953, 2.25948062, -1.02782941, -0.82430706, 0.72018109, 1.34004336, 2.25809458, 1.47164831, 3.80850612, 2.25153806, 3.64015287, -0.02881929, -0.39501198, 1.70162601, 0.71358084, 0.96240644, 1.08049450, -0.16098213, 1.07408065, -0.89082004, -0.47072410, 2.76987894, 0.17469733, 1.76498883, -5.71884992, -6.37625509, -4.18192234, -4.96832686, -2.63206375, -4.18432324, 1.03219718, -2.16749889, -6.36580419, -1.71299118, -1.42506617, -2.45314625, -4.88839472, -7.40582661, -2.24368809, -8.03861697, -6.34331587, -0.55011068, -4.42010676, -5.38291640) +) +# did with for(est in c("dr", "ipw", "reg")){ for(p in c(FALSE, TRUE)){ - out <- att_gt(yname="lemp", - tname="year", - idname="countyreal", - gname="first.treat", - xformla=~x, - data=mpdta, - allow_unbalanced_panel = p, - bstrap = FALSE, - est_method = est) + out <- att_gt(yname="y", + tname="time", + idname="unit", + gname="G", + xformla=~x, + data=dt, + allow_unbalanced_panel = p, + bstrap = FALSE, + est_method = est) agg <- aggte(out, type = "simple") message("se for allow_unbalance_panel = ", p, " est = ", est, ": ", agg$overall.se) rm(out, agg) } -} - -# fastdid ------------ -mpdta <- as.data.table(mpdta) -setnames(mpdta, "first.treat", "firsttreat") -result <- fastdid(mpdta, timevar = "year", cohortvar = "firsttreat", unitvar = "countyreal",outcomevar = "lemp", result_type = "simple", - allow_unbalance_panel = FALSE, covariatesvar = "x") +} \ No newline at end of file diff --git a/development/fastdid_0_9_4.R b/development/fastdid_0_9_9.R similarity index 55% rename from development/fastdid_0_9_4.R rename to development/fastdid_0_9_9.R index 1b496ce..5746bc4 100644 --- a/development/fastdid_0_9_4.R +++ b/development/fastdid_0_9_9.R @@ -1,75 +1,134 @@ -#2024-08-06 -message('loading fastdid source ver. ver: 0.9.4 date: 2024-08-06') +#2024-08-22 +message('loading fastdid source ver. ver: 0.9.9 date: 2024-08-22') require(data.table); require(stringr); require(BMisc); require(collapse); require(dreamerr); require(parglm); +# high level ------------------------------------------------------------------- + aggregate_gt <- function(all_gt_result, aux, p){ - rbindlist(lapply(all_gt_result, aggregate_gt_outcome, aux, p)) + results <- lapply(all_gt_result, aggregate_gt_outcome, aux, p) + return(list( + est = rbindlist(lapply(results, function(x) {x$result})), + inf_func = lapply(results, function(x) {x$inf_func}), + agg_weight_matrix = lapply(results, function(x) {x$weight_matrix}) + )) } aggregate_gt_outcome <- function(gt_result, aux, p){ - + + #get aggregation scheme from g-t to target parameters agg_sch <- get_agg_sch(gt_result, aux, p) + + if(!p$event_specific | is.na(p$cohortvar2)){ + att <- gt_result$att + inf_func <- gt_result$inf_func %*% t(agg_sch$agg_weights) + } else { + att <- agg_sch$es_weight %*% gt_result$att + inf_func <- gt_result$inf_func %*% t(agg_sch$agg_weights %*% agg_sch$es_weight) + } - #get att and se - agg_att <- get_agg_att(gt_result, agg_sch, p) - inf_matrix <- get_agg_inf(gt_result, agg_sch, aux, p) - agg_se_result <- get_se(inf_matrix, aux, p) + #get att + agg_att <- agg_sch$agg_weights %*% att + + #get influence function matrix + + inf_weights <- get_weight_influence(att, agg_sch, aux, p) + inf_matrix <- inf_func + inf_weights + + #get se + agg_se <- get_se(inf_matrix, aux, p) # post process - result <- data.table(agg_sch$targets, agg_att, agg_se_result$se) + result <- data.table(agg_sch$targets, agg_att, agg_se$se) names(result) <- c("target", "att", "se") result[,`:=`(outcome = gt_result$outname, - att_ciub = att+se*agg_se_result$crit_val, - att_cilb = att-se*agg_se_result$crit_val)] + att_ciub = att+se*agg_se$crit_val, + att_cilb = att-se*agg_se$crit_val)] - return(result) + return(list(result = result, + inf_func = inf_matrix, + weight_matrix = agg_sch$agg_weights)) } +# scheme ------------------------------------------------------------------------ + #scheme for aggregation get_agg_sch <- function(gt_result, aux, p){ - - #setup stuff - weights <- aux$weights - id_cohorts <- aux$dt_inv[, G] - result_type <- p$result_type - agg_weights <- data.table() - + #create group_time - id_dt <- data.table(weight = weights/sum(weights), G = id_cohorts) + id_dt <- data.table(weight = aux$weights/sum(aux$weights), G = aux$dt_inv[, G]) pg_dt <- id_dt[, .(pg = sum(weight)), by = "G"] - group_time <- gt_result$gt |> merge(pg_dt, by = "G") - setorder(group_time, time, G) #change the order to match the order in gtatt - gt_count <- group_time[, .N] + group_time <- gt_result$gt |> merge(pg_dt, by = "G", sort = FALSE) + group_time[, mg := ming(G)] + group_time[, G1 := g1(G)] + group_time[, G2 := g2(G)] + setorder(group_time, time, mg, G1, G2) #change the order to match the order in gtatt + if(!all(names(gt_result$att) == group_time[, paste0(G, ".", time)])){stop("some bug makes gt misaligned, please report this to the maintainer. Thanks.")} + + #get the event-specific matrix, and available ggts + if(p$event_specific & !is.na(p$cohortvar2)){ + es <- get_es_scheme(group_time, aux, p) + group_time <- es$group_time #some gt may not have availble effect (ex: g1 == g2) + es_weight <- as.matrix(es$es_weight) + } else { + es_weight <- NULL + } + + #choose the target based on aggregation type + tg <- get_agg_targets(group_time, p) + group_time <- tg$group_time + targets <- tg$targets + - #nothing to do - if(result_type == "group_time"){ - return(list(targets = group_time[, paste0(G, ".", time)])) + #get aggregation weights + agg_weights <- data.table() + for(tar in targets){ #the order matters + + group_time[, weight := 0] #weight is 0 if not a target + group_time[target == tar & used, weight := pg/sum(pg)] + target_weights <- group_time[, .(weight)] |> transpose() + + agg_weights <- rbind(agg_weights, target_weights) } + group_time[, pg := NULL] - #choose the target based on aggregation type - group_time[, post := as.numeric(ifelse(time >= G, 1, -1))] - if (result_type == "dynamic") { - group_time[, target := time-G] - } else if (result_type == "group") { - group_time[, target := G*post] # group * treated - } else if (result_type == "time") { - group_time[, target := time*post] #calendar time * treated - } else if (result_type == "simple") { - group_time[, target := post] #treated / not treated - } - - targets <- sort(group_time[, unique(target)]) + agg_weights <- as.matrix(agg_weights) + + return(list(agg_weights = agg_weights, #a matrix of each target and gt's weight in it + targets = targets, + group_time = group_time, + es_weight = es_weight)) +} + +#get the target parameters +get_agg_targets <- function(group_time, p){ + group_time[, post := as.numeric(ifelse(time >= g1(G), 1, -1))] + switch(p$result_type, + dynamic = group_time[, target := time-g1(G)], + group = group_time[, target := g1(G)*post], # group * treated + time = group_time[, target := time*post], + simple = group_time[, target := post], + group_time = group_time[, target := paste0(g1(G), ".", time)], + group_group_time = group_time[, target := paste0(G, ".", time)] , + dynamic_sq = group_time[, target := paste0(time-g1(G), ".", g1(G)-g2(G))] + ) + + #allow custom aggregation scheme, this overides other stuff + if(!is.na(p$exper$aggregate_scheme)){ + group_time[, target := eval(str2lang(p$exper$aggregate_scheme))] + } + + targets <- group_time[, unique(target)] #for balanced cohort composition in dynamic setting #a cohort us only used if it is seen for all dynamic time - if(result_type == "dynamic" & !is.na(p$balanced_event_time)){ + if(p$result_type == "dynamic" & !is.na(p$balanced_event_time)){ - cohorts <- group_time[, .(max_et = max(time-G), - min_et = min(time-G)), by = "G"] + cohorts <- group_time[, .(max_et = max(target), #event time is target if in dynamic + min_et = min(target)), by = "G"] cohorts[, used := max_et >= p$balanced_event_time] #the max if(!cohorts[, any(used)]){stop("balanced_comp_range outside avalible range")} group_time[, used := G %in% cohorts[used == TRUE, G]] @@ -78,75 +137,67 @@ get_agg_sch <- function(gt_result, aux, p){ } else{group_time[, used := TRUE]} - for(tar in targets){ #the order matters - - group_time[, targeted := target == tar & used] - - total_pg <- group_time[targeted == TRUE, sum(pg)] #all gt that fits in the target - group_time[, weight := ifelse(targeted, pg/total_pg, 0)] #weight is 0 if not a target - target_weights <- group_time[, .(weight)] |> transpose() - - group_time[, targeted := NULL] - - agg_weights <- rbind(agg_weights, target_weights) - } + return(list(group_time = group_time, targets = targets)) - return(list(agg_weights = as.matrix(agg_weights), #a matrix of each target and gt's weight in it - targets = targets, - group_time = group_time)) } -#aggregated influence function -get_agg_inf <- function(gt_result, agg_sch, aux, p){ +# influence function ------------------------------------------------------------ + +get_weight_influence <- function(att, agg_sch, aux, p){ + + group <- agg_sch$group_time + + id_dt <- data.table(weight = aux$weights/sum(aux$weights), G = aux$dt_inv[, G]) + pg_dt <- id_dt[, .(pg = sum(weight)), by = "G"] + group <- group |> merge(pg_dt, by = "G", sort = FALSE) + + group[, time := as.integer(time)] + + if(is.na(p$cohortvar2)){ + group[, G := as.integer(G)] + setorder(group, time, G) + } else { + group[, mg := ming(G)] + group[, G1 := g1(G)] + group[, G2 := g2(G)] + setorder(group, time, mg, G1, G2) #sort + } + + if(!p$parallel){ + inf_weights <- sapply(asplit(agg_sch$agg_weights, 1), function (x){ + get_weight_influence_param(x, group, att, aux, p) + }) + } else { + inf_weights <- matrix(unlist(mclapply(asplit(agg_sch$agg_weights, 1), function (x){ + get_weight_influence_param(x, group, att, aux, p) + })), ncol = length(agg_sch$targets)) + } - if(p$result_type == "group_time"){return(gt_result$inf_func)} - inf_weights <- sapply(asplit(agg_sch$agg_weights, 1), function (x){ - get_weight_influence(x, gt_result$att, aux$weights, aux$dt_inv[, G], agg_sch$group_time[, .(G, time)]) - }) + return(inf_weights) - #aggregated influence function - inf_matrix <- (gt_result$inf_func %*% t(agg_sch$agg_weights)) + inf_weights - return(inf_matrix) } #influence from weight calculation -get_weight_influence <- function(agg_weights, gt_att, weights, id_cohorts, group) { - - keepers <- which(agg_weights > 0) +get_weight_influence_param <- function(agg_weights, group, gt_att, aux, p) { + + keepers <- which(agg_weights != 0) + group <- group[keepers,] + #moving this outside will create a g*t*id matrix, not really worth the memory + keepers_matrix <- as.matrix(aux$weights*sapply(1:nrow(group), function(g){as.integer(aux$dt_inv[, G] == group[g,G]) - group[g,pg]})) - id_dt <- data.table(weight = weights/sum(weights), G = id_cohorts) - pg_dt <- id_dt[, .(pg = sum(weight)), by = "G"] - group <- group |> merge(pg_dt, by = "G") + # gt weight = pgi / sum(pgi) + if1 <- keepers_matrix/sum(group[,pg]) #numerator + if2 <- rowSums(keepers_matrix) %*% t(group[,pg])/(sum(group[,pg])^2) #denominator - group[, time := as.integer(time)] - group[, G := as.integer(G)] - setorder(group, time, G) - - # effect of estimating weights in the numerator - if1 <- sapply(keepers, function(k) { - (weights*BMisc::TorF(id_cohorts == group[k,G]) - group[k,pg]) / - sum(group[keepers,pg]) - }) - # effect of estimating weights in the denominator - if2 <- base::rowSums(sapply(keepers, function(k) { - weights*BMisc::TorF(id_cohorts == group[k,G]) - group[k,pg] - })) %*% - t(group[keepers,pg]/(sum(group[keepers,pg])^2)) # return the influence function for the weights inf_weight <- (if1 - if2) %*% as.vector(gt_att[keepers]) inf_weight[abs(inf_weight) < sqrt(.Machine$double.eps)*10] <- 0 #fill zero + return(inf_weight) } -#aggregated att -get_agg_att <- function(gt_result, agg_sch, p){ - if(p$result_type == "group_time"){ - return(as.vector(gt_result$att)) - } else { - return(agg_sch$agg_weights %*% gt_result$att) - } -} +# se ------------------------------------------------------------------- #aggregated standard error get_se <- function(inf_matrix, aux, p) { @@ -174,9 +225,6 @@ get_se <- function(inf_matrix, aux, p) { #get sigma se <- dt_se[,(boot_top-boot_bot)/(qnorm(top_quant) - qnorm(bot_quant))] se[se < sqrt(.Machine$double.eps)*10] <- NA - - - } else { @@ -201,7 +249,491 @@ get_se <- function(inf_matrix, aux, p) { } -estimate_did <- function(dt_did, covvars, p, last_coef, cache){ +# auxilary steps in the main fastdid function + +get_exper_default <- function(exper, exper_args){ + for(arg in exper_args){ + if(is.null(exper[[arg]])){ + exper[[arg]] <- NA + } + } + + if(!is.na(exper$only_balance_2by2) & exper$only_balance_2by2){ #will create this col in the get_aux part + exper$filtervar <- "no_na" + exper$filtervar_post <- "no_na" + } + + return(exper) +} + +coerce_dt <- function(dt, p){ + + if(!is.na(p$cohortvar2)){return(coerce_dt_doub(dt, p))} #in doubledid.R + + #chcek if there is availble never-treated group + if(!is.infinite(dt[, max(G)]) & p$control_option != "notyet"){ + warning("no never-treated availble, effectively using not-yet-treated control") + } + + if(p$allow_unbalance_panel){ + dt_inv_raw <- dt[dt[, .I[1], by = unit]$V1] + setorder(dt_inv_raw, G) + dt_inv_raw[, new_unit := 1:.N] #let unit start from 1 .... N, useful for knowing which unit is missing + dt <- dt |> merge(dt_inv_raw[,.(unit, new_unit)], by = "unit", sort = FALSE) + dt[, unit := new_unit] + } + + setorder(dt, time, G, unit) #sort the dataset essential for the sort-once-quick-access + + #deal with time, coerice time to 1,2,3,4,5....... + time_periods <- dt[, unique(time)] + time_size <- length(time_periods) + + #TODO: this part is kinda ugly + time_offset <- min(time_periods) - 1 #assume time starts at 1, first is min after sort :) + gcol <- str_subset(names(dt), ifelse(is.na(p$cohortvar2), "G", "G1|G2")) + if(time_offset != 0){ + dt[, G := G-time_offset] + + dt[, time := time-time_offset] + time_periods <- time_periods - time_offset + } + + time_step <- 1 #time may not jump at 1 + if(any(time_periods[2:length(time_periods)] - time_periods[1:length(time_periods)-1] != 1)){ + time_step <- time_periods[2]-time_periods[1] + time_periods <- (time_periods-1)/time_step+1 + if(any(time_periods[2:length(time_periods)] - time_periods[1:length(time_periods)-1] != 1)){stop("time step is not uniform")} + dt[G != 1, G := (G-1)/time_step+1] + + dt[time != 1, time := (time-1)/time_step+1] + } + + #add the information to t + t <- list() + t$time_step <- time_step + t$time_offset <- time_offset + + if(nrow(dt) == 0){ + stop("no data after coercing the dataset") + } + + return(list(dt = dt, p = p, t = t)) + +} + +get_auxdata <- function(dt, p){ + + time_periods <- dt[, unique(time)] + id_size <- dt[, uniqueN(unit)] + + #construct the outcomes list for fast access later + #loop for multiple outcome + outcomes_list <- list() + for(outcol in p$outcomevar){ + outcomes <- list() + + if(!p$allow_unbalance_panel){ + for(i in time_periods){ + start <- (i-1)*id_size+1 + end <- i*id_size + outcomes[[i]] <- dt[seq(start,end), get(outcol)] + } + } else { + + for(i in time_periods){ + #populate a outcome vector of length N with outcome data in the right place + #NA is data gone or missing, will be addressed in estimate_did_rc + outcome_period <- rep(NA, id_size) + data_pos <- dt[time == i, unit] #units observed in i + outcome_period[data_pos] <- dt[time == i, get(outcol)] + outcomes[[i]] <- outcome_period + } + + } + + outcomes_list[[outcol]] <- outcomes + } + + #the time-invariant parts + if(!p$allow_unbalance_panel){ + dt_inv <- dt[1:id_size] + } else { + dt_inv <- dt[dt[, .I[1], by = unit]$V1] #the first observation + setorder(dt_inv, unit) #can't move this outside + } + + cohorts <- dt_inv[, unique(G)] + cohort_sizes <- dt_inv[, .(cohort_size = .N) , by = G] + + # the optional columns + varycovariates <- list() + if(!allNA(p$varycovariatesvar)){ + for(i in time_periods){ + start <- (i-1)*id_size+1 + end <- i*id_size + varycovariates[[i]] <- dt[seq(start,end), .SD, .SDcols = p$varycovariatesvar] + } + } else { + varycovariates <- NA + } + + #create na indicator for filtering + if(!is.na(p$exper$only_balance_2by2) & p$exper$only_balance_2by2){ + if("no_na" %in% names(dt)){stop("no_na is already in dt, consider using another column name")} + varnames <- unlist(p[str_ends(names(p), "var")], recursive = TRUE) #get all the argument that ends with "var" + varnames <- varnames[!varnames %in% c(p$timevar, p$unitvar, p$cohortvar) & !is.na(varnames) & !is.null(varnames)] + dt[, no_na := TRUE] + for(col in varnames){ #nothing can be + if(is.na(col)){next} + dt[is.na(get(col)), no_na := TRUE] + } + } + + # filters + filters <- list() + if(!is.na(p$exper$filtervar)){ + for(t in time_periods){ + filters[[t]] <- rep(NA, id_size) + data_pos <- dt[time == t, unit] #units observed in i + filters[[t]][data_pos] <- unlist(dt[time == t, .SD, .SDcols = p$exper$filtervar]) + } + } else { + filters <- NA + } + + if(!allNA(p$covariatesvar)){ + covariates <- dt_inv[,.SD, .SDcols = p$covariatesvar] + } else { + covariates <- NA + } + + if(!is.na(p$clustervar)){ + cluster <- dt_inv[, .SD, .SDcols = p$clustervar] |> unlist() + } else { + cluster <- NA + } + + if(!is.na(p$weightvar)){ + weights <- dt_inv[, .SD, .SDcols = p$weightvar] |> unlist() + weights <- weights/mean(weights) #normalize + } else { + weights <- rep(1, id_size) + } + + aux <- as.list(environment()) + aux$dt <- NULL + aux$p <- NULL + class(aux) <- "locked" + + return(aux) + +} + +convert_targets <- function(results, p, t){ + + if(!is.na(p$exper$aggregate_scheme)){return(results)} #no conversion back if use custom + + switch(p$result_type, + dynamic = { + results[, event_time := target] + setcolorder(results, "event_time", before = 1) + }, + group = { + results[, type := ifelse(target >= 0, "post", "pre")] + results[, cohort := recover_time(abs(target), t)] + setcolorder(results, "cohort", before = 1) + }, + time = { + results[, type := ifelse(target >= 0, "post", "pre")] + results[, time := recover_time(abs(target), t)] + setcolorder(results, "time", before = 1) + }, + simple = { + results[, type := ifelse(target >= 0, "post", "pre")] + }, + group_time = { + results[, cohort := as.numeric(str_split_i(target, "\\.", 1))] + results[, time := as.numeric(str_split_i(target, "\\.", 2))] + + #recover the time + results[, cohort := recover_time(cohort, t)] + results[, time := recover_time(time, t)] + + }, + group_group_time = { + results[, cohort := str_split_i(target, "\\.", 1)] + results[, time := as.numeric(str_split_i(target, "\\.", 2))] + + results[, cohort1 := g1(cohort)] + results[, cohort2 := g2(cohort)] + + results[, cohort1 := recover_time(cohort1, t)] + results[, cohort2 := recover_time(cohort2, t)] + results[, time := recover_time(time, t)] + results[, `:=`(cohort = NULL)] + }, + dynamic_sq = { + results[, event_time_1 := as.numeric(str_split_i(target, "\\.", 1))] + results[, event_stagger := as.numeric(str_split_i(target, "\\.", 2))] + } + ) + + results[, target := NULL] + + return(results) +} + +# small stuff --------- + +recover_time <- function(time, t){ + return(((time-1)*t$time_step)+1+t$time_offset) +} + +#locked list +#from: https://stackoverflow.com/questions/58776481/make-r-function-return-a-locked-immutable-list +.S3method("[[<-", "locked", function(value) {stop("Can't assign into locked object")}) +.S3method("[<-", "locked", function(value) {stop("Can't assign into locked object")}) +.S3method("$<-", "locked", function(value) {stop("Can't assign into locked object")}) +# a <- list(b = 1, c = 2) +# class(a) <- c("locked") + +#' Diagnose Event Confoundedness +#' +#' @param data A data.table containing the panel data. +#' @param timevar The name of the time variable. +#' @param cohortvar The name of the cohort (group) variable. +#' @param control_option The control units used for the DiD estimates. Default is "both". +#' @param weightvar The name of the weight variable, if not specified will cluster on unit level (optional). +#' @param cohortvar2 The name of the second cohort (group) variable. +#' @param anticipation Period of anticipation +#' +#' @return A data.table containing the contamination ratio for each group-time. +#' @export +#' +#' @examples +#' # simulated data +#' simdt <- sim_did(1e+03, 10, cov = "cont", second_cov = TRUE, +#' second_outcome = TRUE, second_cohort = TRUE) +#' dt <- simdt$dt +#' +#' diagnose_confound_event(dt, "time", "G", "G2") +#' +#' @keywords difference-in-differences fast computation panel data estimation did +diagnose_confound_event <- function(data, timevar, cohortvar, cohortvar2, control_option = "both", weightvar = NA, anticipation = 0){ + + if(is.na(weightvar)){ + dt <- data[, .(w = .N), by = c(timevar, cohortvar, cohortvar2)] + } else { + dt <- data[, .(w = sum(get(weightvar))), by = c(timevar, cohortvar, cohortvar2)] + } + setnames(dt, c("time", "G", "G2", "w")) + dt[, D2 := as.numeric(time >= G2)] + if(control_option == "notyet"){dt <- dt[!is.infinite(G)]} + + expo <- data.table() + for(g in dt[, unique(G)]){ + for(t in dt[, unique(time)]){ + + min_control_cohort <- ifelse(control_option == "never", Inf, max(g,t)+anticipation+1) + + base <- g-1-anticipation + dt[,`:=`(tp = 0, cp = 0, tb = 0, cb = 0)] + dt[G == g & time == t, tp := w/sum(w)] + dt[G >= min_control_cohort & time == t, cp := w/sum(w)] + dt[G == g & time == base, tb := w/sum(w)] + dt[G >= min_control_cohort & time == base, cb := w/sum(w)] + if(any(dt[, lapply(.SD, sum), .SDcols = c("tp", "cp", "tb", "cb")] == 0)){next} #rule out invalid gt + + gamma <- dt[, sum(D2*(tp-cp-tb+cb))] #btw, this is how simple staggered DiD can be without conditional PT + w = dt[G == g & time == t, sum(w)] #weight for the simple aggregation + expo <- expo |> rbind(data.table(G = g, time = t, gamma = gamma, w = w)) + + } + } + + diag_result <- list(gtexpo = expo) + #TODO: make this list instead + class(diag_result) <- c("confound_diagnosis") #add class for generics + return(diag_result) + +} + + + +# utils ----------------------------------- + +#g11 <- c(1,1,2,3,4) +#g22 <- c(2,1,3,2,4) +#GG <- as.factor(paste0(g11, ".", g22)) + +g1 <- function(GG){ + if(is.numeric(GG)){return(GG)} + return(as.numeric(str_split_i(GG, "-", 1))) +} + +g2 <- function(GG){ + return(as.numeric(str_split_i(GG, "-", 2))) +} + +ming <- function(GG){ + if(is.numeric(GG)){return(GG)} + else {pmin(g1(GG), g2(GG))} +} + +# overiden function ------------------------------------------------- + +coerce_dt_doub <- function(dt, p){ + + #chcek if there is availble never-treated group + if(!is.infinite(dt[, max(ming(G))]) & p$control_option != "notyet"){ + warning("no never-treated availble, effectively using not-yet-treated control") + } + + setnames(dt, "G", "G1") + dt[, mg := pmin(G1, G2)] + setorder(dt, time, mg, G1, G2, unit) #for sort one quick access + + if(p$allow_unbalance_panel){ #let unit start from 1 .... N, useful for knowing which unit is missing + dt_inv_raw <- dt[dt[, .I[1], by = unit]$V1] + setorder(dt_inv_raw, mg, G1, G2) + dt_inv_raw[, new_unit := 1:.N] + dt <- dt |> merge(dt_inv_raw[,.(unit, new_unit)], by = "unit", sort = FALSE) + dt[, unit := new_unit] + } + + #deal with time, coerice time to 1,2,3,4,5....... + time_periods <- dt[, unique(time)] + time_size <- length(time_periods) + + time_offset <- min(time_periods) - 1 #assume time starts at 1, first is min after sort :) + gcol <- c("G1", "G2") + if(time_offset != 0){ + dt[, c(gcol) := .SD-time_offset, .SDcols = gcol] + dt[, time := time-time_offset] + time_periods <- time_periods - time_offset + } + + time_step <- 1 #time may not jump at 1 + if(any(time_periods[2:length(time_periods)] - time_periods[1:length(time_periods)-1] != 1)){ + time_step <- time_periods[2]-time_periods[1] + time_periods <- (time_periods-1)/time_step+1 + if(any(time_periods[2:length(time_periods)] - time_periods[1:length(time_periods)-1] != 1)){stop("time step is not uniform")} + + for(g in gcol){ + dt[get(g) != 1, c(g) := (get(g)-1)/time_step+1] + } + + dt[time != 1, time := (time-1)/time_step+1] + } + dt[, mg := pmin(G1, G2)] + dt[, G := paste0(G1, "-", G2)] #create G once its finalized + + #add the information to t + t <- list() + t$time_step <- time_step + t$time_offset <- time_offset + + if(nrow(dt) == 0){ + stop("no data after coercing the dataset") + } + + return(list(dt = dt, p = p, t = t)) + +} + +# aggregation scheme ----------------------------------------------------------- + +#the scheme for getting event-specific effect +get_es_scheme <- function(group_time, aux, p){ + + es_group_time <- copy(group_time) #group_time with available es effect + #create lookup + es_group_time[, mg := ming(G)] + es_group_time[, G1 := g1(G)] + es_group_time[, G2 := g2(G)] + es_weight_list <- list() + + ggt <- as.list(seq_len(nrow(group_time))) + if(!p$parallel){ + es_weight_list <- lapply(ggt, get_es_ggt_weight, group_time, aux, p) + } else { + es_weight_list <- mclapply(ggt, get_es_ggt_weight, group_time, aux, p, mc.cores = getDTthreads()) + } + + valid_ggt <- which(!sapply(es_weight_list, is.null)) + es_group_time <- es_group_time[valid_ggt] #remove the ones without + es_weight_list <- es_weight_list[valid_ggt] + es_weight <- do.call(rbind, es_weight_list) #not sure if the dim is right + + return(list(group_time = es_group_time, es_weight = es_weight)) + +} + +#get the scheme for retriving group-group-time estimates +get_es_ggt_weight <- function(ggt, group_time, aux, p){ + + group_time <- copy(group_time) #avoid accidental modification + + group_time[, weight := 0] #reset + t <- group_time[ggt, time] + g1 <- group_time[ggt, G1] + g2 <- group_time[ggt, G2] + gg <- group_time[ggt, G] + + if(t < g2){ #direct pure effect + + group_time[ggt, weight := 1] #just use the observed effect + + } else if(g1 < g2) { #imputation = treat-pre + (control-post - control-pre) + + base_period <- g2 - 1 + if(base_period == t){return(NULL)} + + #get the cohorts + tb <- group_time[,G == gg & time == base_period] + c <- group_time[, G1 == g1 & G2 > max(t,base_period) & G2 != g2] + cp <- group_time[, c & time == t] + cb <- group_time[, c & time == base_period] + + #if any group have no available cohort, skip + if(sum(tb) == 0 | sum(cp) == 0 | sum(cb) == 0){return(NULL)} + + #assign the weights + group_time[tb, weight := pg/sum(pg)] + group_time[cp, weight := pg/sum(pg)] + group_time[cb, weight := -pg/sum(pg)] + + + } else if (g1 > g2) { #double did = (treat-post - treat-base) - (control-post - control-pre) + + base_period <- g1 - 1 + if(base_period == t){return(NULL)} + + #get the cohorts + tp <- group_time[,.I == ggt] + tb <- group_time[,G == gg & time == base_period] + c <- group_time[,G2 == g2 & G1 > max(t,base_period) & G1 != g1] + cp <- group_time[, c & time == t] + cb <- group_time[, c & time == base_period] + + #if any group have no available cohort, skip + if(sum(tp) == 0 | sum(tb) == 0 | sum(cp) == 0 | sum(cb) == 0){return(NULL)} + + #assign the weights + group_time[tp, weight := pg/sum(pg)] + group_time[tb, weight := -pg/sum(pg)] + group_time[cp, weight := -pg/sum(pg)] + group_time[cb, weight := pg/sum(pg)] + + } + + if(all(group_time[, weight] == 0)){return(NULL)} + return(group_time[, weight]) + +} + +estimate_did <- function(dt_did, covvars, p, cache){ #estimate did param <- as.list(environment()) @@ -213,8 +745,7 @@ estimate_did <- function(dt_did, covvars, p, last_coef, cache){ return(result) } -estimate_did_bp <- function(dt_did, covvars, p, - last_coef = NULL, cache){ +estimate_did_bp <- function(dt_did, covvars, p, cache){ # preprocess -------- oldn <- dt_did[, .N] @@ -236,10 +767,10 @@ estimate_did_bp <- function(dt_did, covvars, p, if(ipw){ if(is.null(cache)){ #if no cache, calcuate ipw #estimate the logit - prop_score_est <- suppressWarnings(parglm.fit(covvars, dt_did[, D], - family = stats::binomial(), start = last_coef, + prop_score_est <- suppressWarnings(parglm::parglm.fit(covvars, dt_did[, D], + family = stats::binomial(), weights = dt_did[, weights], - control = parglm.control(nthreads = getDTthreads()), + control = parglm.control(nthreads = ifelse(p$parallel, 1, getDTthreads())), #no parallel if already parallel intercept = FALSE)) class(prop_score_est) <- "glm" #trick the vcov function to think that this is a glm object to dispatch the write method #const is implicitly put into the ipw formula, need to incorporate it manually @@ -252,8 +783,8 @@ estimate_did_bp <- function(dt_did, covvars, p, logit_coef[is.na(logit_coef)|abs(logit_coef) > 1e10] <- 0 #put extreme value and na to 0 prop_score_fit <- fitted(prop_score_est) - if(max(prop_score_fit) >= 1){warning(paste0("support overlap condition violated for some group_time"))} - prop_score_fit <- pmin(1-1e-16, prop_score_fit) #for the ipw + if(max(prop_score_fit) >= 1-1e-10){warning(paste0("extreme propensity score: ", max(prop_score_fit), ", support overlap is likely to be violated"))} #<=0 (only in control) is fine for ATT since it is just not used + prop_score_fit <- pmin(1-1e-10, prop_score_fit) #for the ipw hess <- stats::vcov(prop_score_est) * n #for the influence function hess[is.na(hess)|abs(hess) > 1e10] <- 0 @@ -372,13 +903,11 @@ estimate_did_bp <- function(dt_did, covvars, p, inf_func <- rep(0, oldn) #the default needs to be 0 for the matrix multiplication inf_func_no_na <- inf_func_no_na * oldn / n #adjust the value such that mean over the whole id size give the right result inf_func[data_pos] <- inf_func_no_na - - return(list(att = att, inf_func = inf_func, logit_coef = logit_coef, #for next gt - cache_ps_fit = prop_score_fit, cache_hess = hess)) #for next outcome + + return(list(att = att, inf_func = inf_func, cache = list(ps = prop_score_fit, hess = hess))) #for next outcome } -estimate_did_rc <- function(dt_did, covvars, p, - last_coef = NULL, cache){ +estimate_did_rc <- function(dt_did, covvars, p, cache){ #TODO: skip if not enough valid data @@ -416,9 +945,9 @@ estimate_did_rc <- function(dt_did, covvars, p, #estimate the logit prop_score_est <- suppressWarnings(parglm.fit(covvars, dt_did[, D], - family = stats::binomial(), start = last_coef, + family = stats::binomial(), weights = dt_did[, weights*(inpre+inpost)*n/(n_pre+n_post)], #when seen in both pre and post have double weight - control = parglm.control(nthreads = getDTthreads()), + control = parglm.control(nthreads = ifelse(p$parallel, 1, getDTthreads())), intercept = FALSE)) #*(inpre+inpost) class(prop_score_est) <- "glm" #trick the vcov function to think that this is a glm object to dispatch the write method #const is implicitly put into the ipw formula, need to incorporate it manually @@ -451,20 +980,18 @@ estimate_did_rc <- function(dt_did, covvars, p, if(or){ - stop("or in RC should not be called right now") - control_bool_post <- dt_did[, D==0 & inpost] #control group and have obs in post period control_bool_pre <- dt_did[, D==0 & inpre] reg_coef_post <- stats::coef(stats::lm.wfit(x = covvars[control_bool_post,], y = dt_did[control_bool_post,post.y], w = dt_did[control_bool_post,weights])) - + reg_coef_pre <- stats::coef(stats::lm.wfit(x = covvars[control_bool_pre,], y = dt_did[control_bool_pre,pre.y], w = dt_did[control_bool_pre,weights])) - + if(anyNA(reg_coef_post)|anyNA(reg_coef_pre)){ stop("some outcome regression resulted in NA coefficients, likely cause by perfect colinearity") } - + #the control function from outcome regression dt_did[, or_delta_post := as.vector(tcrossprod(reg_coef_post, covvars))] dt_did[, or_delta_pre := as.vector(tcrossprod(reg_coef_pre, covvars))] @@ -520,27 +1047,27 @@ estimate_did_rc <- function(dt_did, covvars, p, if(or){ - stop("or in RC should not be called right now") + M1_post <- colSums(dt_did[, inpost*treat_ipw_weight/n] * covvars, na.rm = TRUE) / mean_wtpo M1_pre <- colSums(dt_did[, inpre*treat_ipw_weight/n] * covvars, na.rm = TRUE) / mean_wtpr M3_post <- colSums(dt_did[, inpost*cont_ipw_weight/n] * covvars, na.rm = TRUE) / mean_wcpo M3_pre <- colSums(dt_did[, inpre*cont_ipw_weight/n] * covvars, na.rm = TRUE) / mean_wcpr - + or_x_post <- dt_did[, inpost*weights*(1-D)] * covvars or_x_pre <- dt_did[, inpre*weights*(1-D)] * covvars or_ex_post <- dt_did[, inpost*weights*(1-D)*(post.y - or_delta_post)] * covvars or_ex_pre <- dt_did[, inpre*weights*(1-D)*(pre.y - or_delta_pre)] * covvars XpX_post <- crossprod(or_x_post, covvars)/n_post XpX_pre <- crossprod(or_x_pre, covvars)/n_pre - + #calculate alrw = eX (XpX)^-1 by solve XpX*alrw = ex, much faster since avoided inv asym_linear_or_post <- t(solve(XpX_post, t(or_ex_post))) asym_linear_or_pre <- t(solve(XpX_pre, t(or_ex_pre))) - + #or for treat inf_treat_or_post <- -asym_linear_or_post %*% M1_post #a negative sign here, since or_delta is subtracted from the att, THE PROBLEM inf_treat_or_pre <- -asym_linear_or_pre %*% M1_pre - + #or for control inf_cont_or_post <- -asym_linear_or_post %*% M3_post inf_cont_or_pre <- -asym_linear_or_pre %*% M3_pre @@ -580,8 +1107,13 @@ estimate_did_rc <- function(dt_did, covvars, p, inf_func <- rep(0, oldn) #the default needs to be 0 for the matrix multiplication inf_func[data_pos] <- inf_func_no_na_post - inf_func_no_na_pre - return(list(att = att, inf_func = inf_func, logit_coef = logit_coef, #for next gt - cache_ps_fit = prop_score_fit, cache_hess = hess)) #for next outcome + return(list(att = att, inf_func = inf_func, cache = list(ps = prop_score_fit, hess = hess))) #for next outcome +} + +# utilities ------ + +reverse_col <- function(x){ + return(x[,ncol(x):1]) } @@ -598,62 +1130,78 @@ estimate_gtatt <- function(aux, p){ return(outcome_results) } -estimate_gtatt_outcome <- function(y, aux, p, caches) { - - last_coef <- NULL - gt <- data.table() - gt_att <- c() - gt_inf_func <- data.table(placeholder = rep(NA, aux$id_size)) #populate with NA - - for(t in aux$time_periods){ - for(g in aux$cohorts){ - - # preparation ----------------------- - - gt_name <- paste0(g,".",t) - - #determine the 2x2 - base_period <- get_base_period(g,t,p) - did_setup <- get_did_setup(g, t, base_period, aux, p) - if(is.null(did_setup)){next} #no gtatt if no did setup - - #covariates matrix - covvars <- get_covvars(base_period, t, aux, p) - - #the 2x2 dataset - cohort_did <- data.table(did_setup, y[[t]], y[[base_period]], aux$weights) - names(cohort_did) <- c("D", "post.y", "pre.y", "weights") - - # estimate -------------------- - result <- tryCatch(estimate_did(dt_did = cohort_did, covvars, p, - last_coef, caches[[gt_name]]), - error = function(e){stop("DiD estimation failed for group-", recover_time(g, p$time_offset, p$time_step) , - " time-", recover_time(t, p$time_offset, p$time_step), ": ", e)}) - - # post process -------------------- - - #collect the result - gt <- rbind(gt, data.table(G = g, time = t)) #the sequence matters for the weights - gt_att <- c(gt_att, att = result$att) - gt_inf_func[[gt_name]] <- result$inf_func - - #caches - last_coef <- result$logit_coef #for faster convergence in next iter - if(is.null(caches[[gt_name]])){caches[[gt_name]] <- result$cache_fit} - - rm(result) - - } - } +#gtatt for each outcome +estimate_gtatt_outcome <- function(y, aux, p, caches) { + + treated_cohort <- aux$cohorts[!is.infinite(ming(aux$cohorts))] #otherwise would try to calculate the pre-period of nevertreated in varying base period lol + gt_all <- expand.grid(g = treated_cohort, t = aux$time_periods, stringsAsFactors = FALSE) |> transpose() |> as.list() #first loop t then g + + #main estimation + if(!p$parallel){ + gt_results <- lapply(gt_all, estimate_gtatt_outcome_gt, y, aux, p, caches) + } else { + gt_results <- mclapply(gt_all, estimate_gtatt_outcome_gt, y, aux, p, caches) + } + + #post process + gt_results <- gt_results[which(!sapply(gt_results, is.null))] #remove the ones with no valid didsetup + if(length(gt_results) == 0){stop("no valid group-times att to compute")} + + gt <- lapply(gt_results, function(x) {x$gt}) |> as.data.table() |> transpose() + names(gt) <- c("G", "time") + gt[, time := as.integer(time)] + + gt_att <- lapply(gt_results, function(x) {x$result$att}) + gt_inf_func <- lapply(gt_results, function(x) {x$result$inf_func}) + caches <- lapply(gt_results, function(x) {x$result$cache}) + + gt_names <- gt[,paste0(G,".",time)] + names(gt_att) <- gt_names + names(gt_inf_func) <- gt_names + names(caches) <- gt_names + + gt_inf_func <- do.call(cbind, gt_inf_func) + gt_att <- do.call(cbind, gt_att) |> t() + + return(list(est = list(gt = gt, att = gt_att, inf_func = gt_inf_func), caches = caches)) +} + +#gtatt for each outcome, each gt +estimate_gtatt_outcome_gt <- function(gt, y, aux, p, caches){ - gt_inf_func[,placeholder := NULL] - names(gt_att) <- names(gt_inf_func) - gt_inf_func <- as.matrix(gt_inf_func) + g <- gt[1] + t <- as.numeric(gt[2]) + + #find base time + gt_name <- paste0(g,".",t) + base_period <- get_base_period(g,t,p) + if(t == base_period | #no treatment effect for the base period + !base_period %in% aux$time_period){ #base period out of bounds + return(NULL) + } + #find treatment and control group + did_setup <- get_did_setup(g,t, base_period, aux, p) + valid_tc_groups <- any(did_setup == 1) & any(did_setup == 0) #if takes up too much time, consider use collapse anyv, but right now quite ok + if(!isTRUE(valid_tc_groups)){return(NULL)} #no treatment group or control group #isTRUE for na as false + + #covariates matrix + covvars <- get_covvars(base_period, t, aux, p) + + #the 2x2 dataset + cohort_did <- data.table(did_setup, y[[t]], y[[base_period]], aux$weights) + names(cohort_did) <- c("D", "post.y", "pre.y", "weights") + + # estimate -------------------- + result <- tryCatch(estimate_did(dt_did = cohort_did, covvars, p, caches[[gt_name]]), + error = function(e){stop("2-by-2 DiD failed for internal group-time ",g , + "-", t, ": ", e)}) + return(list(gt = gt, result = result)) - return(list(est = list(gt = gt, att = gt_att, inf_func = gt_inf_func), caches = caches)) } + get_base_period <- function(g,t,p){ + g <- ming(g) #for two period if(p$base_period == "universal"){ base_period <- g-1-p$anticipation } else { @@ -664,54 +1212,41 @@ get_base_period <- function(g,t,p){ get_did_setup <- function(g, t, base_period, aux, p){ - treated_cohorts <- aux$cohorts[!is.infinite(aux$cohorts)] - - #get the range of cohorts - if(p$control_option == "never"){ - min_control_cohort <- Inf - } else { - min_control_cohort <- max(t, base_period)+p$anticipation+1 - } - max_control_cohort <- ifelse(p$control_option == "notyet", max(treated_cohorts), Inf) - - #experimental + treated_cohorts <- aux$cohorts[!is.infinite(ming(aux$cohorts))] + + min_control_cohort <- ifelse(p$control_option == "never", Inf, max(t, base_period)+p$anticipation+1) + max_control_cohort <- ifelse(p$control_option == "notyet", max(ming(treated_cohorts)), Inf) + if(!is.na(p$exper$max_control_cohort_diff)){ - max_control_cohort <- min(g+p$exper$max_control_cohort_diff, max(treated_cohorts)) - } - if(!is.na(p$exper$min_control_cohort_diff)){ - min_control_cohort <- max(g+p$exper$min_control_cohort_diff, min(treated_cohorts)) - } - if((!is.na(p$exper$max_dynamic))){ - if(t-g > p$exper$max_dynamic){return(NULL)} - } - if(!is.na(p$exper$min_dynamic)){ - if(t-g < p$exper$min_dynamic){return(NULL)} - } - - # invalid gt - if(t == base_period | #no treatment effect for the base period - base_period < min(aux$time_periods) | #no treatment effect for the first period, since base period is not observed - g >= max_control_cohort | #no treatment effect for never treated or the last treated cohort (for not yet notyet) - t >= max_control_cohort | #no control available if the last cohort is treated too - min_control_cohort > max_control_cohort){ #no control avalilble, most likely due to anticipation - return(NULL) + max_control_cohort <- min(g+p$exper$max_control_cohort_diff, max_control_cohort) } #select the control and treated cohorts did_setup <- rep(NA, aux$id_size) - did_setup[get_cohort_pos(aux$cohort_sizes, min_control_cohort, max_control_cohort)] <- 0 - did_setup[get_cohort_pos(aux$cohort_sizes, g)] <- 1 #treated cannot be controls, assign treated after control to overwrite + did_setup[get_control_pos(aux$cohort_sizes, min_control_cohort, max_control_cohort)] <- 0 + did_setup[get_treat_pos(aux$cohort_sizes, g)] <- 1 #treated cannot be controls, assign treated after control to overwrite if(!is.na(p$exper$filtervar)){ did_setup[!aux$filters[[base_period]]] <- NA #only use units with filter == TRUE at base period + + } + if(!is.na(p$exper$filtervar_post)){ + did_setup[!aux$filters[[t]]] <- NA #only use units with filter == TRUE at target period } return(did_setup) } -get_cohort_pos <- function(cohort_sizes, start_cohort, end_cohort = start_cohort){ - start <- cohort_sizes[G < start_cohort, sum(cohort_size)]+1 - end <- cohort_sizes[G <= end_cohort, sum(cohort_size)] +get_control_pos <- function(cohort_sizes, start_cohort, end_cohort = start_cohort){ + start <- cohort_sizes[ming(G) < start_cohort, sum(cohort_size)]+1 + end <- cohort_sizes[ming(G) <= end_cohort, sum(cohort_size)] + return(start:end) +} + +get_treat_pos <- function(cohort_sizes, treat_cohort){ #need to separate for double did, need to match exact g-g-t + index <- which(cohort_sizes[,G] == treat_cohort) + start <- ifelse(index == 1, 1, cohort_sizes[1:(index-1), sum(cohort_size)]+1) + end <- cohort_sizes[1:index, sum(cohort_size)] return(start:end) } @@ -767,10 +1302,17 @@ get_covvars <- function(base_period, t, aux, p){ #' @param anticipation periods with aniticipation (delta in CS, default is 0, reference period is g - delta - 1). #' @param exper the list of experimental features, for features that are not in CSDID originally. Generally less tested. #' @param base_period same as did +#' @param full return the full result, like the influence function, call, etc,. Default is false. +#' @param parallel whether to use parallization (only available on unix systesm like Mac or Linux.) +#' @param cohortvar2 The name of the second cohort (group) variable. +#' @param event_specific Whether to recover event_specific treatment effect or report combined effect. #' -#' @import data.table parglm stringr dreamerr BMisc -#' @importFrom stats quantile vcov sd binomial fitted qnorm rnorm as.formula +#' @import data.table stringr dreamerr ggplot2 +#' @importFrom stats quantile vcov sd binomial fitted qnorm rnorm as.formula weighted.mean #' @importFrom collapse allNA fnrow whichNA fnunique fsum na_insert +#' @importFrom parallel mclapply +#' @importFrom BMisc multiplier_bootstrap +#' @importFrom parglm parglm.fit parglm.control #' @return A data.table containing the estimated treatment effects and standard errors. #' @export #' @@ -810,9 +1352,10 @@ fastdid <- function(data, weightvar=NA,clustervar=NA, covariatesvar = NA, varycovariatesvar = NA, copy = TRUE, validate = TRUE, anticipation = 0, base_period = "universal", - exper = NULL){ - - # validation -------------------------------------------------------- + exper = NULL, full = FALSE, parallel = FALSE, + cohortvar2 = NA, event_specific = TRUE){ + + # preprocess -------------------------------------------------------- if(!is.data.table(data)){ warning("coercing input into a data.table.") @@ -824,242 +1367,135 @@ fastdid <- function(data, p <- as.list(environment()) #collect everything besides data p$data <- NULL p$dt <- NULL - p$exper <- get_exper_default(p$exper) + + exper_args <- c("filtervar", "filtervar_post", "only_balance_2by2", + "aggregate_scheme", "max_control_cohort_diff") + p$exper <- get_exper_default(p$exper, exper_args) + class(p) <- "locked" #no more changes! validate_argument(dt, p) - - # validate and throw away not legal data + #change name for main columns setnames(dt, c(timevar, cohortvar, unitvar), c("time", "G", "unit")) - dt <- validate_dt(dt, p) + if(!is.na(p$cohortvar2)){setnames(dt, p$cohortvar2, "G2")} - # preprocess ----------------------------------------------------------- + # validate and throw away not legal data + dt <- validate_dt(dt, p) #make dt conform to the WLOG assumptions of fastdid coerce_result <- coerce_dt(dt, p) #also changed dt - dt <- coerce_result$dt - p <- coerce_result$p # get auxiliary data - aux <- get_auxdata(dt, p) - - # main part ------------------------------------------------- + aux <- get_auxdata(coerce_result$dt, p) - gt_result_list <- estimate_gtatt(aux, p) + # main estimation ------------------------------------------------- - all_result <- aggregate_gt(gt_result_list, aux, p) + gt_result_list <- estimate_gtatt(aux, p) + agg_result <- aggregate_gt(gt_result_list, aux, p) - #convert "targets" back to meaningful parameter identifiers like cohort 1 post, time 2 post - all_result <- convert_targets(all_result, p) + # post process ------------------------------------------- - return(all_result) + #convert "targets" back to meaningful parameters + est_results <- convert_targets(agg_result$est, p, coerce_result$t) + class(est_results) <- c("fastdid_est", class(est_results)) -} - -# small steps ---------------------------------------------------------------------- - -get_exper_default <- function(exper){ - exper_args <- c("filtervar", "min_dynamic", "max_dynamic", "min_control_cohort_diff", "max_control_cohort_diff") - for(arg in exper_args){ - if(is.null(exper[[arg]])){ - exper[[arg]] <- NA - } + if(!p$full){ + return(est_results) + } else { + full_result <- list( + call = p, + estimate = est_results, + gt_estimate = gt_result_list, + agg_inf_func = agg_result$inf_func, + agg_weight_matrix = agg_result$agg_weight_matrix + ) + class(full_result) <- c("fastdid_result", class(full_result)) + return(full_result) } - return(exper) } -coerce_dt <- function(dt, p){ - +#' @export +plot.fastdid_result <- function(x,...){return(plot(x$estimate,...))} - #chcek if there is availble never-treated group - if(!is.infinite(dt[, max(G)]) & p$control_option != "notyet"){ - warning("no never-treated availble, switching to not-yet-treated control") - p$control_option <- "notyet" - } - - if(p$allow_unbalance_panel){ - dt_inv_raw <- dt[dt[, .I[1], by = unit]$V1] - setorder(dt_inv_raw, G) - dt_inv_raw[, new_unit := 1:.N] #let unit start from 1 .... N, useful for knowing which unit is missing - dt <- dt |> merge(dt_inv_raw[,.(unit, new_unit)], by = "unit") - dt[, unit := new_unit] - } - - setorder(dt, time, G, unit) #sort the dataset essential for the sort-once-quick-access - #deal with time, coerice time to 1,2,3,4,5....... - time_periods <- dt[, unique(time)] - time_size <- length(time_periods) - - time_offset <- min(time_periods) - 1 #assume time starts at 1, first is min after sort :) - if(time_offset != 0){ - dt[, G := G-time_offset] - dt[, time := time-time_offset] - time_periods <- time_periods - time_offset - } - - time_step <- 1 #time may not jump at 1 - if(any(time_periods[2:length(time_periods)] - time_periods[1:length(time_periods)-1] != 1)){ - time_step <- time_periods[2]-time_periods[1] - time_periods <- (time_periods-1)/time_step+1 - if(any(time_periods[2:length(time_periods)] - time_periods[1:length(time_periods)-1] != 1)){stop("time step is not uniform")} - dt[G != 1, G := (G-1)/time_step+1] - dt[time != 1, time := (time-1)/time_step+1] +#' @export +plot.fastdid_est <- function(x,...){ + #dispatch to functions by result_type + if("event_time" %in% names(x)){ + plot <- plot_did_dynamics(x) + } else if ("cohort" %in% names(x) & "time" %in% names(x)){ + stop("don't have plot for group-time for now") + } else if ("cohort" %in% names(x)){ + plot <- plot_did_dynamics(x, "cohort") + } else if ("time" %in% names(x)){ + plot <- plot_did_dynamics(x, "time") } - #add the information to p - p$time_step = time_step - p$time_offset = time_offset - - if(nrow(dt) == 0){ - stop("no data after coercing the dataset") + if(x[,uniqueN(outcome)] > 1){ + plot <- plot + facet_wrap(~outcome, scales = "free") } - return(list(dt = dt, p = p)) - + return(plot) } -get_auxdata <- function(dt, p){ - - time_periods <- dt[, unique(time)] - id_size <- dt[, uniqueN(unit)] - - setorder(dt, time, G, unit) #sort the dataset essential for the sort-once-quick-access +#' Create plot for Difference-in-Differences (DiD) analysis. +#' +#' This function generates an plot based on the results of a DiD analysis. +#' +#' @param x A data table containing the results of the DiD analysis. It should include columns for 'att' (average treatment effect), 'se' (standard error), and 'event_time' (time points). +#' @param margin the x-axis of the plot +#' +#' @return A ggplot2 object representing the event study plot. +#' @export +plot_did_dynamics <-function(x, margin = "event_time"){ - #construct the outcomes list for fast access later - #loop for multiple outcome - outcomes_list <- list() - for(outcol in p$outcomevar){ - outcomes <- list() + #find the base_period + if(margin == "event_time"){ + et_range <- min(x[, event_time]):max(x[, event_time]) + base_time <- et_range[!et_range %in% x[, unique(event_time)]] + if(length(base_time)!=1){stop("missing more than one period")} - if(!p$allow_unbalance_panel){ - for(i in time_periods){ - start <- (i-1)*id_size+1 - end <- i*id_size - outcomes[[i]] <- dt[seq(start,end), get(outcol)] - } + #add the base period + if("outcome" %in% names(x)){ + base_row <- data.table(att = 0, se = 0, event_time = base_time, outcome = x[, unique(outcome)], att_ciub = 0, att_cilb = 0) } else { - - for(i in time_periods){ - #populate a outcome vector of length N with outcome data in the right place - #NA is data gone or missing, will be addressed in estimate_did_rc - outcome_period <- rep(NA, id_size) - data_pos <- dt[time == i, unit] #units observed in i - outcome_period[data_pos] <- dt[time == i, get(outcol)] - outcomes[[i]] <- outcome_period - } - - } - - outcomes_list[[outcol]] <- outcomes - } - - #the time-invariant parts - if(!p$allow_unbalance_panel){ - dt_inv <- dt[1:id_size] - } else { - dt_inv <- dt[dt[, .I[1], by = unit]$V1] #the first observation - setorder(dt_inv, unit) #can't move this outside - } - - cohorts <- dt_inv[, unique(G)] - cohort_sizes <- dt_inv[, .(cohort_size = .N) , by = G] - - # the optional columns - varycovariates <- list() - if(!allNA(p$varycovariatesvar)){ - for(i in time_periods){ - start <- (i-1)*id_size+1 - end <- i*id_size - varycovariates[[i]] <- dt[seq(start,end), .SD, .SDcols = p$varycovariatesvar] - } - } else { - varycovariates <- NA - } - - # filters - filters <- list() - if(!is.na(p$exper$filtervar)){ - for(i in time_periods){ - start <- (i-1)*id_size+1 - end <- i*id_size - filters[[i]] <- dt[seq(start,end), .SD, .SDcols = p$exper$filtervar] + base_row <- data.table(att = 0, se = 0, event_time = base_time, att_ciub = 0, att_cilb = 0) } - } else { - filters <- NA - } - - if(!allNA(p$covariatesvar)){ - covariates <- dt_inv[,.SD, .SDcols = p$covariatesvar] - } else { - covariates <- NA + x <- x |> rbind(base_row, fill = TRUE) } - if(!is.na(p$clustervar)){ - cluster <- dt_inv[, .SD, .SDcols = p$clustervar] |> unlist() - } else { - cluster <- NA - } + plot <- x |> + ggplot() + + geom_hline(yintercept = 0, linetype = 'dashed', col = 'red') + + geom_point( aes(x = eval(str2lang(margin)), y = att), color = "black") + #point est + geom_errorbar( aes(x = eval(str2lang(margin)), ymin = att_cilb, ymax = att_ciub), + width = 0.1, linetype = "dashed") + #CI + labs(x = margin) - if(!is.na(p$weightvar)){ - weights <- dt_inv[, .SD, .SDcols = p$weightvar] |> unlist() - } else { - weights <- rep(1, id_size) + if(margin == "event_time"){ + plot <- plot + geom_line( aes(x = eval(str2lang(margin)), y = att), color = "black") #point est } - - aux <- list(time_periods = time_periods, - id_size = id_size, - outcomes_list = outcomes_list, - dt_inv= dt_inv, - cohorts = cohorts, - cohort_sizes = cohort_sizes, - varycovariates = varycovariates, - covariates = covariates, - cluster = cluster, - weights = weights, - filters = filters) - - return(aux) + + return(plot) } -convert_targets <- function(results, p){ - - result_type <- p$result_type +# confound diagnosis --------------------- - if(result_type == "dynamic"){ - setnames(results, "target", "event_time") - - } else if (result_type == "cohort"){ - - results[, type := ifelse(target >= 0, "post", "pre")] - results[, target := recover_time(abs(target), p$time_offset, p$time_step)] - setnames(results, "target", "cohort") - - } else if (result_type == "calendar"){ - - results[, type := ifelse(target >= 0, "post", "pre")] - results[, target := recover_time(abs(target), p$time_offset, p$time_step)] - setnames(results, "target", "time") - - } else if (result_type == "group_time"){ - - results[, cohort := as.numeric(str_split_i(target, "\\.", 1))] - results[, time := as.numeric(str_split_i(target, "\\.", 2))] - - #recover the time - results[, cohort := recover_time(cohort, p$time_offset, p$time_step)] - results[, time := recover_time(time, p$time_offset, p$time_step)] - - results[, target := NULL] - - } else if (result_type == "simple") { - results[, type := ifelse(target >= 0, "post", "pre")] - results[, target := NULL] - } - return(results) +#' @export +print.confound_diagnosis <- function(x,...) { + cat(paste0("event correlation: ", mean(x))) +} + +#' @export +plot.confound_diagnosis <- function(x,...) { + + x$gtexpo[time >= G] |> ggplot( aes(x = as.factor(time-G), y = as.factor(G), fill = gamma)) + geom_tile() + + scale_fill_distiller(palette = "RdBu", limits = c(-1, 1)) + + labs(subtitle = paste0("event correlation: ", mean(x)), y = "G", x = "event_time") } -recover_time <- function(time, time_offset, time_step){ - return(((time-1)*time_step)+1+time_offset) +#' @export +mean.confound_diagnosis <- function(x,...){ + return(x$gtexpo[time>=G,weighted.mean(gamma, w)]) } NULL @@ -1073,65 +1509,11 @@ utils::globalVariables(c('.','agg_weight','att','att_cont','att_treat','attgt',' 'V1','att_cont_post','att_cont_pre','att_treat_post','att_treat_pre','inpost','inpre','max_et','min_et','new_unit','or_delta','or_delta_post','or_delta_pre','targeted','used', "timevar", "cohortvar", "unitvar", "outcomevar", "control_option", "result_type", "balanced_event_time", "control_type", "allow_unbalance_panel", "boot", "biters", "weightvar", "clustervar", "covariatesvar", "varycovariatesvar", "filtervar", - "copy", "validate", "max_control_cohort_diff", "anticipation", "min_control_cohort_diff", "base_period", "post", "att_ciub", "att_cilb", "cband", "alpha")) - - -#' Create an event study plot for Difference-in-Differences (DiD) analysis. -#' -#' This function generates an event study plot based on the results of a DiD analysis. -#' -#' @param dt A data table containing the results of the DiD analysis. It should include columns for 'att' (average treatment effect), 'se' (standard error), and 'event_time' (time points). -#' @param graphname A character string specifying the title of the plot (default is "event study plot"). -#' @param note A character string for adding additional notes or comments to the plot (default is empty). -#' -#' @return A ggplot2 object representing the event study plot. -#' @export - -plot_did_dynamics <-function(dt, - graphname = "event study plot", note = "" -){ - - if (!requireNamespace("ggplot2", quietly = TRUE)) { - warning("The ggplot2 package must be installed to use plotting functions") - #Either exit or do something without rgl - return(NULL) - } - - - #find the base_period - et_range <- min(dt[, event_time]):max(dt[, event_time]) - base_time <- et_range[!et_range %in% dt[, unique(event_time)]] - if(length(base_time)!=1){stop("missing more then one period")} - - #add the base period - if("outcome" %in% names(dt)){ - base_row <- data.table(att = 0, se = 0, event_time = base_time, outcome = dt[, unique(outcome)], att_ciub = 0, att_cilb = 0) - } else { - base_row <- data.table(att = 0, se = 0, event_time = base_time, att_ciub = 0, att_cilb = 0) - } - - #dt <- dt |> add_base_row(base_row) - dt <- dt |> rbind(base_row, fill = TRUE) - - - figure <- dt |> - ggplot2::ggplot() + - ggplot2::geom_hline(yintercept = 0, linetype = 'dashed', col = 'red') - - figure <- figure + ggplot2::geom_line(ggplot2::aes(x = event_time, y = att), color = "black") + - ggplot2::geom_point(ggplot2::aes(x = event_time, y = att), color = "black") + - ggplot2::geom_errorbar(ggplot2::aes(x = event_time, ymin = att_cilb, ymax = att_ciub), - width = 0.1, linetype = "dashed") + "copy", "validate", "max_control_cohort_diff", "anticipation", "min_control_cohort_diff", "base_period", "post", "att_ciub", "att_cilb", "cband", "alpha", + "G2", "G1", "mg", "cohort1", "cohort2", "event_time_1", "event_time_2", + "D2", "attgt2", "event", "atu2", "y01", "y10", "y11", "tau2", "parallel", + "tp", "cp", "tb", "cb", "no_na", "event_stagger")) - if("outcome" %in% names(dt)){ - figure <- figure + ggplot2::facet_wrap(~outcome, scales = "free") - } - - figure <- figure + ggplot2::theme_classic() - - return(figure) - -} #' Simulate a Difference-in-Differences (DiD) dataset #' @@ -1151,6 +1533,8 @@ plot_did_dynamics <-function(dt, #' @param stratify Whether to stratify the dataset based on a binary covariate. #' @param treatment_assign The method for treatment assignment ("latent" or "uniform"). #' @param vary_cov include time-varying covariates +#' @param second_cohort include confounding events +#' @param confound_ratio extent of event confoundedness #' #' @return A list containing the simulated dataset (dt) and the treatment effect values (att). #' @@ -1164,7 +1548,7 @@ plot_did_dynamics <-function(dt, #' @export sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size = 0.001, cov = "no", hetero = "all", second_outcome = FALSE, second_cov = FALSE, vary_cov = FALSE, na = "none", - balanced = TRUE, seed = NA, stratify = FALSE, treatment_assign = "latent"){ + balanced = TRUE, seed = NA, stratify = FALSE, treatment_assign = "latent", second_cohort = FALSE, confound_ratio = 1){ if(!is.na(seed)){set.seed(seed)} @@ -1193,7 +1577,8 @@ sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size #assign treated group based on a latent related to X if(treatment_assign == "latent"){ - dt_i[, treat_latent := x*0.2 + x2*0.2 + rnorm(sample_size)] #unit with larger X tend to be treated and treated earlier + ep1 <- rnorm(sample_size) + dt_i[, treat_latent := x*0.2 + x2*0.2 + ep1] #unit with larger X tend to be treated and treated earlier untreated_thres <- quantile(dt_i$treat_latent, untreated_prop) dt_i[treat_latent <= untreated_thres, G := Inf] #unit with low latent is never treated @@ -1210,6 +1595,28 @@ sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size dt_i[,G := floor((unit-1)/(sample_size/time_period))] dt_i[G < 2, G := Inf] } + + if(second_cohort){ + setnames(dt_i, "G", "G2") + if(treatment_assign == "latent"){ + dt_i[, treat_latent := x*0.2 + x2*0.2 + ep1*confound_ratio + rnorm(sample_size)] #unit with larger X tend to be treated and treated earlier + untreated_thres <- quantile(dt_i$treat_latent, untreated_prop) + dt_i[treat_latent <= untreated_thres, G := Inf] #unit with low latent is never treated + + cohort_prop <- (1-untreated_prop)/(time_period-1) + last_treat_thres <- untreated_thres + for(t in time_period:2){ #unit start getting treated in t = 2 + treat_thres <- quantile(dt_i$treat_latent, untreated_prop + cohort_prop*(time_period - t + 1)) + dt_i[treat_latent <= treat_thres & treat_latent > last_treat_thres, G := t] + last_treat_thres <- treat_thres + } + rm(t) + } else if (treatment_assign == "uniform"){ + #when treatment is set to 'uniform', untreated propensity is fixed + dt_i[,G := floor((unit-1)/(sample_size/time_period))] + dt_i[G < 2, G := Inf] + } + } #assign unit FE dt_i[, unit_fe := rnorm(sample_size)] @@ -1225,11 +1632,9 @@ sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size dt <- dt |> merge(dt_i, by = "unit") dt <- dt |> merge(dt_t, by = "time") - dt[, D := as.integer(time >= G)] - #add time_varying covariates if(vary_cov){ - dt[, xvar := pmin(G, time_period+4)*time+rnorm(sample_size*time_period, 0,30)] #should be confounding....? + dt[, xvar := pmin(G, time_period+4)*time^(1/3)*0.1+rnorm(sample_size*time_period, 0,10)] #should be confounding....? } else { dt[, xvar := 1] } @@ -1237,6 +1642,7 @@ sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size #untreated potential outcomes dt[, y0 := unit_fe + time_fe + (x+x2)*x_trend + xvar + rnorm(sample_size*time_period, sd = epsilon_size)] + dt[, D := as.integer(time >= G)] #generate gtatt att <- CJ(G = 1:time_period, time = 1:time_period) if(hetero == "all"){ @@ -1247,15 +1653,47 @@ sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size } } att[time < G, attgt := 0] #no anticipation - dt <- dt |> merge(att, by = c("time", "G"), all.x = TRUE, all.y = FALSE) dt[is.na(attgt), attgt := 0] dt[, tau := attgt*s] - #potential outcome - dt[, y1 := y0 + tau] - dt[, y := y1*D + y0*(1-D)] - dt <- dt[, .SD, .SDcols = c("time", "G", "unit", "x", "x2", "y", "s", "xvar")] + if(second_cohort){ + dt[, D2 := as.integer(time >= G2)] + #generate gtatt + att2 <- CJ(G2 = 1:time_period, time = 1:time_period) + if(hetero == "all"){ + att2[, attgt2 := rnorm(time_period*time_period, mean = 2, sd = 1)] + } else if (hetero == "dynamic"){ + for(event_t in 0:max(att2[,time-G2])){ + att2[time - G2 == event_t, attgt2 := rnorm(1, mean = 2, sd = 1)] + } + } + att2[time < G2, attgt2 := 0] #no anticipation + + #add att2 to att + att[, event := 1] + att2[, event := 2] + att <- rbind(att, att2, fill = TRUE) + + dt <- dt |> merge(att2, by = c("time", "G2"), all.x = TRUE, all.y = FALSE) + dt[is.na(attgt2), attgt2 := 0] + dt[, tau2 := attgt2*s] + + #potential outcome + dt[, y10 := y0 + tau] + dt[, y01 := y0 + tau2] + dt[, y11 := y0 + tau + tau2] + dt[, y := y0*(1-D)*(1-D2)+ y10*D*(1-D2) + y01*(1-D)*D2 + y11*D*D2] + cols <- c("time", "G", "G2", "unit", "x", "x2", "y", "s", "xvar") + + } else { + #potential outcome + dt[, y1 := y0 + tau] + dt[, y := y1*D + y0*(1-D)] + cols <- c("time", "G", "unit", "x", "x2", "y", "s", "xvar") + } + + dt <- dt[, .SD, .SDcols = cols] #additional ----------------- @@ -1282,15 +1720,6 @@ sim_did <- function(sample_size, time_period, untreated_prop = 0.3, epsilon_size -set_max_thread <- function(){ - data.table::setDTthreads(0) - options(kit.nThread = getDTthreads()) -} - -reverse_col <- function(x){ - return(x[,ncol(x):1]) -} - validate_argument <- function(dt, p){ if(!p$validate){ @@ -1317,19 +1746,19 @@ validate_argument <- function(dt, p){ check_set_arg(control_option, "match", .choices = c("both", "never", "notyet"), .up = 1) #kinda bad names since did's notyet include both notyet and never check_set_arg(control_type, "match", .choices = c("ipw", "reg", "dr"), .up = 1) check_set_arg(base_period, "match", .choices = c("varying", "universal"), .up = 1) - check_arg(copy, validate, boot, allow_unbalance_panel, cband, "scalar logical", .up = 1) + check_arg(copy, validate, boot, allow_unbalance_panel, cband, parallel, "scalar logical", .up = 1) check_arg(anticipation, alpha, "scalar numeric", .up = 1) if(!is.na(balanced_event_time)){ if(result_type != "dynamic"){stop("balanced_event_time is only meaningful with result_type == 'dynamic'")} check_arg(balanced_event_time, "numeric scalar", .up = 1) } - if(allow_unbalance_panel == TRUE & control_type %in% c("dr", "reg")){ - stop("fastdid currently only supprts ipw when allowing for unbalanced panels.") - } - if(allow_unbalance_panel == TRUE & !allNA(varycovariatesvar)){ - stop("fastdid currently only supprts time varying covariates when allowing for unbalanced panels.") + if(allow_unbalance_panel == TRUE & control_type == "dr"){ + stop("fastdid does not support DR when allowing for unbalanced panels.") } + # if(allow_unbalance_panel == TRUE & !allNA(varycovariatesvar)){ + # stop("fastdid currently only supprts time varying covariates when not allowing for unbalanced panels.") + # } if(any(covariatesvar %in% varycovariatesvar) & !allNA(varycovariatesvar) & !allNA(covariatesvar)){ stop("time-varying var and invariant var have overlaps.") } @@ -1337,6 +1766,11 @@ validate_argument <- function(dt, p){ stop("clustering and uniform confidence interval only available with bootstrap") } + if(parallel){ + if(.Platform$OS.type != "unix"){stop("parallel option only available on unix sysytems")} + if(!requireNamespace("parallel")){stop("parallel requires the parallel package")} + } + # varname collision varnames <- unlist(p[str_subset(names(p), "var")]) varnames <- varnames[!is.na(varnames)] @@ -1346,7 +1780,7 @@ validate_argument <- function(dt, p){ validate_dt <- function(dt, p){ varnames <- unlist(p[str_ends(names(p), "var")], recursive = TRUE) #get all the argument that ends with "var" - varnames <- varnames[!varnames %in% c(p$timevar, p$unitvar, p$cohortvar)] + varnames <- varnames[!varnames %in% c(p$timevar, p$unitvar, p$cohortvar) & !is.na(varnames) & !is.null(varnames)] #change to int uniquecols <- c("G", "time", "unit") @@ -1362,13 +1796,14 @@ validate_dt <- function(dt, p){ if(p$balanced_event_time > dt[, max(time-G)]){stop("balanced_event_time is larger than the max event time in the data")} } - #doesn't allow missing value for now - for(col in varnames){ - if(is.na(col)){next} - na_obs <- whichNA(dt[[col]]) - if(length(na_obs) != 0){ - warning("missing values detected in ", col, ", removing ", length(na_obs), " observation.") - dt <- dt[!na_obs] + #doesn't allow missing value + if(is.na(p$exper$only_balance_2by2) | !p$exper$only_balance_2by2){ + for(col in varnames){ + na_obs <- whichNA(dt[,get(col)]) + if(length(na_obs) != 0){ + warning("missing values detected in ", col, ", removing ", length(na_obs), " observation.") + dt <- dt[!na_obs] + } } } @@ -1382,6 +1817,7 @@ validate_dt <- function(dt, p){ if(is.na(cov)){next} #check covaraites is not constant if(fnunique(dt[, get(cov)[1], by = "unit"][, V1]) == 1)stop(cov, " have no variation") + if(!(is.numeric(dt[, get(cov)])|is.integer(dt[, get(cov)]))){stop(cov, " is not numeric or integer, do not support fixed effects.")} } } @@ -1402,6 +1838,24 @@ validate_dt <- function(dt, p){ } } + # drop always_treated units + always_treated <- dt[G <= min(time), unique(unit)] + if(length(always_treated) > 0){ + warning(length(always_treated), " units is treated in the first period, dropping them") + dt <- dt[!unit %in% always_treated] + } + + #for double did part + if(!is.na(p$cohortvar2)){ + always_treated <- dt[G2 <= min(time), unique(unit)] + if(length(always_treated) > 0){ + warning(length(always_treated), " units is treated in the first period, dropping them") + dt <- dt[!unit %in% always_treated] + } + } + + + return(dt) } diff --git a/development/playground.R b/development/playground.R index 3f54ba3..2413bcc 100644 --- a/development/playground.R +++ b/development/playground.R @@ -1,17 +1,55 @@ -setwd("~/Documents/GitHub/fastdid") +#setwd("~/Documents/GitHub/fastdid") +library(here) library(devtools) library(tinytest) library(roxygen2) +library(profvis) + +setwd(here()) load_all() +tol <- 1e-2 #allow 1% different between estimates +simdt <- sim_did(1e+05, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = FALSE, seed = 1, + stratify = FALSE, second_cov = TRUE, vary_cov = TRUE, second_cohort = TRUE) +dt <- simdt$dt + +result <-fastdid(data = dt, + timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", + result_type = "group_time") + tol <- 1e-2 #allow 1% different between estimates -simdt <- sim_did(1e+05, 30, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = TRUE, seed = 1, - stratify = FALSE, second_cov = TRUE, vary_cov = TRUE) +simdt <- sim_did(1e+04, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = FALSE, seed = 1, + stratify = FALSE, second_cov = TRUE, vary_cov = TRUE, second_cohort = TRUE) dt <- simdt$dt +result <-fastdid(data = dt, + timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", + result_type = "group_time", exper = list(cohortvar2 = "G2", event_specific = TRUE)) + +tol <- 1e-2 #allow 1% different between estimates +simdt <- sim_did(1e+04, 20, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = FALSE, seed = 1, + stratify = FALSE, second_cov = TRUE, vary_cov = TRUE, second_cohort = TRUE) +dt <- simdt$dt profvis(result <-fastdid(data = dt, - timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", - result_type = "group_time", copy = FALSE)) \ No newline at end of file + timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", + result_type = "group_time", exper = list(cohortvar2 = "G2", event_specific = TRUE))) + +profvis(result <-fastdid(data = dt, + timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", + result_type = "group_time", exper = list(cohortvar2 = "G2", event_specific = TRUE), + parallel = TRUE)) + + +tol <- 1e-2 #allow 1% different between estimates +simdt <- sim_did(1e+06, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = FALSE, seed = 1, + stratify = FALSE, second_cov = TRUE, vary_cov = TRUE, second_cohort = TRUE) +dt <- simdt$dt + +profvis( +result <-fastdid(data = dt, + timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", covariatesvar = c("x", "x2"), control_type = "ipw", + result_type = "group_time") +) diff --git a/development/profile.R b/development/profile.R index 97deffd..be9ae84 100644 --- a/development/profile.R +++ b/development/profile.R @@ -1,22 +1,47 @@ -profvis(run_fastdid(dt)) +library(profvis) -profvis(run_did(dt)) +setDTthreads(0) +options(mc.cores = getDTthreads()) + +#simple +simdt <- sim_did(1e+06, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = TRUE, seed = 1, + stratify = FALSE, second_cov = TRUE, vary_cov = TRUE) +dt <- simdt$dt profvis({ result <- fastdid(dt, - timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time", - covariatesvar = "x") + timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time") }) -#covariates multiple outcome +gc() + +#simple parallel +simdt <- sim_did(1e+06, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = TRUE, seed = 1, + stratify = FALSE, second_cov = TRUE, vary_cov = TRUE) +dt <- simdt$dt + profvis({ result <- fastdid(dt, - timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = c("y", "y2"), result_type = "group_time", - covariatesvar = "x") + timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time", parallel = TRUE) }) -bm_time <- microbenchmark(result <- fastdid(dt, - timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time", - covariatesvar = "x"), - times = 1) -message(bm_time$time/10^9) \ No newline at end of file +#double +simdt <- sim_did(1e+06, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = TRUE, seed = 1, + stratify = FALSE, second_cov = TRUE, vary_cov = TRUE, second_cohort = TRUE) +dt <- simdt$dt + +profvis({ + result <- fastdid(dt, + timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time", + cohortvar2 = "G2") +}) + +#double parallel +simdt <- sim_did(1e+06, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = TRUE, seed = 1, + stratify = FALSE, second_cov = TRUE, vary_cov = TRUE) +dt <- simdt$dt + +profvis({ + result <- fastdid(dt, + timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time") +}) \ No newline at end of file diff --git a/development/release_check.R b/development/release_check.R index 6a4bddc..3f3df78 100644 --- a/development/release_check.R +++ b/development/release_check.R @@ -1,4 +1,4 @@ -setwd("~/Documents/GitHub/fastdid") + library(devtools) library(tinytest) @@ -12,7 +12,10 @@ load_all() run_test_dir() #before release +build(path = "development") check() +cov <- package_coverage(function_exclusions = c("sim_did", "locked")) + source("development/build_source.R") -build_source("0.9.4") \ No newline at end of file +build_source("0.9.9") \ No newline at end of file diff --git a/inst/tinytest/test_1_fastdid.R b/inst/tinytest/test_1_fastdid.R index 0116b11..94dc448 100644 --- a/inst/tinytest/test_1_fastdid.R +++ b/inst/tinytest/test_1_fastdid.R @@ -45,10 +45,17 @@ expect_silent(fastdid(dt[G != 3], timevar = "time", cohortvar = "G", unitvar = " expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", alpha = 0.01), info = "alternative alpha") -dt2 <- copy(dt) -setnames(dt2, c("time", "G", "unit"), c("t", "g", "u")) -expect_silent(fastdid(dt2, timevar = "t", cohortvar = "g", unitvar = "u",outcomevar = "y", result_type = "group_time", alpha = 0.01), - info = "other column names") +#get full result +full_res <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = c("y", "y2"), result_type = "group_time", full = TRUE) +expect_equal(names(full_res), c("call", "estimate", "gt_estimate", "agg_inf_func", "agg_weight_matrix"), + info = "full result") + +units <- dt[, unique(unit)] +weights <- data.table::data.table(unit = units, w = rnorm(length(units), 1, 1)) +dt2 <- dt |> merge(weights, by = "unit") +expect_silent(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + weightvar = "w"), + info = "weighted") # bootstrap part ------------------------ @@ -83,10 +90,6 @@ expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",ou base_period = "varying"), info = "baseperiod vary") -#plot -expect_silent(plot_did_dynamics(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "dynamic")), - info = "simple plot") - # dt that needs adjustment --------------------------- base_result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time") @@ -94,10 +97,10 @@ base_result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",o expect_equal(fastdid(dt[nrow(dt):1], timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time"), base_result, info = "reversed") -dt2 <- copy(dt) +dt2 <- data.table::copy(dt) dt2[, time := time*2 + 3] dt2[, G := G*2 + 3] -base_result2 <- copy(base_result) +base_result2 <- data.table::copy(base_result) base_result2[, cohort := cohort*2 + 3] base_result2[, time := time*2 + 3] @@ -107,7 +110,7 @@ expect_equal(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",ou # unbalanced panel ---------------------------------------------------- -dt2 <- copy(dt) +dt2 <- data.table::copy(dt) keep <- sample(c(rep(TRUE, 19),FALSE), dt2[,.N], TRUE) dt2 <- dt2[keep] @@ -116,43 +119,24 @@ expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",ou allow_unbalance_panel = TRUE), info = "balance panel false but dt is balance") -# throw error / warning at problematic dt ------------------------- - -expect_error(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", - boot = FALSE, clustervar = "x"), - info = "clustered but no boot") - -expect_error(fastdid(dt[time != 3], timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time"), - info = "missing time") - -expect_warning(fastdid(dt[time != 3 | unit != 20], timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time"), - info = "non-balanced panel, missing") - -extra_row <- dt[unit == 20 & time == 3] -expect_error(fastdid(rbind(dt,extra_row), timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time"), - info = "non-balanced panel, extra") - -dt2 <- copy(dt) -dt2[, x := 1] -expect_error(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time", - covariatesvar = "x"), - info = "covariates with no variation") - -dt2 <- copy(dt) -dt2[unit == 1 & time < 5, x := 3] -expect_warning(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time", - covariatesvar = "x"), - info = "time varying covariates is warned") - -dt2 <- copy(dt) -dt2[unit == 1 & time == 4, time := NA] -dt2[time == 3 & unit > 30, y := NA] -expect_warning(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time"), - info = "missing values") - -#right error message -errormes <- "Error: in fastdid(dt, timevar = \"time\", cohortvar...:\n outcomevar must be NA or a character vector which are all names of columns from the dataset. Problem: i) is.na returns\n FALSE, or ii) no match was found for 'zz'.\n" -mes <- as.character(tryCatch(fastdid(dt, timevar = "time", cohortvar = "g", unitvar = "unit", outcomevar = "zz", result_type = "group_time"), - error = function(e){return(e)})) -expect_equal(errormes, mes, - info = "wrong col name") +# internal behavor ------------------------- + + + +#make sure dt is not copied if copy == FALSE +dtc <- data.table::copy(dt) +address <- tracemem(dtc) |> stringr::str_remove_all(">|<") +out <- capture.output(fastdid(dtc, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time", copy = FALSE)) +#if a copy happen there will be a message at the start, so out[1] won't be the header. +expect_true(!stringr::str_detect(stringr::str_flatten_comma(out), address), + info = "no unintentional copy") + +dt2 <- data.table::copy(dt) +data.table::setnames(dt2, c("time", "G", "unit"), c("t", "g", "u")) +expect_silent(fastdid(dt2, timevar = "t", cohortvar = "g", unitvar = "u",outcomevar = "y", result_type = "group_time", alpha = 0.01), + info = "other column names") + +if(.Platform$OS.type == "unix" & at_home() & requireNamespace("parallel")){ + expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", parallel = TRUE), + info = "parallel") +} diff --git a/inst/tinytest/test_2_compare_est.R b/inst/tinytest/test_2_compare_est.R new file mode 100644 index 0000000..117ca9c --- /dev/null +++ b/inst/tinytest/test_2_compare_est.R @@ -0,0 +1,371 @@ +# setup ---------------------- + +library(did) +#set.seed(1) +tol <- 1e-2 #allow 1% different between estimates +simdt <- sim_did(1e+03, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = TRUE, + seed = 1, stratify = FALSE, + second_cov = TRUE) +dt <- simdt$dt + +est_diff_ratio <- function(result, did_result){ + did_result_dt <- data.table::data.table(cohort = did_result$group, time = did_result$t, did_att = did_result$att, did_se = did_result$se) + did_result_dt <- did_result_dt[did_att != 0] + compare <- did_result_dt |> merge(result, by = c("cohort", "time"), all = TRUE) + att_diff_per <- compare[, sum(abs(did_att-att), na.rm = TRUE)/sum(did_att, na.rm = TRUE)] + se_diff_per <- compare[, sum(abs(did_se-se), na.rm = TRUE)/sum(did_se, na.rm = TRUE)] + return(c(att_diff_per , se_diff_per)) +} + +est_diff_ratio_agg <- function(result, did_result){ + names(result) <- c("target", "att", "se", "outcome", "atub", "atlb") + did_result_dt <- data.table::data.table(target = did_result$egt, did_att = did_result$att.egt, did_se = did_result$se.egt) + compare <- did_result_dt |> merge(result, by = c("target"), all = TRUE) + att_diff_per <- compare[, sum(abs(did_att-att), na.rm = TRUE)/sum(did_att, na.rm = TRUE)] + se_diff_per <- compare[, sum(abs(did_se-se), na.rm = TRUE)/sum(did_se, na.rm = TRUE)] + return(c(att_diff_per , se_diff_per)) +} + +# simple ------------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time") +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + #xformla = ~x, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "simple") +rm(result, did_result) + +# simple notyet only ---------------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", control_option = "notyet") +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt[!is.infinite(G)],base_period = "universal",est_method = "ipw",cband = FALSE, + #xformla = ~x, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "notyet") +rm(result, did_result) + +# never ------------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + control_option = "never") +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + #xformla = ~x, + control_group = "nevertreated", + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "never treated") +rm(result, did_result) + +# covariates ------------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + control_type = "ipw", + covariatesvar = c("x", "x2")) +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + xformla = ~x+x2, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "covariates ipw") +rm(result, did_result) + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + control_type = "reg", + covariatesvar = c("x", "x2")) +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "reg",cband = FALSE, + xformla = ~x+x2, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "covariates reg") +rm(result, did_result) + + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + control_type = "dr", + covariatesvar = c("x")) +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "dr",cband = FALSE, + xformla = ~x, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "covariates dr") +rm(result, did_result) + +# anticipation ----------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + control_type = "dr", + covariatesvar = c("x"), anticipation = 2) +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "dr",cband = FALSE, + xformla = ~x, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE, anticipation = 2) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "anticipation") +rm(result, did_result) + +# alternative base period ----------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + control_type = "dr", + covariatesvar = c("x"), base_period = "varying") +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "varying",est_method = "dr",cband = FALSE, + xformla = ~x, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "varying base period") +rm(result, did_result) + +# bootstrap -------------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + boot = TRUE, biters = 10000) +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + control_group = "notyettreated", + clustervars = NULL, + bstrap = TRUE, biters = 10000) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "bootstrap") +rm(result, did_result) + +# bootstrap clustered -------------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + clustervar = "x", + boot = TRUE, biters = 10000) +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + control_group = "notyettreated", + clustervars = "x", + bstrap = TRUE, biters = 10000) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "bootstrap clustered") +rm(result, did_result) + +#multiple outcome ------------------------------------------ + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = c("y", "y2"), result_type = "group_time") +did_result <- did::att_gt(yname = "y2",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + #xformla = ~x, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result[outcome == "y2"], did_result), c(0,0), tolerance = tol, + info = "simple multiple outcome") +rm(result, did_result) + + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = c("y", "y2"), result_type = "group_time", + covariatesvar = "x") +did_result <- did::att_gt(yname = "y2",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + xformla = ~x, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result[outcome == "y2"], did_result), c(0,0), tolerance = tol, + info = "multiple outcome with covariates") +rm(result, did_result) + + + +# dynamic ------------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "dynamic") +did_result_gt <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + #xformla = ~x, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) +did_result <- did::aggte(did_result_gt, type = "dynamic") +expect_equal(est_diff_ratio_agg(result, did_result), c(0,0), tolerance = tol, + info = "dynamic") +rm(result, did_result) + +# group ------------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group") +result <- result[type == "post",] +result[, type := NULL] +did_result_gt <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + #xformla = ~x, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) +did_result <- did::aggte(did_result_gt, type = "group") +expect_equal(est_diff_ratio_agg(result, did_result), c(0,0), tolerance = tol, + info = "group") +rm(result, did_result) + +# time -------------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "time") +result <- result[type == "post",] +result[, type := NULL] +did_result_gt <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + #xformla = ~x, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) +did_result <- did::aggte(did_result_gt, type = "calendar") +expect_equal(est_diff_ratio_agg(result, did_result), c(0,0), tolerance = tol, + info = "time") +rm(result, did_result) + +# bootstrap clustered -------------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "simple") +did_result_gt <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + #xformla = ~x, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) +did_result <- did::aggte(did_result_gt, type = "simple") +diffs <- c((did_result$overall.att-result[type == "post", att])/did_result$overall.att, + (did_result$overall.se-result[type == "post", se])/did_result$overall.se) +expect_equal(diffs, c(0,0), tolerance = tol, + info = "simple group") +rm(result, did_result) + +# dynamic with balanced cohort ------------------- + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "dynamic", + balanced_event_time = 4) +did_result_gt <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + #xformla = ~x, + control_group = "notyettreated", + clustervars = NULL, + bstrap = FALSE) +did_result <- did::aggte(did_result_gt, type = "dynamic", + balance_e = 4) +expect_equal(est_diff_ratio_agg(result, did_result), c(0,0), tolerance = tol, + info = "dynamic with balanced cohort") +rm(result, did_result) + +# unbalanced ------------------------------ + +dt2 <- data.table::copy(dt) +keep <- sample(c(rep(TRUE, 15),FALSE), dt2[,.N], TRUE) +dt2 <- dt2[keep] + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + allow_unbalance_panel = TRUE) +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, + #xformla = ~x, + control_group = "notyettreated", + allow_unbalanced_panel = TRUE, + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "unbalanced method, balance panel, simple") +rm(result, did_result) + +result <- fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + allow_unbalance_panel = TRUE) +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt2,base_period = "universal",est_method = "ipw",cband = FALSE, + #xformla = ~x, + control_group = "notyettreated", + allow_unbalanced_panel = TRUE, + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "unbalanced method, unbalance panel, simple") +rm(result, did_result) + +# unbalanced, ipw ------------------------------------------ + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + control_type = "ipw", + covariatesvar = c("x", "x2"), + allow_unbalance_panel = TRUE) +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",cband = FALSE, + est_method = "ipw",xformla = ~x+x2, + control_group = "notyettreated", + allow_unbalanced_panel = TRUE, + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "unbalanced method, balance panel, ipw") +rm(result, did_result) + +result <- fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", + result_type = "group_time", + control_type = "ipw", + covariatesvar = c("x", "x2"), + allow_unbalance_panel = TRUE) +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt2, + base_period = "universal",cband = FALSE, + est_method = "ipw",xformla = ~x+x2, + control_group = "notyettreated", + allow_unbalanced_panel = TRUE, + clustervars = NULL, + bstrap = FALSE) + +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "unbalanced method,unbalanced panel, data unbalanced, ipw") +rm(result, did_result) + +# unbalanced, reg -------------------------------------------------------------------------- + +# reg and dr unbalanced will be used after the OR influence issues is resolved + +result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + control_type = "reg", + covariatesvar = c("x", "x2"), + allow_unbalance_panel = TRUE) +did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",cband = FALSE, + est_method = "reg",xformla = ~x+x2, + control_group = "notyettreated", + allow_unbalanced_panel = TRUE, + clustervars = NULL, + bstrap = FALSE) +expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, + info = "unbalanced panel reg, but data balanced") +rm(result, did_result) + +# estimate not the same, see coverage test +# result <- fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", +# control_type = "reg", +# covariatesvar = c("x", "x2"), +# allow_unbalance_panel = TRUE) +# did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt2 ,base_period = "universal",cband = FALSE, +# est_method = "reg",xformla = ~x+x2, +# control_group = "notyettreated", +# allow_unbalanced_panel = TRUE, +# clustervars = NULL, +# bstrap = FALSE) +# +# expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, +# info = "unbalanced panel reg, but data unbalanced") +# rm(result, did_result) + +#clean up environment ------------------------- +rm(dt, simdt, tol, est_diff_ratio_agg) +detach("package:did", unload=TRUE) \ No newline at end of file diff --git a/inst/tinytest/test_2_compare_gt.R b/inst/tinytest/test_2_compare_gt.R deleted file mode 100644 index fa18ec6..0000000 --- a/inst/tinytest/test_2_compare_gt.R +++ /dev/null @@ -1,185 +0,0 @@ -# setup ---------------------- - -library(did) -#set.seed(1) -tol <- 1e-2 #allow 1% different between estimates -simdt <- sim_did(1e+03, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = TRUE, - seed = 1, stratify = FALSE, - second_cov = TRUE) -dt <- simdt$dt - -est_diff_ratio <- function(result, did_result){ - did_result_dt <- data.table(cohort = did_result$group, time = did_result$t, did_att = did_result$att, did_se = did_result$se) - did_result_dt <- did_result_dt[did_att != 0] - compare <- did_result_dt |> merge(result, by = c("cohort", "time"), all = TRUE) - att_diff_per <- compare[, sum(abs(did_att-att), na.rm = TRUE)/sum(did_att, na.rm = TRUE)] - se_diff_per <- compare[, sum(abs(did_se-se), na.rm = TRUE)/sum(did_se, na.rm = TRUE)] - return(c(att_diff_per , se_diff_per)) -} - -# simple ------------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time") -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - #xformla = ~x, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "simple") -rm(result, did_result) - -# simple notyet only ---------------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", control_option = "notyet") -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt[!is.infinite(G)],base_period = "universal",est_method = "ipw",cband = FALSE, - #xformla = ~x, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "notyet") -rm(result, did_result) - -# never ------------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", - control_option = "never") -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - #xformla = ~x, - control_group = "nevertreated", - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "never treated") -rm(result, did_result) - -# covariates ------------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", - control_type = "ipw", - covariatesvar = c("x", "x2")) -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - xformla = ~x+x2, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "covariates ipw") -rm(result, did_result) - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", - control_type = "reg", - covariatesvar = c("x", "x2")) -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "reg",cband = FALSE, - xformla = ~x+x2, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "covariates reg") -rm(result, did_result) - - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", - control_type = "dr", - covariatesvar = c("x")) -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "dr",cband = FALSE, - xformla = ~x, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "covariates dr") -rm(result, did_result) - -# anticipation ----------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", - control_type = "dr", - covariatesvar = c("x"), anticipation = 2) -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "dr",cband = FALSE, - xformla = ~x, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE, anticipation = 2) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "anticipation") -rm(result, did_result) - -# alternative base period ----------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", - control_type = "dr", - covariatesvar = c("x"), base_period = "varying") -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "varying",est_method = "dr",cband = FALSE, - xformla = ~x, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "varying base period") -rm(result, did_result) - -# bootstrap -------------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", - boot = TRUE, biters = 10000) -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - control_group = "notyettreated", - clustervars = NULL, - bstrap = TRUE, biters = 10000) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "bootstrap") -rm(result, did_result) - -# bootstrap clustered -------------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", - clustervar = "x", - boot = TRUE, biters = 10000) -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - control_group = "notyettreated", - clustervars = "x", - bstrap = TRUE, biters = 10000) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "bootstrap clustered") -rm(result, did_result) - -#multiple outcome ------------------------------------------ - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = c("y", "y2"), result_type = "group_time") -did_result <- did::att_gt(yname = "y2",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - #xformla = ~x, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result[outcome == "y2"], did_result), c(0,0), tolerance = tol, - info = "simple multiple outcome") -rm(result, did_result) - - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = c("y", "y2"), result_type = "group_time", - covariatesvar = "x") -did_result <- did::att_gt(yname = "y2",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - xformla = ~x, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result[outcome == "y2"], did_result), c(0,0), tolerance = tol, - info = "multiple outcome with covariates") -rm(result, did_result) -#clean up environment -detach("package:did", unload=TRUE) \ No newline at end of file diff --git a/inst/tinytest/test_3_compare_agg.R b/inst/tinytest/test_3_compare_agg.R deleted file mode 100644 index b58639b..0000000 --- a/inst/tinytest/test_3_compare_agg.R +++ /dev/null @@ -1,91 +0,0 @@ -# setup ---------------------- - -library(did) - -tol <- 1e-2 #allow 1% different between estimates -simdt <- sim_did(1e03, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = TRUE, seed = 1, stratify = FALSE) -dt <- simdt$dt - -est_diff_ratio_agg <- function(result, did_result){ - names(result) <- c("target", "att", "se", "outcome", "atub", "atlb") - did_result_dt <- data.table(target = did_result$egt, did_att = did_result$att.egt, did_se = did_result$se.egt) - compare <- did_result_dt |> merge(result, by = c("target"), all = TRUE) - att_diff_per <- compare[, sum(abs(did_att-att), na.rm = TRUE)/sum(did_att, na.rm = TRUE)] - se_diff_per <- compare[, sum(abs(did_se-se), na.rm = TRUE)/sum(did_se, na.rm = TRUE)] - return(c(att_diff_per , se_diff_per)) -} - -# dynamic ------------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "dynamic") -did_result_gt <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - #xformla = ~x, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) -did_result <- did::aggte(did_result_gt, type = "dynamic") -expect_equal(est_diff_ratio_agg(result, did_result), c(0,0), tolerance = tol, - info = "dynamic") -rm(result, did_result) - -# group ------------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group") -did_result_gt <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - #xformla = ~x, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) -did_result <- did::aggte(did_result_gt, type = "group") -expect_equal(est_diff_ratio_agg(result, did_result), c(0,0), tolerance = tol, - info = "group") -rm(result, did_result) - -# time -------------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "time") -did_result_gt <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - #xformla = ~x, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) -did_result <- did::aggte(did_result_gt, type = "calendar") -expect_equal(est_diff_ratio_agg(result, did_result), c(0,0), tolerance = tol, - info = "time") -rm(result, did_result) - -# bootstrap clustered -------------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "simple") -did_result_gt <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - #xformla = ~x, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) -did_result <- did::aggte(did_result_gt, type = "simple") -diffs <- c((did_result$overall.att-result[type == "post", att])/did_result$overall.att, - (did_result$overall.se-result[type == "post", se])/did_result$overall.se) -expect_equal(diffs, c(0,0), tolerance = tol, - info = "simple group") -rm(result, did_result) - -# dynamic with balanced cohort ------------------- - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "dynamic", - balanced_event_time = 4) -did_result_gt <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - #xformla = ~x, - control_group = "notyettreated", - clustervars = NULL, - bstrap = FALSE) -did_result <- did::aggte(did_result_gt, type = "dynamic", - balance_e = 4) -expect_equal(est_diff_ratio_agg(result, did_result), c(0,0), tolerance = tol, - info = "dynamic with balanced cohort") -rm(result, did_result) - - -#clean up environment ------------------------- -rm(dt, simdt, tol, est_diff_ratio_agg) -detach("package:did", unload=TRUE) - diff --git a/inst/tinytest/test_3_message.R b/inst/tinytest/test_3_message.R new file mode 100644 index 0000000..871870c --- /dev/null +++ b/inst/tinytest/test_3_message.R @@ -0,0 +1,83 @@ +# setup -------------------------------------------------------- + +tol <- 1e-2 #allow 1% different between estimates +simdt <- sim_did(1e+02, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = TRUE, seed = 1, + stratify = FALSE, second_cov = TRUE, vary_cov = TRUE) +dt <- simdt$dt + +# wrong dt --------------- + +expect_warning(fastdid(as.data.frame(dt), timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time"), + info = "data.frame") + +dt2 <- data.table::copy(dt) +dt2 <- dt2[!is.infinite(G)] + +expect_warning(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time"), + info = "no never") + +expect_error(fastdid(dt[time != 3], timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time"), + info = "missing time") + +expect_warning(fastdid(dt[time != 3 | unit != 20], timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time"), + info = "non-balanced panel, missing") + +extra_row <- dt[unit == 20 & time == 3] +expect_error(fastdid(rbind(dt,extra_row), timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time"), + info = "non-balanced panel, extra") + +dt2 <- data.table::copy(dt) +dt2[, x := 1] +expect_error(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time", + covariatesvar = "x"), + info = "covariates with no variation") + +dt2 <- data.table::copy(dt) +dt2[G == Inf, x := x+10, by = "unit"] +expect_warning(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time", + covariatesvar = "x", control_type = "ipw", control_option = "never"), + info = "covariates with no overlap") + +# warning for problematic dt ---------- + +dt2 <- data.table::copy(dt) +dt2[unit == 1 & time < 5, x := 3] +expect_warning(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time", + covariatesvar = "x"), + info = "time varying covariates is warned") + +dt2 <- data.table::copy(dt) +dt2[unit == 1 & time == 4, time := NA] +dt2[time == 3 & unit > 30, y := NA] +expect_warning(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time"), + info = "missing values") + + + +# already_treated +dt_at <- data.table::copy(dt) +dt_at <- dt_at[time > min(G)] +expect_warning(fastdid(dt_at, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time"), + info = "already treated") + +dt2 <- data.table::copy(dt) +dt2[, x := as.factor(round(x))] +expect_error(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time", covariatesvar = "x"), + info = "only numeric covariate") + +# wrong arguments ------------------------------------------------- + +expect_error(fastdid::fastdid(dt, timevar = "time", cohortvar = "g", unitvar = "unit", outcomevar = "zz", result_type = "group_time"), + info = "wrong col name") + +expect_error(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + boot = FALSE, clustervar = "x"), + info = "clustered but no boot") + +expect_error(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + boot = FALSE, clustervar = "x"), + info = "clustered but no boot") + +expect_error(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + varycovariatesvar = "x2", allow_unbalanced_panel = TRUE), + info = "unbalanced but vary") \ No newline at end of file diff --git a/inst/tinytest/test_4_double.R b/inst/tinytest/test_4_double.R new file mode 100644 index 0000000..efd4f57 --- /dev/null +++ b/inst/tinytest/test_4_double.R @@ -0,0 +1,66 @@ +tol <- 1e-2 #allow 1% different between estimates +simdt <- sim_did(1e+02, 5, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = FALSE, seed = 1, + stratify = FALSE, second_cov = TRUE, vary_cov = TRUE, second_cohort = TRUE) +dt <- simdt$dt + +# basic tests ------------------------------------------------------------------ + +expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + cohortvar2 = "G2"), + info = "double call") + +expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + cohortvar2 = "G2", event_specific = FALSE), + info = "double, combined effect") + +expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + cohortvar2 = "G2", allow_unbalance_panel = TRUE), + info = "double, unbalanced panel") + +expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + cohortvar2 = "G2", control_option = "never"), + info = "double, only never") + +expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + cohortvar2 = "G2", double_control_option = "never"), + info = "double call, double never") + +expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + cohortvar2 = "G2", double_control_option = "notyet"), + info = "double cal, double notyet") + +expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + covariatesvar = "x", + cohortvar2 = "G2"), + info = "double, covariates") + +expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_group_time", + cohortvar2 = "G2"), + info = "double, ggt") + +expect_silent(fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "dynamic_stagger", + cohortvar2 = "G2"), + info = "double, dynamic_sq") + +dt2 <- data.table::copy(dt) +keep <- sample(c(rep(TRUE, 15),FALSE), dt2[,.N], TRUE) +dt2 <- dt2[keep] +expect_silent(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + cohortvar2 = "G2", allow_unbalance_panel = TRUE), + info = "double, unbalanced panel, unbalanced data") + +dt2 <- data.table::copy(dt) +dt2[, time := time*2 + 3] +dt2[, G := G*2 + 3] +dt2[, G2 := G2*2 + 3] + +expect_silent(fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + cohortvar2 = "G2"), + info = "time offset") + +if(.Platform$OS.type == "unix" & at_home() & requireNamespace("parallel")){ + expect_silent(fastdid(dt, timevar = "time",cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + cohortvar2 = "G2", parallel = TRUE), + info = "parallel double") +} + diff --git a/inst/tinytest/test_4_unbalanced.R b/inst/tinytest/test_4_unbalanced.R deleted file mode 100644 index ddfceb1..0000000 --- a/inst/tinytest/test_4_unbalanced.R +++ /dev/null @@ -1,151 +0,0 @@ -# setup ---------------------- - -library(did) - -tol <- 1e-2 #allow 1% different between estimates -simdt <- sim_did(1e03, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = TRUE, seed = 1, stratify = FALSE, - second_cov = TRUE) -dt <- simdt$dt -est_diff_ratio <- function(result, did_result){ - did_result_dt <- data.table(cohort = did_result$group, time = did_result$t, did_att = did_result$att, did_se = did_result$se) - compare <- did_result_dt |> merge(result, by = c("cohort", "time"), all = TRUE) - att_diff_per <- compare[, sum(abs(did_att-att), na.rm = TRUE)/sum(did_att, na.rm = TRUE)] - se_diff_per <- compare[, sum(abs(did_se-se), na.rm = TRUE)/sum(did_se, na.rm = TRUE)] - return(c(att_diff_per , se_diff_per)) -} -# unbalanced ------------------------------ - -dt2 <- copy(dt) -keep <- sample(c(rep(TRUE, 15),FALSE), dt2[,.N], TRUE) -dt2 <- dt2[keep] - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", - allow_unbalance_panel = TRUE) -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",est_method = "ipw",cband = FALSE, - #xformla = ~x, - control_group = "notyettreated", - allow_unbalanced_panel = TRUE, - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "unbalanced method, balance panel, simple") -rm(result, did_result) - -result <- fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", - allow_unbalance_panel = TRUE) -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt2,base_period = "universal",est_method = "ipw",cband = FALSE, - #xformla = ~x, - control_group = "notyettreated", - allow_unbalanced_panel = TRUE, - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "unbalanced method, unbalance panel, simple") -rm(result, did_result) - -# unbalanced method with balanced panel ------------------------------------------ - -result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", - control_type = "ipw", - covariatesvar = c("x", "x2"), - allow_unbalance_panel = TRUE) -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",cband = FALSE, - est_method = "ipw",xformla = ~x+x2, - control_group = "notyettreated", - allow_unbalanced_panel = TRUE, - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "unbalanced method, balance panel, ipw") -rm(result, did_result) - -result <- fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", - result_type = "group_time", - control_type = "ipw", - covariatesvar = c("x", "x2"), - allow_unbalance_panel = TRUE) -did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt2, - base_period = "universal",cband = FALSE, - est_method = "ipw",xformla = ~x+x2, - control_group = "notyettreated", - allow_unbalanced_panel = TRUE, - clustervars = NULL, - bstrap = FALSE) - -expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, - info = "unbalanced method,unbalanced panel, data unbalanced, ipw") -rm(result, did_result) - -# reg and dr unbalanced will be used after the OR influence issues is resolved - -# result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", -# control_type = "reg", -# covariatesvar = c("x", "x2"), -# allow_unbalance_panel = TRUE) -# did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",cband = FALSE, -# est_method = "reg",xformla = ~x+x2, -# control_group = "notyettreated", -# allow_unbalanced_panel = TRUE, -# clustervars = NULL, -# bstrap = FALSE) -# -# expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, -# info = "unbalanced panel reg, but data balanced") -# rm(result, did_result) -# -# result <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", -# control_type = "dr", -# covariatesvar = c("x", "x2"), -# allow_unbalance_panel = TRUE) -# did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt,base_period = "universal",cband = FALSE, -# est_method = "dr",xformla = ~x+x2, -# control_group = "notyettreated", -# allow_unbalanced_panel = TRUE, -# clustervars = NULL, -# bstrap = FALSE) -# -# expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, -# info = "unbalanced panel dr, but data balanced") -# rm(result, did_result) -# -# -# # unbalanced panel unbalanced method ------------------------------------------------------------------ -# -# -# -# -# result <- fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", -# control_type = "reg", -# covariatesvar = c("x", "x2"), -# allow_unbalance_panel = TRUE) -# did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt2,base_period = "universal",cband = FALSE, -# est_method = "reg",xformla = ~x+x2, -# control_group = "notyettreated", -# allow_unbalanced_panel = TRUE, -# clustervars = NULL, -# bstrap = FALSE) -# -# expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, -# info = "unbalanced panel reg, data unbalanced") -# rm(result, did_result) -# -# result <- fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", -# control_type = "dr", -# covariatesvar = c("x", "x2"), -# allow_unbalance_panel = TRUE) -# did_result <- did::att_gt(yname = "y",gname = "G",idname = "unit",tname = "time",data = dt2,base_period = "universal",cband = FALSE, -# est_method = "dr",xformla = ~x+x2, -# control_group = "notyettreated", -# allow_unbalanced_panel = TRUE, -# clustervars = NULL, -# bstrap = FALSE) -# -# expect_equal(est_diff_ratio(result, did_result), c(0,0), tolerance = tol, -# info = "unbalanced panel dr, data unbalanced") -# rm(result, did_result) - -#clean up environment -detach("package:did", unload=TRUE) \ No newline at end of file diff --git a/inst/tinytest/test_5_generics.R b/inst/tinytest/test_5_generics.R new file mode 100644 index 0000000..d046744 --- /dev/null +++ b/inst/tinytest/test_5_generics.R @@ -0,0 +1,17 @@ +# setup -------------------------------------------------------- + +tol <- 1e-2 #allow 1% different between estimates +simdt <- sim_did(1e+02, 10, cov = "cont", hetero = "all", balanced = TRUE, second_outcome = TRUE, seed = 1, + stratify = FALSE, second_cov = TRUE, vary_cov = TRUE) +dt <- simdt$dt + +# did plot --------------- + +fd <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "dynamic") +expect_inherits(plot_did_dynamics(fd), "ggplot", info = "plot dynamics") + +fd <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group") +expect_inherits(plot_did_dynamics(fd, margin = "group"), "ggplot", info = "plot group") + +fd <- fastdid(dt, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "time") +expect_inherits(plot_did_dynamics(fd, margin = "time"), "ggplot", info = "plot time") diff --git a/inst/tinytest/test_99_coverage.R b/inst/tinytest/test_99_coverage.R new file mode 100644 index 0000000..ecf3f49 --- /dev/null +++ b/inst/tinytest/test_99_coverage.R @@ -0,0 +1,69 @@ +#setup + +#takes a lot of time so only at home +if(at_home() & .Platform$OS.type == "unix" & requireNamespace("parallel")){ + + #gotta go fast + setDTthreads(0) + options(mc.cores = getDTthreads()) + + + cov_error_margin <- 0.01 + iter <- 200 + + #doubledid boot + covers <- c() + pb <- progress::progress_bar$new(total = iter) + for(i in 1:iter){ + simdt_c <- sim_did(1e+04, 5, hetero = "all", balanced = TRUE, second_outcome = TRUE, second_cohort = TRUE) + dt_c <- simdt_c$dt + result <- fastdid(dt_c, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + cohortvar2 = "G2", event_specific = TRUE, alpha = 0.05, boot = TRUE, cband = FALSE, parallel = TRUE) + cov <- merge(simdt_c$att[event == 1], result, by.x = c("G", "time"), by.y = c("cohort", "time")) + cov[, cover := as.numeric(attgt < att_ciub & attgt > att_cilb)] + covers <- c(covers, cov[, cover]) + pb$tick() + } + mc <- mean(covers) + # the standard error is sqrt(0.95*0.05/3200), so may wrongfully reject about 1% of time + expect_true(mc > 0.95-cov_error_margin & mc < 0.95+cov_error_margin, info = "double did, bootstrap coverage") + + rm(mc, covers, simdt_c, dt_c, cov) + + #double did, analytical + pb <- progress::progress_bar$new(total = iter) + covers <- c() + for(i in 1:iter){ + simdt_c <- sim_did(1e+04, 5, hetero = "all", balanced = TRUE, second_outcome = TRUE, second_cohort = TRUE) + dt_c <- simdt_c$dt + result <- fastdid(dt_c, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + cohortvar2 = "G2", event_specific = TRUE, alpha = 0.05, boot = FALSE, cband = FALSE, parallel = TRUE) + cov <- merge(simdt_c$att[event == 1], result, by.x = c("G", "time"), by.y = c("cohort", "time")) + cov[, cover := as.numeric(attgt < att_ciub & attgt > att_cilb)] + covers <- c(covers, cov[, cover]) + pb$tick() + } + mc <- mean(covers) + # the standard error is sqrt(0.95*0.05/3200), so may wrongfully reject about 1% of time + expect_true(mc > 0.95-cov_error_margin & mc < 0.95+cov_error_margin, info = "double did, analytical coverage") + + #unbalanced panel OR + pb <- progress::progress_bar$new(total = iter) + covers <- c() + for(i in 1:iter){ + simdt_c <- sim_did(1e+03, 10, hetero = "all", balanced = TRUE, second_outcome = TRUE, cov = "cont") + dt <- simdt_c$dt + dt2 <- data.table::copy(dt) + keep <- sample(c(rep(TRUE, 15),FALSE), dt2[,.N], TRUE) + dt2 <- dt2[keep] + result <- fastdid(dt2, timevar = "time", cohortvar = "G", unitvar = "unit",outcomevar = "y", result_type = "group_time", + allow_unbalance_panel = TRUE, covariatesvar = "x", control_type = "reg", parallel = TRUE) + cov <- merge(simdt_c$att, result, by.x = c("G", "time"), by.y = c("cohort", "time")) + cov[, cover := as.numeric(attgt < att_ciub & attgt > att_cilb)] + covers <- c(covers, cov[, cover]) + pb$tick() + } + mc <- mean(covers) + # the standard error is sqrt(0.95*0.05/3200), so may wrongfully reject about 1% of time + expect_true(mc > 0.95-cov_error_margin & mc < 0.95+cov_error_margin, info = "unbalanced panel, regression") +} diff --git a/man/fastdid.Rd b/man/fastdid.Rd index 27edb64..7bf04bc 100644 --- a/man/fastdid.Rd +++ b/man/fastdid.Rd @@ -27,65 +27,87 @@ fastdid( validate = TRUE, anticipation = 0, base_period = "universal", - exper = NULL + exper = NULL, + full = FALSE, + parallel = FALSE, + cohortvar2 = NA, + event_specific = TRUE, + double_control_option = "both" ) } \arguments{ -\item{data}{A data.table containing the panel data.} +\item{data}{data.table, the dataset.} -\item{timevar}{The name of the time variable.} +\item{timevar}{character, name of the time variable.} -\item{cohortvar}{The name of the cohort (group) variable.} +\item{cohortvar}{character, name of the cohort (group) variable.} -\item{unitvar}{The name of the unit (id) variable.} +\item{unitvar}{character, name of the unit (id) variable.} -\item{outcomevar}{The name of the outcome variable.} +\item{outcomevar}{character vector, name(s) of the outcome variable(s).} -\item{control_option}{The control units used for the DiD estimates. Default is "both".} +\item{control_option}{character, control units used for the DiD estimates, options are "both", "never", or "notyet".} -\item{result_type}{A character string indicating the type of result to be returned. Default is "group_time".} +\item{result_type}{character, type of result to return, options are "group_time", "time", "group", "simple", "dynamic" (time since event), "group_group_time", or "dynamic_stagger".} -\item{balanced_event_time}{A numeric scalar that indicates the max event time to balance the cohort composition, only meaningful when result_type == "dynamic". Default is NA} +\item{balanced_event_time}{number, max event time to balance the cohort composition.} -\item{control_type}{The method for controlling for covariates. "ipw" for inverse probability weighting, "reg" for outcome regression, or "dr" for doubly-robust} +\item{control_type}{character, estimator for controlling for covariates, options are "ipw" (inverse probability weighting), "reg" (outcome regression), or "dr" (doubly-robust).} -\item{allow_unbalance_panel}{Whether allow unbalance panel as input (if false will coerce the dataset to a balanced panel). Default is FALSE} +\item{allow_unbalance_panel}{logical, allow unbalance panel as input or coerce dataset into one.} -\item{boot}{Logical, indicating whether bootstrapping should be performed. Default is FALSE} +\item{boot}{logical, whether to use bootstrap standard error.} -\item{biters}{The number of bootstrap iterations. Only relevant if boot = TRUE. Default is 1000.} +\item{biters}{number, bootstrap iterations. Default is 1000.} -\item{cband}{Logical, indicate whether to use uniform confidence band or point-wise, defulat is FALSE (use point-wise)} +\item{cband}{logical, whether to use uniform confidence band or point-wise.} -\item{alpha}{The significance level, default is 0.05} +\item{alpha}{number, the significance level. Default is 0.05.} -\item{weightvar}{The name of the weight variable, if not specified will cluster on unit level (optional).} +\item{weightvar}{character, name of the weight variable.} -\item{clustervar}{The name of the cluster variable, can only be used when boot == TRUE (optional).} +\item{clustervar}{character, name of the cluster variable.} -\item{covariatesvar}{A character vector containing the names of time-invariant covariate variables (optional).} +\item{covariatesvar}{character vector, names of time-invariant covariate variables.} -\item{varycovariatesvar}{A character vector containing the names of time-varying covariate variables (optional).} +\item{varycovariatesvar}{character vector, names of time-varying covariate variables.} -\item{copy}{whether to copy the dataset before processing, set to false to speed up the process, but the input data will be altered.} +\item{copy}{logical, whether to copy the dataset.} -\item{validate}{whether to validate the dataset before processing.} +\item{validate}{logical, whether to validate the dataset.} -\item{anticipation}{periods with aniticipation (delta in CS, default is 0, reference period is g - delta - 1).} +\item{anticipation}{number, periods with anticipation.} -\item{base_period}{same as did} +\item{base_period}{character, type of base period in pre-preiods, options are "universal", or "varying".} -\item{exper}{the list of experimental features, for features that are not in CSDID originally. Generally less tested.} +\item{exper}{list, arguments for experimental features.} + +\item{full}{logical, whether to return the full result (influence function, call, weighting scheme, etc,.).} + +\item{parallel}{logical, whether to use parallization on unix system.} + +\item{cohortvar2}{character, name of the second cohort (group) variable.} + +\item{event_specific}{logical, whether to recover target treatment effect or use combined effect.} + +\item{double_control_option}{character, control units used for the double DiD, options are "both", "never", or "notyet".} } \value{ -A data.table containing the estimated treatment effects and standard errors. +A data.table containing the estimated treatment effects and standard errors or a list of all results when `full == TRUE`. } \description{ -Performs Difference-in-Differences (DID) estimation fast. +Performs Difference-in-Differences (DID) estimation. +} +\details{ +`balanced_event_time` is only meaningful when `result_type == "dynamic`. + +`result_type` as `group-group-time` and `dynamic staggered` is only meaningful when using double did. + +`biter` and `clustervar` is only used when `boot == TRUE`. } \examples{ # simulated data -simdt <- sim_did(1e+03, 10, cov = "cont", second_cov = TRUE, second_outcome = TRUE) +simdt <- sim_did(1e+02, 10, cov = "cont", second_cov = TRUE, second_outcome = TRUE, seed = 1) dt <- simdt$dt #basic call @@ -93,24 +115,6 @@ result <- fastdid(data = dt, timevar = "time", cohortvar = "G", unitvar = "unit", outcomevar = "y", result_type = "group_time") -#control for covariates -result2 <- fastdid(data = dt, timevar = "time", cohortvar = "G", - unitvar = "unit", outcomevar = "y", - result_type = "group_time", - covariatesvar = c("x", "x2")) - -#bootstrap and clustering -result3 <- fastdid(data = dt, timevar = "time", cohortvar = "G", - unitvar = "unit", outcomevar = "y", - result_type = "group_time", - boot = TRUE, clustervar = "x") - -#estimate for multiple outcomes -result4 <- fastdid(data = dt, #the dataset - timevar = "time", cohortvar = "G", unitvar = "unit", - outcomevar = c("y", "y2"), #name of the outcome columns - result_type = "group_time") - } \keyword{computation} \keyword{data} diff --git a/man/plot_did_dynamics.Rd b/man/plot_did_dynamics.Rd index b98bd0a..c2ab329 100644 --- a/man/plot_did_dynamics.Rd +++ b/man/plot_did_dynamics.Rd @@ -1,21 +1,34 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plot_event_dynamics.R +% Please edit documentation in R/generics.R \name{plot_did_dynamics} \alias{plot_did_dynamics} -\title{Create an event study plot for Difference-in-Differences (DiD) analysis.} +\title{Plot event study} \usage{ -plot_did_dynamics(dt, graphname = "event study plot", note = "") +plot_did_dynamics(x, margin = "event_time") } \arguments{ -\item{dt}{A data table containing the results of the DiD analysis. It should include columns for 'att' (average treatment effect), 'se' (standard error), and 'event_time' (time points).} +\item{x}{A data table generated with [fastdid] with one-dimensional index.} -\item{graphname}{A character string specifying the title of the plot (default is "event study plot").} - -\item{note}{A character string for adding additional notes or comments to the plot (default is empty).} +\item{margin}{character, the x-axis of the plot} } \value{ -A ggplot2 object representing the event study plot. +A ggplot2 object } \description{ -This function generates an event study plot based on the results of a DiD analysis. +Plot event study results. +} +\examples{ + +# simulated data +simdt <- sim_did(1e+02, 10, seed = 1) +dt <- simdt$dt + +#estimation +result <- fastdid(data = dt, timevar = "time", cohortvar = "G", + unitvar = "unit", outcomevar = "y", + result_type = "dynamic") + +#plot +plot_did_dynamics(result) + } diff --git a/man/sim_did.Rd b/man/sim_did.Rd index e7fbb3d..70c614c 100644 --- a/man/sim_did.Rd +++ b/man/sim_did.Rd @@ -18,7 +18,10 @@ sim_did( balanced = TRUE, seed = NA, stratify = FALSE, - treatment_assign = "latent" + treatment_assign = "latent", + second_cohort = FALSE, + confound_ratio = 1, + second_het = "all" ) } \arguments{ @@ -49,6 +52,12 @@ sim_did( \item{stratify}{Whether to stratify the dataset based on a binary covariate.} \item{treatment_assign}{The method for treatment assignment ("latent" or "uniform").} + +\item{second_cohort}{include confounding events} + +\item{confound_ratio}{extent of event confoundedness} + +\item{second_het}{heterogeneity of the second event} } \value{ A list containing the simulated dataset (dt) and the treatment effect values (att). @@ -60,7 +69,4 @@ Simulates a dataset for a Difference-in-Differences analysis with various custom # Simulate a DiD dataset with default settings data <- sim_did(sample_size = 100, time_period = 5) -# Simulate a DiD dataset with customized settings -data <- sim_did(sample_size = 200, time_period = 8, cov = "int", hetero = "dynamic") - } diff --git a/tests/tinytest.R b/tests/tinytest.R new file mode 100644 index 0000000..097e1f1 --- /dev/null +++ b/tests/tinytest.R @@ -0,0 +1,9 @@ + +#from https://github.com/gdemin/maditr/blob/master/tests/tinytest.R +if(requireNamespace("tinytest", quietly=TRUE)){ + library(tinytest) + #options(covr = FALSE) + data.table::setDTthreads(0) + tinytest::test_package("fastdid", remove_side_effects=FALSE) + data.table::setDTthreads(NULL) +} \ No newline at end of file diff --git a/vignettes/double.Rmd b/vignettes/double.Rmd new file mode 100644 index 0000000..78ab0f2 --- /dev/null +++ b/vignettes/double.Rmd @@ -0,0 +1,78 @@ +--- +title: "double" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{double} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +Here we show how to deal with confounding events, see [Tsai (2024)](not_ready_yet) (coming soon!) for the full treatment. We first simulate some data with confounding events. + +```{r setup} +library(fastdid) +library(ggplot2) +library(data.table) +simdt <- sim_did(1e+04, 5, cov = "cont", hetero = "all", balanced = TRUE, seed = 1, + second_cohort = TRUE, second_het = "no") #comfounding event +dt <- simdt$dt #dataset + +#ground truth att +att <- simdt$att |> merge(dt[,.(w = .N),by = "G"], by = "G") +att[, event_time := time-G] +att <- att[event == 1,.(att = weighted.mean(attgt, w)), by = "event_time"] +``` + +Using the default estimator, the estimates (black line) is biased from the ground truth (red line). + +```{r, naive} +naive_result <-fastdid(data = dt, + timevar = "time", cohortvar = "G", unitvar = "unit", + outcomevar = "y", result_type = "dynamic") +plot_did_dynamics(naive_result) + + geom_line(aes(y = att, x = event_time), + data = att, color = "red") + theme_bw() +``` +Diagnostics can be obtained by using `time >= G2` as outcome. As we set the effect of the confounding event at 10 constantly, we can see that the bias of the default estimator is roughly 10 times the diagnostics. + +```{r, diag} +dt[, D2 := time >= G2] +diag <- fastdid(data = dt, + timevar = "time", cohortvar = "G", unitvar = "unit", + outcomevar = "D2", result_type = "dynamic") +plot_did_dynamics(diag) + theme_bw() +``` + +Using the double did estimator with `cohortvar2`, the estimator recovers the ground truth. + +```{r, double} +double_result <-fastdid(data = dt, + timevar = "time", cohortvar = "G", unitvar = "unit", + outcomevar = "y", result_type = "dynamic", + cohortvar2 = "G2", event_specific = TRUE) +plot_did_dynamics(double_result) + + geom_line(aes(y = att, x = event_time), + data = att, color = "red") + theme_bw() +``` + +Double DiD also allow for two additional aggregation scheme: group-group-time ("group_group_time") and dynamic-staggered ("dynamic_stagger", event time by event stagger, G1-G2). + +```{r, double} +double_result_ds <-fastdid(data = dt, + timevar = "time", cohortvar = "G", unitvar = "unit", + outcomevar = "y", result_type = "dynamic_stagger", + cohortvar2 = "G2", event_specific = TRUE) + +double_result_ggt <-fastdid(data = dt, + timevar = "time", cohortvar = "G", unitvar = "unit", + outcomevar = "y", result_type = "group_group_time", + cohortvar2 = "G2", event_specific = TRUE) + +``` \ No newline at end of file diff --git a/vignettes/misc.Rmd b/vignettes/misc.Rmd new file mode 100644 index 0000000..58abded --- /dev/null +++ b/vignettes/misc.Rmd @@ -0,0 +1,101 @@ +--- +title: "misc" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{misc} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +# Performance + +`fastdid` is magnitudes faster than [did](https://github.com/bcallaway11/did), and 15x faster than the fastest alternative [DiDforBigData](https://github.com/setzler/DiDforBigData) (dfbd for short) for large dataset. fastdid also uses less memory. Here is a comparison of run time for fastdid, did, and dfbd using a panel of 10 periods and varying samples sizes. + +![](https://i.imgur.com/s5v32Rw.png) + +Unfortunately, the Author's computer fails to run **did** at 1 million units. For a rough idea, **DiDforBigData** is about 100x faster than **did** in Bradley Setzler's [benchmark](https://setzler.github.io/DiDforBigData/articles/Background.html). Other staggered DiD implementations are even slower than **did**. + +For memory: + +![](https://i.imgur.com/7emkgOz.png) + +For the benchmark, a baseline group-time ATT is estimated with no covariates control, no bootstrap, and no explicit parallelization. Computing time is measured by `microbenchmark` and peak RAM by `peakRAM`. + +# Validitiy + +Before each release, we conduct tests to ensure the validity of estimates from `fastdid`. + +## Basics: comparison with `did` + +For features included in CS, `fastdid` maintains a maximum of 1% difference from results from the `did` package. This margin of error is mostly for bootstrapped results due to its inherent randomess. For point estimates, the difference is smaller than 1e-12, and is most likely the result of [floating-point error](https://en.wikipedia.org/wiki/Floating-point_error_mitigation). The relevant test files are [group-time](https://github.com/TsaiLintung/fastdid/blob/main/inst/tinytest/test_2_compare_gt.R) and [aggregates](https://github.com/TsaiLintung/fastdid/blob/main/inst/tinytest/test_3_compare_agg.R). + +## Extensions: coverage test + +For features not included in CS, `fastdid` maintains that the 95% confidence intervals have a coverage rate between 94% and 96%. + +The coverage rate is calculated by running 200 iterations. In each iteration, we test whether the confidence interval estimated covers the group-truth values. We then average the rate across iterations. Due to the randomness of coverage, the realized coverage fall outside of the thresholds in about 1% of the time. The relevant test file is [coverage](https://github.com/TsaiLintung/fastdid/blob/main/inst/tinytest/test_5_coverage.R). + +## Experimental Features: not tested + +As an attempt to balance the validity and flexibility of `fastdid`, "experimental features" is introduced in version 0.9.4. These features will be less tested and documented, and it is generally advised to not use them unless the user know what they and the package are doing. These experimental features can be accessed via the `exper` argument. For example, to use the `filtervar` feature, call `fastdid(..., exper = list(filtervar = "FF"))`. + +The current list of experimental features are + +- `max_control_cohort_diff`: limit the max cohort difference between treated and control group +- `filtervar`, `filtervar_post`: limit the units being used as treated and control group with a potentially-time-varying variable in base (post) period +- `only_balance_2by2`: only require observations to have non-NA values within each 2 by 2 DiD, instead of throughout all time periods. Can be an alternative way of dealing with unbalanced panel by filling the missing periods with NAs. Not recommended as CS only have `allow_unbalance_panel`, which uses a repeated cross-section 2 by 2 DiD estimator. +- `custom_scheme`: aggregate to user-defined parameters + + +# Comparison with `did` + +As the name suggests, **fastdid**'s goal is to be fast **did**. Besides performance, here are some comparisons between the two packages. + +## Estimator + +**fastdid**'s estimators is identical to **did**'s. As the performance gains mostly come from efficient data manipulation, the key estimation implementations are analogous. For example, 2x2 DiD (`estimate_did.R` and `DRDID::std_ipw_did_panel`), influence function from weights (`aggregate_gt.R/get_weight_influence`, `compute.aggte.R/wif`), and multiplier bootstrap (`get_se.R` and `mboot.R`). + +## Interface + +**fastdid** should feel similar to `att_gt`. But there are a few differences: + +Control group option: + +```{r table1, echo=FALSE, message=FALSE, warnings=FALSE, results='asis'} +tabl <- " +| fastdid | did | control group used | +|-|-|-| +| both | notyettreated | never-treated + not-yet-but-eventually-treated | +| never| nevertreated | never-treated | +| notyet | | not-yet-but-eventually-treated | +" +cat(tabl) # output the table in a format good for HTML/PDF/docx conversion +``` + +Aggregated parameters: `fastdid` aggregates in the same function. + +```{r table2, echo=FALSE, message=FALSE, warnings=FALSE, results='asis'} +tabl <- " +| fastdid | did | +|-|-| +| group_time | no aggregation | +|dynamic|dynamic| +|time|calendar| +|group|group| +|simple|simple| +" +cat(tabl) # output the table in a format good for HTML/PDF/docx conversion +``` + +## Other + +1. **fastdid** only offers inverse probability weights estimators for controlling for covariates when allowing for unbalanced panels. +2. **fastdid** use universal base periods as default. +