05-data.Rmd

# Example Dataset {-}

## WASH Benefits Bangladesh Study {#wash} {-}

The example data come from a study of the effect of water quality, sanitation,
hand washing, and nutritional interventions on child development in rural
Bangladesh (WASH Benefits Bangladesh), a cluster randomized controlled trial
[@luby2018effect]. The study enrolled pregnant women in their first or second
trimester from the rural villages of Gazipur, Kishoreganj, Mymensingh, and
Tangail districts of central Bangladesh, with an average of eight women per
cluster. Groups of eight geographically adjacent clusters were block randomized,
using a random number generator, into six intervention groups (all of which
received weekly visits from a community health promoter for the first 6 months
and every 2 weeks for the next 18 months) and a double-sized control group (no
intervention or health promoter visit). The six intervention groups were:

1. chlorinated drinking water;
2. improved sanitation;
3. handwashing with soap;
4. combined water, sanitation, and handwashing;
5. improved nutrition through counseling and provision of lipid-based nutrient
   supplements; and
6. combined water, sanitation, handwashing, and nutrition.

In the handbook, we concentrate on child growth (size for age) as the outcome of
interest. For reference, this trial was registered with ClinicalTrials.gov under
registration number NCT01590095.

```{r load_washb_data_intro}
library(readr)
# read in data via readr::read_csv
dat <- read_csv(
  paste0(
    "https://raw.githubusercontent.com/tlverse/tlverse-data/master/",
    "wash-benefits/washb_data.csv"
  )
)
```
For instructional purposes, we start by treating the data as independent and
identically distributed (i.i.d.) random draws from a large target population. We
could account for the clustering of the data (within sampled geographic units),
but, we avoid these details in this handbook for the sake of clarity of
illustration.  Modifications of TL methodology for biased samples, repeated
measures, and related complications, are readily available.

We have `r ncol(dat)` variables measured, of which a single variable is set to
be the outcome of interest. This outcome, $Y$, is the weight-for-height Z-score
(`whz` in `dat`); the treatment of interest, $A$, is the randomized treatment
group (`tr` in `dat`); and the adjustment set (potential baseline confounders),
$W$, consists simply of *everything else*. This results in our observed data
structure being $n$ i.i.d.  copies of $O_i = (W_i, A_i, Y_i)$, for $i = 1,
\ldots, n$.

Using the [`skimr` package](https://CRAN.R-project.org/package=skimr), we can
quickly summarize the variables measured in the WASH Benefits data set:

```{r skim_washb_data, results="asis", echo=FALSE}
library(skimr)
# optionally disable sparkline graphs for PDF output
if (knitr::is_latex_output()) {
  knitr::kable(skim_no_sparks(dat), format = "latex")
} else {
  skim(dat)
}
```
A convenient summary of the relevant variables appears above, complete with a
sparkline visualizations describing the marginal characteristics of each
covariate. Note that the *asset* variables reflect socioeconomic status of the
study participants. Notice also the uniform distribution of the treatment groups
(with twice as many controls) -- this is, of course, by design.