# Analysis of Variance
**Chapter Status:** This chapter should be considered optional for a first reading of this text. Its inclusion is mostly for the benefit of some courses that use the text. Additionally, this chapter is currently somewhat underdeveloped compared to the rest of the text. If you are interested in contributing, you can find several lines marked "TODO" in the source. Pull requests encouraged!
```{r, include = FALSE}
knitr::opts_chunk$set(cache = TRUE, autodep = TRUE, fig.align = "center")
```
<style type="text/css">
.table {
width: 70%;
margin-left:10%;
margin-right:10%;
}
</style>
<!-- TODO: use kable styling instead of custom CSS -->
> "To find out what happens when you change something, it is necessary to change it."
>
> --- **Box, Hunter, and Hunter**, Statistics for Experimenters (1978)
Thus far, we have built models for numeric responses where the predictors are all numeric. We'll take a minor detour to go back and consider models which *only* have **categorical** predictors. A categorical predictor is a variable which takes only a finite number of values, which are not ordered. For example, a variable which takes the possible values `red`, `blue`, and `green` is categorical. In the context of using a categorical variable as a predictor, it would place observations into different groups (categories).
We've also mostly been dealing with observational data. The methods in this section are most useful in experimental settings, but still work with observational data. (However, for determining causation, we require experiments.)
## Experiments
The biggest difference between an observational study and an experiment is *how* the predictor data is obtained. Is the experimenter in control?
- In an **observational** study, both response and predictor data are obtained via observation.
- In an **experiment**, the predictor data are values determined by the experimenter. The experiment is run and the response is observed.
In an experiment, the predictors, which are controlled by the experimenter, are called **factors**. The possible values of these factors are called **levels**. Subjects are *randomly* assigned to a level of each of the factors.
The design of experiments could be a course by itself. The Wikipedia article on [design of experiments](https://en.wikipedia.org/wiki/Design_of_experiments){target="_blank"} gives a good overview. Originally, most of the methodology was developed for agricultural applications by [R. A. Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher){target="_blank"}, but it is still in use today, now in a wide variety of application areas. Notably, these methods have seen a resurgence as a part of "A/B Testing."
<!-- TODO: In the future, discuss the Morrow Plots: http://cropsci.illinois.edu/research/morrow -->
## Two-Sample t-Test
The simplest example of an experimental design is the setup for a two-sample $t$-test. There is a single factor variable with two levels which split the subjects into two groups. Often, one level is considered the **control**, while the other is the **treatment**. The subjects are randomly assigned to one of the two groups. After being assigned to a group, each subject has some quantity measured, which is the response variable.
Mathematically, we consider the model
\[
y_{ij} \sim N(\mu_i, \sigma^2)
\]
where $i = 1, 2$ for the two groups and $j = 1, 2, \ldots n_i$. Here $n_i$ is the number of subjects in group $i$. So $y_{13}$ would be the measurement for the third member of the first group.
So measurements of subjects in group $1$ follow a normal distribution with mean $\mu_1$.
\[
y_{1j} \sim N(\mu_1, \sigma^2)
\]
Then measurements of subjects in group $2$ follow a normal distribution with mean $\mu_2$.
\[
y_{2j} \sim N(\mu_2, \sigma^2)
\]
This model makes a number of assumptions. Specifically,
- The observations follow a **normal** distribution. Each group may have its own mean.
- **Equal variance** for each group.
- **Independence**, which is believable if the groups were randomly assigned.
Later, we will investigate the normal and equal variance assumptions. For now, we will continue to assume they are reasonable.
The natural question to ask: Is there a difference between the two groups? The specific question we'll answer: Are the means of the two groups different?
Mathematically, that is
\[
H_0: \mu_1 = \mu_2 \quad \text{vs} \quad H_1: \mu_1 \neq \mu_2
\]
For the stated model and assuming the null hypothesis is true, the $t$ test statistic would follow a $t$ distribution with degrees of freedom $n_1 + n_2 - 2$.
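For reference (we will not derive it here), this is the usual pooled two-sample $t$ statistic,

\[
t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \quad \text{where} \quad s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}.
\]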
```{r, echo = FALSE}
set.seed(42)
sleep_means = c(6.5, 8.2)
sleep_sigma = 1.2
melatonin = data.frame(
sleep = rnorm(n = 20, mean = sleep_means, sd = sleep_sigma),
group = rep(c("control", "treatment"), 10)
)
```
As an example, suppose we are interested in the effect of [melatonin](https://en.wikipedia.org/wiki/Melatonin){target="_blank"} on sleep duration. A researcher obtains a random sample of 20 adult males. Of these subjects, 10 are randomly chosen for the control group, which will receive a placebo. The remaining 10 will be given 5mg of melatonin before bed. The sleep duration in hours of each subject is then measured. The researcher chooses a significance level of $\alpha = 0.10$. Was sleep duration affected by the melatonin?
```{r}
melatonin
```
Here, we would like to test,
\[
H_0: \mu_C = \mu_T \quad \text{vs} \quad H_1: \mu_C \neq \mu_T
\]
To do so in `R`, we use the `t.test()` function, with the `var.equal` argument set to `TRUE`.
```{r}
t.test(sleep ~ group, data = melatonin, var.equal = TRUE)
```
At a significance level of $\alpha = 0.10$, we reject the null hypothesis. It seems that the melatonin had a **statistically significant** effect. Be aware that statistical significance is not always the same as scientific or **practical significance**. To determine practical significance, we need to investigate the **effect size** in the context of the situation. Here the effect size is the difference of the sample means.
```{r}
t.test(sleep ~ group, data = melatonin, var.equal = TRUE)$estimate
```
Here we see that the subjects in the melatonin group sleep an average of about 1.5 hours longer than the control group. An hour and a half of sleep is certainly important!
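As a quick check, this difference of sample means can also be computed directly from the data.

```{r}
with(melatonin, mean(sleep[group == "treatment"]) - mean(sleep[group == "control"]))
```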
With a big enough sample size, we could make an effect size of, say, four minutes statistically significant. Is it worth taking a pill every night to get an extra four minutes of sleep? (Probably not.)
```{r}
boxplot(sleep ~ group, data = melatonin, col = 5:6)
```
<!-- TODO: other parameterization, explain identifiability -->
## One-Way ANOVA
What if there are more than two groups? Consider the model
\[
y_{ij} = \mu + \alpha_i + e_{ij}
\]
where
\[
\sum_{i=1}^{g} \alpha_i = 0
\]
and
\[
e_{ij} \sim N(0,\sigma^{2}).
\]
Here,
- $i = 1, 2, \ldots g$ where $g$ is the number of groups.
- $j = 1, 2, \ldots n_i$ where $n_i$ is the number of observations in group $i$.
Then the total sample size is
\[
N = \sum_{i = 1}^{g} n_i
\]
Observations from group $i$ follow a normal distribution
\[
y_{ij} \sim N(\mu_i,\sigma^{2})
\]
where the mean of each group is given by
\[
\mu_i = \mu + \alpha_i.
\]
Here $\alpha_i$ measures the effect of group $i$. It is the difference between the mean of group $i$ and the overall mean.
Essentially, the assumptions here are the same as in the two-sample case; we simply have more groups.
Much like the two-sample case, we would again like to test if the means of the groups are equal.
\[
H_0: \mu_1 = \mu_2 = \ldots = \mu_g \quad \text{vs} \quad H_1: \text{ Not all } \mu_i \text{ are equal.}
\]
Notice that the alternative simply indicates that some of the means are not equal, not specifically which are not equal. More on that later.
Alternatively, we could write
\[
H_0: \alpha_1 = \alpha_2 = \ldots = \alpha_g = 0 \quad \text{vs} \quad H_1: \text{ Not all } \alpha_i \text{ are } 0.
\]
This test is called **Analysis of Variance (ANOVA)**. ANOVA compares the variation due to specific sources (between groups) with the variation among individuals who should be similar (within groups). In particular, ANOVA tests whether several populations have the same mean by comparing how far apart the sample means are with how much variation there is within the samples. We use variability of means to test for equality of means, thus the use of *variance* in the name for a test about means.
We'll leave out most of the details about how the estimation is done, but we'll see later that it is done via least squares. We'll use `R` to obtain these estimates, but they are actually rather simple. We only need to think about the sample means of the groups.
- $\bar{y}_i$ is the sample mean of group $i$.
- $\bar{y}$ is the overall sample mean.
- $s_{i}^{2}$ is the sample variance of group $i$.
We'll then decompose the variance, as we've seen before in regression. The **total** variation measures how much the observations vary about the overall sample mean, *ignoring the groups*.
\[
SST = \sum_{i = 1}^{g} \sum_{j = 1}^{n_i} (y_{ij} - \bar{y})^2
\]
The variation **between** groups looks at how far the individual sample means are from the overall sample mean.
\[
SSB = \sum_{i = 1}^{g} \sum_{j = 1}^{n_i} (\bar{y}_i - \bar{y})^2 = \sum_{i = 1}^{g} n_i (\bar{y}_i - \bar{y})^2
\]
Lastly, the **within** group variation measures how far each observation is from the sample mean of its group.
\[
SSW = \sum_{i = 1}^{g} \sum_{j = 1}^{n_i} (y_{ij} - \bar{y}_i)^2 = \sum_{i = 1}^{g} (n_i - 1) s_{i}^{2}
\]
This could also be thought of as the error sum of squares, where $y_{ij}$ is an observation and $\bar{y}_i$ is its fitted (predicted) value from the model.
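As in regression, these quantities decompose the total variation,

\[
\text{SST} = \text{SSB} + \text{SSW}.
\]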
To develop the test statistic for ANOVA, we place this information into an ANOVA table.
| Source | Sum of Squares | Degrees of Freedom | Mean Square | $F$ |
|---------|--------------|----------------|------------|-------------|
| Between | SSB | $g - 1$ | SSB / DFB | MSB / MSW |
| Within | SSW | $N - g$ | SSW / DFW | |
| Total | SST | $N - 1$ | | |
We reject the null (equal means) when the $F$ statistic is large. This occurs when the variation between groups is large compared to the variation within groups. Under the null hypothesis, the distribution of the test statistic is $F$ with degrees of freedom $g - 1$ and $N - g$.
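Written out, the test statistic is

\[
F = \frac{\text{SSB} / (g - 1)}{\text{SSW} / (N - g)} = \frac{\text{MSB}}{\text{MSW}}.
\]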
```{r, echo = FALSE}
library(broom)
plot_anova = function(n = 20, mu_a = 0, mu_b = 0, mu_c = 0, sigma = 1) {
response = rnorm(n * 3, mean = c(mu_a , mu_b, mu_c), sd = sigma)
group = factor(rep(LETTERS[1:3], n))
xmin = min(c(mu_a , mu_b, mu_c)) - 3 * sigma
xmax = max(c(mu_a , mu_b, mu_c)) + 3 * sigma
plot(0, main = "Truth",
xlim = c(xmin, xmax), ylim = c(0, 0.40), type = "n",
xlab = "observations", ylab = "density")
curve(dnorm(x, mean = mu_a, sd = sigma),
from = mu_a - 3 * sigma, to = mu_a + 3 * sigma,
add = TRUE, lwd = 2, col = "dodgerblue", lty = 1)
curve(dnorm(x, mean = mu_b, sd = sigma),
from = mu_b - 3 * sigma, to = mu_b + 3 * sigma,
add = TRUE, lwd = 2, col = "darkorange", lty = 2)
curve(dnorm(x, mean = mu_c, sd = sigma),
from = mu_c - 3 * sigma, to = mu_c + 3 * sigma,
add = TRUE, lwd = 3, col = "black", lty = 3)
rug(response[group == "A"], col = "dodgerblue",
lwd = 1.5, ticksize = 0.1, quiet = TRUE, lty = 1)
rug(response[group == "B"], col = "darkorange",
lwd = 1.5, ticksize = 0.1, quiet = TRUE, lty = 2)
rug(response[group == "C"], col = "black",
lwd = 2, ticksize = 0.1, quiet = TRUE, lty = 3)
boxplot(response ~ group, xlab = "group", main = "Observed Data", medcol = "white", varwidth = FALSE)
stripchart(response ~ group, vertical = TRUE, method = "jitter", add = TRUE,
pch = 20, col = c("dodgerblue", "darkorange", "black"))
abline(h = mean(response), lwd = 3, lty = 1, col = "darkgrey")
segments(x0 = 0.6, x1 = 1.4, y0 = mean(response[group == "A"]), y1 = mean(response[group == "A"]), lwd = 2, lty = 2, col = "dodgerblue")
segments(x0 = 1.6, x1 = 2.4, y0 = mean(response[group == "B"]), y1 = mean(response[group == "B"]), lwd = 2, lty = 2, col = "darkorange")
segments(x0 = 2.6, x1 = 3.4, y0 = mean(response[group == "C"]), y1 = mean(response[group == "C"]), lwd = 2, lty = 2, col = "black")
# use lm instead of aov for ease of use with broom::glance
aov_results = lm(response ~ group)
f_stat = glance(aov_results)$statistic
p_val = glance(aov_results)$p.value
list(f = f_stat, p = p_val)
# summary(aov_results)
}
```
<!-- TODO: This should be a Shiny app. -->
<!-- TODO: Should output entire ANOVA table. -->
<!-- TODO: Do with ggplot2 and custom boxplots with show min, sd, mean, sd, max -->
```{r, echo = FALSE, eval = FALSE}
library(manipulate)
par(mfrow = c(1, 2))
manipulate(plot_anova(n, mu_a, mu_b, mu_c, sigma),
n = slider(2, 50, 20),
mu_a = slider(-10, 10, 0),
mu_b = slider(-10, 10, 0),
mu_c = slider(-10, 10, 0),
sigma = slider(0, 10, 1, step = 0.5))
```
Let's see what this looks like in a few situations. In each of the following examples, we'll consider sampling 20 observations ($n_i = 20$) from three populations (groups; $g = 3$).
First, consider $\mu_A = -5, \mu_B = 0, \mu_C = 5$ with $\sigma = 1$.
```{r, fig.height = 5, fig.width = 10, echo = FALSE}
set.seed(42)
par(mfrow = c(1, 2))
p1 = plot_anova(n = 20, mu_a = -5, mu_b = 0, mu_c = 5, sigma = 1)
```
The left panel shows the three normal distributions we are sampling from. The ticks along the $x$-axis show the randomly sampled observations. The right panel re-displays only the sampled values in a boxplot. Note that the mid-line of the boxes is usually the sample median. These boxplots have been modified to use the sample mean.
Here the sample means vary a lot around the overall sample mean, which is the solid grey line on the right panel. Within the groups there is variability, but it is still obvious that the sample means are very different.
As a result, we obtain a *large* test statistic, and thus a *small* p-value.
- $F = `r p1$f`$
- $\text{p-value} = `r p1$p`$
Now consider $\mu_A = 0, \mu_B = 0, \mu_C = 0$ with $\sigma = 1$. That is, equal means for the groups.
```{r, fig.height = 5, fig.width = 10, echo = FALSE}
set.seed(1337)
par(mfrow = c(1, 2))
p2 = plot_anova(n = 20, mu_a = 0, mu_b = 0, mu_c = 0, sigma = 1)
```
Here the sample means vary only a tiny bit around the overall sample mean. Within the groups there is variability, which this time is much larger than the variability of the sample means.
As a result, we obtain a *small* test statistic, and thus a *large* p-value.
- $F = `r p2$f`$
- $\text{p-value} = `r p2$p`$
The next two examples show different means, with different levels of noise. Notice how these affect the test statistic and p-value.
- $\mu_A = -1, \mu_B = 0, \mu_C = 1, \sigma = 1$
```{r, fig.height = 5, fig.width = 10, echo = FALSE}
set.seed(42)
par(mfrow = c(1, 2))
p3 = plot_anova(n = 20, mu_a = -1, mu_b = 0, mu_c = 1, sigma = 1)
```
- $F = `r p3$f`$
- $\text{p-value} = `r p3$p`$
Above, there isn't obvious separation between the groups as in the first example, but it is still clear that the means are different. Below, there is more noise. Visually it is somewhat hard to tell, but the test still suggests a difference of means (at an $\alpha$ of 0.05).
- $\mu_A = -1, \mu_B = 0, \mu_C = 1, \sigma = 2$
```{r, fig.height = 5, fig.width = 10, echo = FALSE}
set.seed(42)
par(mfrow = c(1, 2))
p4 = plot_anova(n = 20, mu_a = -1, mu_b = 0, mu_c = 1, sigma = 2)
```
- $F = `r p4$f`$
- $\text{p-value} = `r p4$p`$
Let's consider an example with real data. We'll use the `coagulation` dataset from the `faraway` package. Here four different diets (`A`, `B`, `C`, `D`) were administered to a random sample of 24 animals. The subjects were randomly assigned to one of the four diets. For each, their blood coagulation time was measured in seconds.
Here we would like to test
\[
H_0: \mu_A = \mu_B = \mu_C = \mu_D
\]
where, for example, $\mu_A$ is the mean blood coagulation time for an animal that ate diet `A`.
```{r}
library(faraway)
names(coagulation)
plot(coag ~ diet, data = coagulation, col = 2:5)
```
We first load the data and create the relevant boxplot. The plot alone suggests a difference of means. The `aov()` function is used to obtain the relevant sums of squares. Using the `summary()` function on the output from `aov()` creates the desired ANOVA table (without the unneeded row for total).
```{r}
coag_aov = aov(coag ~ diet, data = coagulation)
coag_aov
summary(coag_aov)
```
Were we to run this experiment, we would have pre-specified a significance level. However, notice that the p-value of this test is incredibly low, so using any reasonable significance level we would reject the null hypothesis. Thus we believe the diets had an effect on blood coagulation time.
```{r}
diets = data.frame(diet = unique(coagulation$diet))
data.frame(diets, coag = predict(coag_aov, diets))
```
Here, we've created a dataframe with a row for each diet. By predicting on this dataframe, we obtain the sample means of each diet (group).
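To connect this output back to the sums of squares defined above, here is a quick sketch that computes $SSB$, $SSW$, and the $F$ statistic by hand for the coagulation data; the values should match the `Sum Sq` and `F value` entries in the ANOVA table above.

```{r}
y_bar       = mean(coagulation$coag)                            # overall sample mean
group_means = tapply(coagulation$coag, coagulation$diet, mean)  # sample mean of each diet
group_sizes = table(coagulation$diet)                           # observations per diet
SSB = sum(group_sizes * (group_means - y_bar) ^ 2)              # between-group variation
SSW = sum((coagulation$coag - group_means[as.character(coagulation$diet)]) ^ 2) # within-group variation
g = length(group_means)
N = nrow(coagulation)
c(SSB = SSB, SSW = SSW, F = (SSB / (g - 1)) / (SSW / (N - g)))
```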
### Factor Variables
When performing ANOVA in `R`, be sure the grouping variable is a factor variable. If it is not, the result will not be ANOVA, but instead a linear regression that treats the grouping variable as numeric.
```{r}
set.seed(42)
response = rnorm(15)
group = c(rep(1, 5), rep(2, 5), rep(3, 5))
bad = data.frame(response, group)
summary(aov(response ~ group, data = bad)) # wrong DF!
good = data.frame(response, group = as.factor(group))
summary(aov(response ~ group, data = good))
is.factor(bad$group) # 1, 2, and 3 are numbers.
is.factor(good$group) # 1, 2, and 3 are labels.
```
### Some Simulation
Here we verify the distribution of the test statistic under the null hypothesis. We simulate from a null model (equal means) to obtain an empirical distribution of the $F$ statistic. We add the curve for the expected distribution.
```{r sim-anova-function}
library(broom)
sim_anova = function(n = 10, mu_a = 0, mu_b = 0, mu_c = 0, mu_d = 0, sigma = 1, stat = TRUE) {
# create data from one-way ANOVA model with four groups of equal size
# response simulated from normal with group mean, shared variance
# group variable indicates group A, B, C or D
sim_data = data.frame(
response = c(rnorm(n = n, mean = mu_a, sd = sigma),
rnorm(n = n, mean = mu_b, sd = sigma),
rnorm(n = n, mean = mu_c, sd = sigma),
rnorm(n = n, mean = mu_d, sd = sigma)),
group = c(rep("A", times = n), rep("B", times = n),
rep("C", times = n), rep("D", times = n))
)
# obtain F-statistic and p-value for testing difference of means
# use lm instead of aov for better result formatting with glance
aov_results = lm(response ~ group, data = sim_data)
f_stat = glance(aov_results)$statistic
p_val = glance(aov_results)$p.value
# return f_stat if stat = TRUE, otherwise, p-value
ifelse(stat, f_stat, p_val)
}
```
```{r sample-sim-anova-simulation, eval = FALSE}
f_stats = replicate(n = 5000, sim_anova(stat = TRUE))
```
```{r cache-simulation, eval = FALSE, echo = FALSE}
f_stats = replicate(n = 5000, sim_anova(stat = TRUE))
dir.create("sim-result-cache", showWarnings = FALSE)
saveRDS(f_stats, "sim-result-cache/f_statistics_simulation.rds", version = 2)
```
```{r load-cached-fstats, echo = FALSE}
f_stats = readRDS("sim-result-cache/f_statistics_simulation.rds")
```
```{r sim-anova-graphs}
hist(f_stats, breaks = 100, prob = TRUE, border = "dodgerblue", main = "Empirical Distribution of F")
curve(df(x, df1 = 4 - 1, df2 = 40 - 4), col = "darkorange", add = TRUE, lwd = 2)
```
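As an additional quick check, the proportion of simulated statistics exceeding the 0.95 quantile of the $F(3, 36)$ distribution should be close to 0.05, since the simulation was performed under the null hypothesis.

```{r}
# proportion of simulated F statistics beyond the 0.95 quantile of F(3, 36)
mean(f_stats > qf(0.95, df1 = 4 - 1, df2 = 40 - 4))
```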
### Power
Now that we’re performing experiments, getting more data means finding more test subjects, running more lab tests, etc. In other words, it will cost more time and money.
We'd like to design our experiment so that we have a good chance of detecting an interesting effect size, without spending too much money. There's no point in running an experiment if there is only a very small chance that it detects an effect you care about. (Remember, not all statistically significant results have practical value.)
We'd like the ANOVA test to have high **power** for an alternative hypothesis with a minimum desired effect size.
\[
\text{Power } = P(\text{Reject } H_0 \mid H_0 \text{ False})
\]
That is, for a true difference of means that we deem interesting, we want the test to reject with high probability.
A number of things can affect the power of a test:
- **Effect size**. It is easier to detect larger effects.
- **Noise level** $\sigma$. The less noise, the easier it is to detect signal (effect). We don't have much ability to control this, except maybe to measure more accurately.
- **Significance level** $\alpha$. Lower significance level makes rejecting more difficult. (But also allows for fewer false positives.)
- **Sample size**. Larger samples make effects easier to detect.
- **Balanced design**. An equal number of observations per group leads to higher power.
The following simulations look at the effect of significance level, effect size, and noise level on the power of an ANOVA $F$-test. Homework will look into sample size and balance.
```{r}
p_vals = replicate(n = 1000, sim_anova(mu_a = -1, mu_b = 0, mu_c = 0, mu_d = 1,
sigma = 1.5, stat = FALSE))
mean(p_vals < 0.05)
mean(p_vals < 0.01)
```
```{r}
p_vals = replicate(n = 1000, sim_anova(mu_a = -1, mu_b = 0, mu_c = 0, mu_d = 1,
sigma = 2.0, stat = FALSE))
mean(p_vals < 0.05)
mean(p_vals < 0.01)
```
```{r}
p_vals = replicate(n = 1000, sim_anova(mu_a = -2, mu_b = 0, mu_c = 0, mu_d = 2,
sigma = 2.0, stat = FALSE))
mean(p_vals < 0.05)
mean(p_vals < 0.01)
```
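Simulation is flexible, but for a balanced one-way design `R` also provides `power.anova.test()`, which computes power analytically. As a rough sketch matching the last scenario above (means of $-2, 0, 0, 2$, $\sigma = 2$, and 10 observations per group), the reported power should be in the same ballpark as the simulated rejection rate at $\alpha = 0.05$.

```{r}
means = c(-2, 0, 0, 2)
power.anova.test(groups = length(means), n = 10,
                 between.var = var(means), within.var = 2 ^ 2)
```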
## Post Hoc Testing
Suppose we reject the null hypothesis from the ANOVA test for equal means. That tells us that the means are not all equal. But which means differ? All of them? Some of them? The obvious strategy is to test all possible comparisons of two means. We can do this easily in `R`.
```{r}
with(coagulation, pairwise.t.test(coag, diet, p.adj = "none"))
# pairwise.t.test(coagulation$coag, coagulation$diet, p.adj = "none")
```
Notice the `pairwise.t.test()` function does not have a `data` argument. To avoid using `attach()` or the `$` operator, we introduce the `with()` function. The commented line would perform the same operation.
Also note that we are using the argument `p.adj = "none"`. What is this? An adjustment (in this case not an adjustment) to the p-value of each test. Why would we need to do this?
The adjustment is an attempt to correct for the [multiple testing problem](https://en.wikipedia.org/wiki/Multiple_comparisons_problem){target="_blank"}. (See also: [Relevant XKCD](https://xkcd.com/882/){target="_blank"}. ) Imagine that you knew ahead of time that you were going to perform 100 $t$-tests. Suppose you wish to do this with a false positive rate of $\alpha = 0.05$. If we use this significance level for each test, for 100 tests, we then expect 5 false positives. That means, with 100 tests, we're almost guaranteed to have at least one error.
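Assuming for the moment that the 100 tests are independent, we can quantify "almost guaranteed."

```{r}
# chance of at least one false positive among 100 independent tests, each at alpha = 0.05
1 - (1 - 0.05) ^ 100
```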
What we'd really like is for the [family-wise error rate (FWER)](https://en.wikipedia.org/wiki/Family-wise_error_rate){target="_blank"} to be 0.05. If we consider the 100 tests to be a single "experiment," the FWER is the rate of one or more false positives for the full experiment (100 tests). Consider it an error rate for an entire procedure, instead of a single test.
With this in mind, one of the simplest adjustments we can make is to increase the p-value for each test, depending on the number of tests. In particular, the Bonferroni correction simply multiplies each p-value by the number of tests (capping the result at 1).
\[
\text{p-value-bonf} = \min(1, n_{\text{tests}} \cdot \text{p-value})
\]
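This correction is also available in `R` via `p.adjust()`. As a small illustration with some hypothetical p-values, note how the largest adjusted value is capped at 1.

```{r}
p.adjust(c(0.01, 0.03, 0.40), method = "bonferroni") # three hypothetical p-values
```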
```{r}
with(coagulation, pairwise.t.test(coag, diet, p.adj = "bonferroni"))
```
We see that these p-values are much higher than the unadjusted p-values, so we are less likely to reject each individual test. As a result, the FWER is controlled at 0.05, instead of allowing an error rate of 0.05 for each individual test.
We can simulate the 100 test scenarios to illustrate this point.
```{r sim-p-val-function}
get_p_val = function() {
# create data for two groups, equal mean
y = rnorm(20, mean = 0, sd = 1)
g = c(rep("A", 10), rep("B", 10))
# p-value of t-test when null is true
glance(t.test(y ~ g, var.equal = TRUE))$p.value
}
```
```{r cache-p-val-simulation, eval = FALSE, echo = FALSE}
set.seed(1337)
simulation_100_p_val = mean(replicate(1000, any(replicate(100, get_p_val()) < 0.05)))
simulation_100_p_val_bonf = mean(replicate(1000, any(p.adjust(replicate(100, get_p_val()), "bonferroni") < 0.05)))
# Ensure sim-result-cache is present
dir.create("sim-result-cache", showWarnings = FALSE)
# Save cached versions
saveRDS(simulation_100_p_val, "sim-result-cache/simulation_100_p_val.rds", version = 2)
saveRDS(simulation_100_p_val_bonf, "sim-result-cache/simulation_100_p_val_bonf.rds", version = 2)
```
```{r, eval = FALSE}
set.seed(1337)
# FWER with 100 tests
# desired rate = 0.05
# no adjustment
mean(replicate(1000, any(replicate(100, get_p_val()) < 0.05)))
```
```{r load-cached-pval-anova, echo = FALSE}
simulation_100_p_val = readRDS("sim-result-cache/simulation_100_p_val.rds")
simulation_100_p_val
```
```{r, eval = FALSE}
# FWER with 100 tests
# desired rate = 0.05
# bonferroni adjustment
mean(replicate(1000, any(p.adjust(replicate(100, get_p_val()), "bonferroni") < 0.05)))
```
```{r load-cached-pval-bonferroni-anova, echo = FALSE}
simulation_100_p_val_bonf = readRDS("sim-result-cache/simulation_100_p_val_bonf.rds")
simulation_100_p_val_bonf
```
For the specific case of testing all two-way mean differences after an ANOVA test, there are [a number of potential methods](https://en.wikipedia.org/wiki/Post_hoc_analysis){target="_blank"} for making an adjustment of this type. The pros and cons of the potential methods are beyond the scope of this course. We choose a method for its ease of use, and to a lesser extent, its developer.
Tukey's Honest Significant Difference can be applied directly to an object which was created using `aov()`. It will adjust the p-values of the pairwise comparisons of the means to control the FWER, in this case, at 0.05. Notice it also gives confidence intervals for the differences of the means.
```{r}
TukeyHSD(coag_aov, conf.level = 0.95)
```
Based on these results, we see no significant difference between `A` and `D`, nor between `B` and `C`. All other pairwise comparisons are significant. If you return to the original boxplot, these results should not be surprising.
Also, nicely, we can easily produce a plot of these confidence intervals.
```{r}
plot(TukeyHSD(coag_aov, conf.level = 0.95))
```
The creator of this method, [John Tukey](https://en.wikipedia.org/wiki/John_Tukey){target="_blank"}, is an important figure in the history of data science. He essentially [predicted the rise of data science over 50 years ago](https://projecteuclid.org/euclid.aoms/1177704711){target="_blank"}. For some retrospective thoughts on those 50 years, see [this paper from David Donoho](http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf){target="_blank"}.
## Two-Way ANOVA
What if there is more than one factor variable? Why do we need to limit ourselves to experiments with only one factor? We don't! Consider the model
\[
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij} + \epsilon_{ijk}
\]
where $\epsilon_{ijk}$ are $N(0, \sigma^2)$ random variables.
We add constraints
\[
\sum_{i=1}^{I} \alpha_i = 0 \quad \quad \sum_{j=1}^{J} \beta_j = 0.
\]
and
\[
\sum_{i=1}^{I} (\alpha \beta)_{ij} = 0 \quad \text{for each } j, \quad \quad \sum_{j=1}^{J} (\alpha \beta)_{ij} = 0 \quad \text{for each } i.
\]
Here,
- $i = 1, 2, \ldots I$ where $I$ is the number of levels of factor $A$.
- $j = 1, 2, \ldots J$ where $J$ is the number of levels of factor $B$.
- $k = 1, 2, \ldots K$ where $K$ is the number of replicates per group.
Here, we can think of a group as a combination of a level from each of the factors. So for example, one group will receive level $2$ of factor $A$ and level $3$ of factor $B$. The number of replicates is the number of subjects in each group. Here $y_{135}$ would be the measurement for the fifth member (replicate) of the group for level $1$ of factor $A$ and level $3$ of factor $B$.
We call this setup an $I \times J$ **factorial design** with $K$ replicates. (Our current notation only allows for an equal number of replicates in each group. It isn't difficult to allow for different numbers of replicates for different groups, but we'll proceed using equal replicates per group, which, if possible, is desirable.)
- $\alpha_i$ measures the effect of level $i$ of factor $A$. We call these the **main effects** of factor $A$.
- $\beta_j$ measures the effect of level $j$ of factor $B$. We call these the **main effects** of factor $B$.
- $(\alpha \beta)_{ij}$ is a single parameter. We write $\alpha \beta$ to indicate that this parameter measures the **interaction** between the two main effects.
Under this setup, there are a number of models that we can compare. Consider a $2 \times 2$ factorial design. The following tables show the means for each of the possible groups under each model.
**Interaction** Model: $y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij} + \epsilon_{ijk}$
| | **Factor B, Level 1**| **Factor B, Level 2**|
|------------------|------------------|------------------|
| **Factor A, Level 1** | $\mu + \alpha_1 + \beta_1 + (\alpha\beta)_{11}$ | $\mu + \alpha_1 + \beta_2 + (\alpha\beta)_{12}$ |
| **Factor A, Level 2** | $\mu + \alpha_2 + \beta_1 + (\alpha\beta)_{21}$ | $\mu + \alpha_2 + \beta_2 + (\alpha\beta)_{22}$ |
**Additive** Model: $y_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk}$
| | **Factor B, Level 1**| **Factor B, Level 2**|
|------------------|------------------|------------------|
| **Factor A, Level 1** | $\mu + \alpha_1 + \beta_1$ | $\mu + \alpha_1 + \beta_2$ |
| **Factor A, Level 2** | $\mu + \alpha_2 + \beta_1$ | $\mu + \alpha_2 + \beta_2$ |
**Factor B** Only Model (One-Way): $y_{ijk} = \mu + \beta_j + \epsilon_{ijk}$
| | **Factor B, Level 1**| **Factor B, Level 2**|
|------------------|------------------|------------------|
| **Factor A, Level 1** | $\mu + \beta_1$ | $\mu + \beta_2$ |
| **Factor A, Level 2** | $\mu + \beta_1$ | $\mu + \beta_2$ |
**Factor A** Only Model (One-Way): $y_{ijk} = \mu + \alpha_i + \epsilon_{ijk}$
| | **Factor B, Level 1**| **Factor B, Level 2**|
|------------------|------------------|------------------|
| **Factor A, Level 1** | $\mu + \alpha_1$ | $\mu + \alpha_1$ |
| **Factor A, Level 2** | $\mu + \alpha_2$ | $\mu + \alpha_2$ |
**Null** Model: $y_{ijk} = \mu + \epsilon_{ijk}$
| | **Factor B, Level 1**| **Factor B, Level 2**|
|------------------|------------------|------------------|
| **Factor A, Level 1** | $\mu$ | $\mu$ |
| **Factor A, Level 2** | $\mu$ | $\mu$ |
The question, then, is which of these models we should use if we have two factors. The most important question to consider is whether or not we should model the **interaction**. Is the effect of Factor A the *same* for all levels of Factor B? In the additive model, yes. In the interaction model, no. Both models assign a (potentially) different mean to each group, but the additive model constrains how those means can differ.
Let's discuss these comparisons by looking at some examples. We'll first look at the `rats` data from the `faraway` package. There are two factors here: `poison` and `treat`. We use the `levels()` function to extract the levels of a factor variable.
```{r}
levels(rats$poison)
levels(rats$treat)
```
Here, 48 rats were randomly assigned both one of three poisons and one of four possible treatments. The experimenters then measured their survival time in tens of hours. This gives a total of 12 groups, each with 4 replicates.
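We can verify the balanced design by tabulating the two factors; each cell of the table should contain 4 observations.

```{r}
table(rats$poison, rats$treat)
```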
Before running any tests, we should first look at the data. We will create **interaction plots**, which will help us visualize the effect of one factor, as we move through the levels of another factor.
```{r, fig.height = 5, fig.width = 15}
par(mfrow = c(1, 2))
with(rats, interaction.plot(poison, treat, time, lwd = 2, col = 1:4))
with(rats, interaction.plot(treat, poison, time, lwd = 2, col = 1:3))
```
If there is no interaction, and thus an additive model is appropriate, we would expect to see parallel lines. That would mean that, when we change the level of one factor, there can be an effect on the response, but the difference between the levels of the other factor stays the same.
The obvious indication of interaction would be lines that cross while heading in different directions. Here we don't see that, but the lines aren't strictly parallel, and there is some overlap on the right panel. However, is this interaction effect significant?
Let's fit each of the possible models, then investigate their estimates for each of the group means.
```{r}
rats_int = aov(time ~ poison * treat, data = rats) # interaction model
rats_add = aov(time ~ poison + treat, data = rats) # additive model
rats_pois = aov(time ~ poison , data = rats) # single factor model
rats_treat = aov(time ~ treat, data = rats) # single factor model
rats_null = aov(time ~ 1, data = rats) # null model
```
To get the estimates, we'll create a table which we will predict on.
```{r}
rats_table = expand.grid(poison = unique(rats$poison), treat = unique(rats$treat))
rats_table
matrix(paste0(rats_table$poison, "-", rats_table$treat) , 4, 3, byrow = TRUE)
```
Since we'll be repeating ourselves a number of times, we write a function to perform the prediction. Some housekeeping is done to keep the estimates in order, and provide row and column names. Above, we've shown where each of the estimates will be placed in the resulting matrix.
```{r}
get_est_means = function(model, table) {
mat = matrix(predict(model, table), nrow = 4, ncol = 3, byrow = TRUE)
colnames(mat) = c("I", "II", "III")
rownames(mat) = c("A", "B", "C", "D")
mat
}
```
First, we obtain the estimates from the **interaction** model. Note that each cell has a different value.
```{r}
knitr::kable(get_est_means(model = rats_int, table = rats_table))
```
```{r, echo = FALSE, eval = FALSE}
model.tables(rats_int, type = "mean")
```
Next, we obtain the estimates from the **additive** model. Again, each cell has a different value. We also see that these estimates are somewhat close to those from the interaction model.
```{r}
knitr::kable(get_est_means(model = rats_add, table = rats_table))
```
To understand the difference, let's consider the effect of the treatments.
```{r}
additive_means = get_est_means(model = rats_add, table = rats_table)
additive_means["A",] - additive_means["B",]
```
```{r}
interaction_means = get_est_means(model = rats_int, table = rats_table)
interaction_means["A",] - interaction_means["B",]
```
This is the key difference between the interaction and additive models. The difference between the effect of treatments `A` and `B` is the **same** for each poison in the additive model. They are **different** in the interaction model.
The remaining three models are much simpler, having either only row or only column effects, or no effects at all in the case of the null model.
```{r}
knitr::kable(get_est_means(model = rats_pois, table = rats_table))
```
```{r}
knitr::kable(get_est_means(model = rats_treat, table = rats_table))
```
```{r}
knitr::kable(get_est_means(model = rats_null, table = rats_table))
```
To perform the needed tests, we will need to create another ANOVA table. (We'll skip the details of the sums of squares calculations and simply let `R` take care of them.)
| Source | Sum of Squares | Degrees of Freedom | Mean Square | $F$ |
|-----------|----------|----------------|-----------|---------|
| Factor A | SSA | $I -1$ | SSA / DFA | MSA / MSE |
| Factor B | SSB | $J -1$ | SSB / DFB | MSB / MSE |
| AB Interaction | SSAB | $(I -1)(J -1)$ | SSAB / DFAB | MSAB / MSE |
| Error | SSE | $IJ(K - 1)$ | SSE / DFE | |
| Total | SST | $IJK - 1$ | | |
The row for **AB Interaction** tests:
\[
H_0: \text{ All }(\alpha\beta)_{ij} = 0. \quad \text{vs} \quad H_1: \text{ Not all } (\alpha\beta)_{ij} \text{ are } 0.
\]
- Null Model: $y_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk}.$ (Additive Model.)
- Alternative Model: $y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij} + \epsilon_{ijk}.$ (Interaction Model.)
We reject the null when the $F$ statistic is large. Under the null hypothesis, the distribution of the test statistic is $F$ with degrees of freedom $(I -1)(J -1)$ and $IJ(K - 1)$.
The row for **Factor B** tests:
\[
H_0: \text{ All }\beta_{j} = 0. \quad \text{vs} \quad H_1: \text{ Not all } \beta_{j} \text{ are } 0.
\]
- Null Model: $y_{ijk} = \mu + \alpha_i + \epsilon_{ijk}.$ (Only Factor A Model.)
- Alternative Model: $y_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk}.$ (Additive Model.)
We reject the null when the $F$ statistic is large. Under the null hypothesis, the distribution of the test statistic is $F$ with degrees of freedom $J - 1$ and $IJ(K - 1)$.
The row for **Factor A** tests:
\[
H_0: \text{ All }\alpha_{i} = 0. \quad \text{vs} \quad H_1: \text{ Not all } \alpha_{i} \text{ are } 0.
\]
- Null Model: $y_{ijk} = \mu + \beta_j + \epsilon_{ijk}.$ (Only Factor B Model.)
- Alternative Model: $y_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk}.$ (Additive Model.)
We reject the null when the $F$ statistic is large. Under the null hypothesis, the distribution of the test statistic is $F$ with degrees of freedom $I - 1$ and $IJ(K - 1)$.
These tests should be performed according to the model **hierarchy**. First consider the test of interaction. If it is significant, we select the interaction model and perform no further testing. If interaction is not significant, we then consider the necessity of the individual factors of the additive model.
![Model Hierarchy](images/hierarchy.png)
```{r}
summary(aov(time ~ poison * treat, data = rats))
```
Using a significance level of $\alpha = 0.05$, we see that the interaction is not significant. Within the additive model, both factors are significant, so we select the additive model.
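The interaction row in this table is equivalent to a comparison of the additive and interaction models fit earlier. As a check, we could perform that comparison directly; the resulting $F$ statistic and p-value should match the `poison:treat` row above.

```{r}
anova(rats_add, rats_int)
```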
<!-- - TODO: mention no replication, interaction issue -->
<!-- - TODO: correct if int is not significant? -->
<!-- - TODO: here we pick the full additive model -->
Within the additive model, we could do further testing about the main effects.
```{r}
TukeyHSD(aov(time ~ poison + treat, data = rats))
```
```{r, echo = FALSE, eval = FALSE}
# future hw question?
# rate of dying = 1 / time
# change anything? does result change?
#plot(aov(1 / time ~ poison * treat, data = rats))
#summary(aov(1 / time ~ poison * treat, data = rats))
```
For an example **with** interaction, we investigate the `warpbreaks` dataset, a default dataset in `R`.
```{r, fig.height = 5, fig.width = 15}
par(mfrow = c(1, 2))
with(warpbreaks, interaction.plot(wool, tension, breaks, lwd = 2, col = 2:4))
with(warpbreaks, interaction.plot(tension, wool, breaks, lwd = 2, col = 2:3))
```
Either plot makes it rather clear that the `wool` and `tension` factors interact.
```{r}
summary(aov(breaks ~ wool * tension, data = warpbreaks))
```
Using an $\alpha$ of $0.05$, the ANOVA test finds that the interaction is significant, so we use the interaction model here.
<!-- TODO: two-way calculation details in an appendix -->
```{r, echo = FALSE, eval = FALSE}
TukeyHSD(aov(breaks ~ wool * tension, data = warpbreaks))
```
## `R` Markdown
The `R` Markdown file for this chapter can be found here:
- [`anova.Rmd`](anova.Rmd){target="_blank"}
The file was created using `R` version ``r paste0(version$major, "." ,version$minor)``.