-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy path102-appendix-variability.Rmd
61 lines (48 loc) · 3.17 KB
/
102-appendix-variability.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
# Measures of Variability {#appendix-variability}
In Chapter \@ref(vocabulary), we make use of non-parametric measures of variability, especially MADM (mean absolute deviation from the median) rather than the more standard coefficient of variation (CV). In this brief Appendix, we show that these measures are very highly correlated in the limit. We note however that the two measures (CV and MADM) produce quite different answers for individual data points, especially those that are at the floor or ceiling of a particular form.
```{r appvar-vocab_data}
num_words <- items %>%
filter(type == "word") %>%
group_by(language, form) %>%
summarise(n = n())
vocab_data <- admins %>%
select(data_id, language, form, age, sex,
mom_ed, birth_order, production, comprehension) %>%
left_join(num_words) %>%
mutate(no_production = n - production)
```
```{r appvar-madm_normal}
ratios <- vocab_data %>%
filter(form %in% c(WSs,WGs)) %>%
group_by(language, form, age) %>%
filter(n() > 20) %>%
summarise(mmad = mmad(production),
madm = madm(production),
cv = cv(production),
d = d(production),
N = n())
madm_cv <- cor.test(ratios$madm, ratios$cv)
```
Figure \@ref(fig:appvar-ratios-cv) shows CV and MADM plotted together, with each point representing a single age group for a particular combination of form and language. The slope of the relationship between the two measures is strong ($r$ = `r roundp(madm_cv$estimate)`) and its slope is close to 1, despite some variation. Overall, it appears that for the majority of the data, CV is slightly lower than MADM, but that it is dramatically higher for some individual datasets. We speculate that this relationship is due to floor/ceiling effects and small sample effects. This analysis suggests that MADM, the non-parametric estimate we use, is less subject to extreme fluctuations than CV.
```{r appvar-ratios-cv, fig.cap="Coefficient of variation and MADM, with each point showing a particular combination of language, form, and age and the line indicating a linear model fit."}
ggplot(ratios, aes(x = madm, y = cv)) +
coord_fixed() +
geom_point(aes(size = N, col = language), alpha = .3) +
geom_smooth(method = "lm", colour = .pal()(1), size = 1.3) +
ylim(0,3) + xlim(0,3) +
geom_abline(linetype = .refline, colour = .grey) +
ylab("Coefficient of variation") +
xlab("MADM") +
scale_colour_manual(values = lang_colours, guide = FALSE)
```
```{r appvar-ratios-d, fig.cap="Cohen's d and MMAD, with each point showing a particular combination of language, form, and age and the line indicating a linear model fit."}
ggplot(ratios, aes(x = mmad, y = d)) +
geom_point(aes(size = N, col = language), alpha = .3) +
geom_smooth(method = "lm") +
ylim(0,3) + xlim(0,3) +
geom_abline(lty = 2) +
ylab("Cohen's d") +
xlab("MMAD") +
scale_colour_manual(values = lang_colours, guide = FALSE)
```
Our second analysis, shown in Figure \@ref(fig:appvar-ratios-d), is identical except that it plots Cohen's $d$ by MMAD. Each of these is the reciprocal of the related measure plotted above. (For example, $d = \frac{\mu}{\sigma}$ whereas $CV = \frac{\sigma}{\mu}$). Thus, the same relation holds.