# Assumptions {#module-12}
## Learning Objectives
In this chapter we will review assumptions of MLMs and some ways to check them in R.
The learning objectives for this chapter are:
1. Understand the assumptions underpinning MLMs;
2. Develop and interpret R code and output for testing assumptions.
All materials for this chapter are available for download [here](https://www.learn-mlms.com/13-appendix.html).
## Data Demonstration
### Load Data and Dependencies
For this data demo, we will use the following packages:
```{r message=FALSE, warning=FALSE}
library(lme4) # for multilevel models
library(lmerTest) # for p-values
library(dplyr) # for data manipulation
library(ggplot2) # for graphing
```
The data for this chapter are a subsample of the 1982 High School and Beyond survey data collected by the National Center for Education Statistics, used as example data throughout Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods: SAGE Publications. These data are related to student math achievement.
```{r, eval=FALSE}
data <- read.csv('rb2002.csv')
```
```{r, echo = FALSE}
# this chunk actually loads the data, but is hidden in the rendered book
data <- read.csv('data/rb2002.csv', fileEncoding = "UTF-8-BOM")
```
### Assumptions of MLMs
In brief, the assumptions underlying MLMs are as follows:
1. The model is correctly specified (i.e., all the predictors associated with the outcome and relevant random effects are included);
2. The functional form is correct (e.g., the relationship between the predictors and outcome is linear if using a linear model);
3. Level-1 residuals are independent and normally distributed;
4. Level-2 residuals are independent and multivariate normally distributed;
5. Residuals at level-1 and level-2 are unrelated;
6. Predictors at one level are not related to errors at the other level.
In the data demonstration that follows, we show some techniques for checking these assumptions. However, there are many other ways to conduct similar checks; at the end of the day, assumption checks are a form of data exploration. If you know and prefer other methods of exploring the same facets of your data, go for it!
### Assumption 1: Model Specification
Specifying your model means determining which predictors to include and which parameters to estimate. A correctly specified model includes everything associated with your outcome and nothing that is not. This is impossible to know for certain, and most researchers are constrained by the resources they have to collect data and, ultimately, by the data they have. A general principle is to include all plausibly relevant IVs when predicting a DV. We recommend a three-step approach (sketched in code below):
1. Start with a null model to check the extent of clustering.
2. Run your "core" model: the model you expect to need according to theory.
3. Run a maximal model with all possible random effects, and trim effects whose variance estimates are (near) zero.
Comparing models with and without effects and reviewing the literature for important variables are sound strategies here. Ask experts to read your results and share their thoughts on whether anything is missing!
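To make the three steps concrete, here is a minimal sketch using this chapter's variables (`mathach`, `CWCses`, `ses_mean`, `SCHOOL`); the predictors and random effects in your own core and maximal models will, of course, depend on your theory and data.
```{r, eval=FALSE}
# Step 1: null model to gauge the extent of clustering (e.g., via the ICC)
null_model <- lmer(mathach ~ 1 + (1|SCHOOL), data = data)

# Step 2: "core" model with the predictors theory says should be included
core_model <- lmer(mathach ~ CWCses + ses_mean + (1|SCHOOL), data = data)

# Step 3: maximal model with all possible random effects; trim terms whose
# variance estimates are (near) zero or that prevent convergence
maximal_model <- lmer(mathach ~ CWCses*ses_mean + (1 + CWCses|SCHOOL), data = data)
```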
### Assumption 2: Functional Form is Correct
The functional form of your model refers to the function connecting your predictors and outcome. In linear models like MLMs, that form is (you guessed it!) linear. However, you can use MLMs for non-linear data by specifying other functions, which are not covered in the current set of modules. In our example data, let's look at the relationship between an outcome of math achievement (`mathach`) and a predictor of socioeconomic status (`ses`). We can get a sense of whether that relationship is linear with a scatterplot:
```{r}
data %>%
  ggplot(mapping = aes(x = ses, y = mathach)) +
  geom_point()
```
The scatterplot is hard to read because we have over 11,000 observations in our data. One option to improve readability is to reduce the opacity of each point, which gives a clearer picture of where points overlap:
```{r}
data %>%
  ggplot(mapping = aes(x = ses, y = mathach)) +
  geom_point(alpha = .2)
```
It looks loosely linear. Before we continue, for teaching purposes we'll take a subset of 30 schools to make plotting easier for the rest of this data demo. **For your data, you should consider all observations**; this is just to make examples clearer.
```{r}
data <- data %>%
  filter(SCHOOL %in% head(unique(SCHOOL), n = 30)) # keep the first 30 schools
```
Let's take a look at the relationship between SES and math achievement for each of the 30 schools in our subsetted data.
```{r}
data %>%
  ggplot() +
  geom_point(mapping = aes(x = ses, y = mathach)) +
  facet_wrap(~ SCHOOL)
```
Looks like we have vaguely linear relationships; nothing looks quadratic. This is a subjective determination, though; be open to other, non-linear functions if a straight line doesn't appear to do the job. Finally, let's graph all points on one plane again, but with regression lines overlaid for each school.
```{r}
data %>%
  ggplot(mapping = aes(x = ses, y = mathach, colour = factor(SCHOOL))) +
  geom_point(show.legend = FALSE) +
  geom_smooth(method = "lm", se = FALSE, show.legend = FALSE, fullrange = TRUE)
```
This isn't an explicit test of functional form, but it looks like there is variance in intercepts and slopes across schools that merits multilevel modelling (or cluster-robust standard errors, if you don't have any multilevel research questions).
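As an aside, if you only have single-level research questions, one alternative is an ordinary regression with cluster-robust standard errors. A minimal sketch, assuming the `sandwich` and `lmtest` packages (which are not otherwise used in these modules):
```{r, eval=FALSE}
library(sandwich) # for cluster-robust variance estimators
library(lmtest)   # for coeftest()

# single-level regression with standard errors clustered by school
ols <- lm(mathach ~ ses, data = data)
coeftest(ols, vcov = vcovCL(ols, cluster = ~ SCHOOL))
```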
### An Aside: Extracting Residuals
In linear regression, we have the assumption of homoscedasticity: the variance of residuals is roughly the same at any value of your predictor(s). In multilevel models, we have two classes of residuals: level-1 residuals and level-2 residuals. The remaining assumptions all deal with these residuals: that they are normally distributed, that they are unrelated to each other, and that they are unrelated to predictors at the same level or at the other level. We're going to look at a model in which math achievement is predicted by SES centered within cluster (school), with mean SES predicting both the intercept and the slope of cluster-centered SES. We'll estimate a random effect variance for the slope of cluster-centered SES, but not the covariance between the random intercept and random slope.
| Level | Equation |
|:-------|:---------|
|Level 1 | $mathach_{ij} = \beta_{0j} + \beta_{1j}cwc\_ses_{ij} + R_{ij}$|
|Level 2 | $\beta_{0j} = \gamma_{00} + \gamma_{01}ses\_mean_j + U_{0j}$|
| | $\beta_{1j} = \gamma_{10} + \gamma_{11}ses\_mean_j + U_{1j}$|
|Combined| $mathach_{ij} = \gamma_{00} + \gamma_{01}ses\_mean_j + \gamma_{10}cwc\_ses_{ij} + \gamma_{11}cwc\_ses_{ij}*ses\_mean_j + U_{0j} + U_{1j}cwc\_ses_{ij} + R_{ij}$|
```{r}
model <- lmer(mathach ~ CWCses*ses_mean + (1|SCHOOL) + (0 + CWCses|SCHOOL), data = data, REML = TRUE)
summary(model)
```
Let's extract the level-1 residuals and add them to our data to work with moving forward:
```{r}
data$l1resid <- residuals(model)
head(data$l1resid)
```
We'll extract the level-2 residuals shortly.
### Assumption 3: Level-1 Residuals are Independent and Normally Distributed
#### Independent
Our level-1 residuals should be independent of our level-1 predictors. For our example, if the level-1 residuals correlate with SES centered within cluster, we have heteroscedasticity, which we don't want. We can check this visually by creating a scatterplot of the predictor and the residuals:
```{r}
data %>%
  ggplot(mapping = aes(x = CWCses, y = l1resid)) +
  geom_point() +
  labs(x = "CWCses", y = "residuals")
```
Those don't look correlated; they look like a blob. We can check this statistically by calculating the correlation.
```{r}
cor.test(data$l1resid, data$CWCses)
```
That correlation is essentially zero, meaning that our level-1 residuals and level-1 predictor are indeed uncorrelated!
#### Normally Distributed
We can check whether our level-1 residuals are normally distributed with a histogram:
```{r}
data %>%
  ggplot(mapping = aes(x = l1resid)) +
  geom_histogram()
```
Those look roughly normal. There may be something odd going on in that the distribution looks bimodal, but apparent bimodality can be an artifact of the number of bins. For example, with fewer bins:
```{r}
data %>%
  ggplot(mapping = aes(x = l1resid)) +
  geom_histogram(bins = 15)
```
Let's clear this up with a quantile-quantile (QQ) plot:
```{r}
data %>%
  ggplot(mapping = aes(sample = l1resid)) +
  stat_qq()
```
A QQ plot plots the quantiles of two distributions against each other: the quantiles of our observed residuals (y-axis) against the quantiles of a theoretical normal distribution (x-axis). The points fall along a roughly straight line, which indicates that our distribution is approximately normal.
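If you find it hard to judge straightness by eye, recent versions of ggplot2 (3.0.0 or later) provide `stat_qq_line()` to overlay the reference line the points should follow under normality:
```{r}
data %>%
  ggplot(mapping = aes(sample = l1resid)) +
  stat_qq() +
  stat_qq_line()
```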
### Assumption 4: Level-2 Residuals are Independent and Multivariate Normal
To check our level-2 residual assumptions, we need to extract the level-2 residuals and add them to our data. This requires a little more work than extracting the level-1 residuals. We could easily add level-1 residuals to our dataset because every row is a student and has a residual. Level-2 residuals are at the school level, so the vector of residuals cannot be appended as easily. Let's create a level-2 dataset to work with instead.
```{r}
l2_data <- data %>%
  group_by(SCHOOL) %>% # group data by clustering variable, school
  mutate(
    mathach_mean = mean(mathach) # create mean math achievement per school
  ) %>%
  select(SCHOOL, ses_mean, mathach_mean) %>%
  unique() # keep one row per school, rather than repeating values for every student
```
Then let's add the level-2 residuals to this dataset. The `ranef(model)$SCHOOL` command returns two columns of residuals: the first column contains the intercept residuals ($U_{0j}$s), the second the slope residuals ($U_{1j}$s). Matrices are indexed as [row, column], so `[, 1]` gets all rows from column 1 and `[, 2]` gets all rows from column 2.
```{r}
l2_data$intercept_resid <- ranef(model)$SCHOOL[, 1]
l2_data$slope_resid <- ranef(model)$SCHOOL[, 2]
```
#### Independent
Like with level-1 residuals, we want our level-2 residuals to be independent of level-2 predictors. We also want our two level-2 residuals (intercept and slope residuals) to be independent of each other; this was not an issue at level 1, where there was only one residual term, $R_{ij}$. We can check whether our level-2 residuals are independent of the level-2 predictors and of one another as we did before: with a scatterplot and/or by calculating the correlation among them.
First, let's look at the relationship between the intercept residuals and the level-2 variable of mean SES.
```{r}
l2_data %>%
  ggplot(mapping = aes(x = intercept_resid, y = ses_mean)) +
  geom_point()
```
And the correlation:
```{r}
cor.test(l2_data$ses_mean, l2_data$intercept_resid)
```
Those look uncorrelated!
Next, let's check that the slope residuals are independent from the predictor of mean SES:
```{r}
l2_data %>%
  ggplot(mapping = aes(x = slope_resid, y = ses_mean)) +
  geom_point()
cor.test(l2_data$ses_mean, l2_data$slope_resid)
```
Those also look uncorrelated!
Finally, let's check whether the level-2 residuals are independent of each other:
```{r}
l2_data %>%
  ggplot(mapping = aes(x = slope_resid, y = intercept_resid)) +
  geom_point()
cor.test(l2_data$intercept_resid, l2_data$slope_resid)
```
Our model assumes that intercept and slope residuals are independent, but that doesn't seem to be the case here. We could relax this assumption by adding the intercept-slope covariance back into the model.
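As a sketch (we don't refit it here), estimating that covariance just requires combining the two random-effects terms into one:
```{r, eval=FALSE}
# (1 + CWCses|SCHOOL) estimates the intercept variance, the slope variance,
# and their covariance, instead of fixing the covariance to zero
model_cov <- lmer(mathach ~ CWCses*ses_mean + (1 + CWCses|SCHOOL), data = data, REML = TRUE)
summary(model_cov)
```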
#### Multivariate Normal
As we did with our level-1 residuals, we can check the normality of our level-2 residuals with histograms and QQ plots. First, intercept residuals:
```{r}
l2_data %>%
  ggplot(mapping = aes(x = intercept_resid)) +
  geom_histogram(binwidth = .75)
l2_data %>%
  ggplot(mapping = aes(sample = intercept_resid)) +
  stat_qq()
```
Those look a bit wonky. When residuals are not normal, the model's distributional assumptions don't hold, which can compromise your inferences, so that's a big issue. You might need to revisit your sampling strategy or your model: are predictors missing? You can read more in [Pek, Wong, and Wong (2018)](https://www.frontiersin.org/articles/10.3389/fpsyg.2018.02104/full).
Next, our slope residuals:
```{r}
l2_data %>%
  ggplot(mapping = aes(x = slope_resid)) +
  geom_histogram(binwidth = .50)
l2_data %>%
  ggplot(mapping = aes(sample = slope_resid)) +
  stat_qq()
```
Similar issue: wonky and potentially concerning.
### Assumption 5: Residuals at Level-1 and Level-2 are Independent
Our residuals at level-1 and level-2 should not be related to each other. We can check that with scatterplots and by calculating correlations. To do so, we first need to add our level-2 residuals to our level-1 dataset. As mentioned, each school has only one level-2 intercept residual and one level-2 slope residual, but as many level-1 residuals as there are students in that school. To add our level-2 residuals to the dataset, then, we need to replicate each one X times, where X is the number of students in the school. For example, if a school has 10 students, we need to repeat its level-2 residuals 10 times. (This is how level-2 predictors are encoded, too: for a school with 10 students, the mean SES value for that school repeats 10 times in the dataset.)
To do that, let's use the replication function `rep` with the `times` argument to repeat each value X times. We can find the number of students per school as follows:
```{r}
n_per_school <- data %>%
  group_by(SCHOOL) %>% # group by school
  select(SCHOOL) %>% # we only need the school IDs to count rows
  count() %>% # count students per school
  ungroup() %>%
  select(n) %>%
  unlist()
```
Then let's create vectors of the level-2 residuals, with each school's residuals repeated X times, and add them to our level-1 data:
```{r}
data$intercept_resid <- rep(l2_data$intercept_resid, times = n_per_school)
data$slope_resid <- rep(l2_data$slope_resid, times = n_per_school)
```
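Note that `rep()` relies on `data`, `l2_data`, and `ranef(model)$SCHOOL` all listing schools in the same order. If you're unsure about ordering, a safer sketch is to merge by school ID instead (this `left_join()` is an alternative to the `rep()` code above, not an additional step):
```{r, eval=FALSE}
# merge level-2 residuals into the level-1 data by school ID,
# so the match doesn't depend on row order
data <- data %>%
  left_join(
    l2_data %>% select(SCHOOL, intercept_resid, slope_resid),
    by = "SCHOOL"
  )
```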
Now that we have a dataset with all of our residuals, let's assess whether the level-1 and level-2 residuals are correlated with our trusty scatterplots and correlations. First, level-2 intercept residuals and level-1 residuals.
```{r}
data %>%
  ggplot(mapping = aes(x = l1resid, y = intercept_resid)) +
  geom_point()
cor.test(data$l1resid, data$intercept_resid)
```
This looks odd at first, but the striation occurs because each level-2 residual is paired with multiple level-1 residuals. Overall it looks fine: for each level-2 residual, there is a reasonable spread of level-1 residuals. We would be concerned if level-1 residuals were clustering such that, say, lower level-2 residuals went with lower level-1 residuals, or vice versa. Our correlation is also quite small, reinforcing that things seem fine.
Next, level-1 residuals with level-2 slope residuals:
```{r}
data %>%
  ggplot(mapping = aes(x = l1resid, y = slope_resid)) +
  geom_point()
cor.test(data$l1resid, data$slope_resid)
```
This similarly looks fine.
### Assumption 6: Level-1 Residuals Independent of Level-2 Predictors, Level-2 Residuals Independent of Level-1 Predictors
We've tested that our residuals are independent of predictors at their own level: level-1 residuals with `CWCses` and level-2 residuals with `ses_mean`. We also want residuals to be independent of predictors at the other level: level-1 residuals independent of `ses_mean`, and level-2 residuals independent of `CWCses`. How do we test this? Scatterplots and correlations, of course. Let's check level-1 residuals first:
```{r}
data %>%
  ggplot(mapping = aes(x = l1resid, y = ses_mean)) +
  geom_point()
cor.test(data$l1resid, data$ses_mean)
```
The scatterplot looks like there might be a pattern where lower values of `ses_mean` are associated with higher residuals, but the correlation belies this.
Next level-2 intercept residuals:
```{r}
data %>%
  ggplot(mapping = aes(x = intercept_resid, y = CWCses)) +
  geom_point()
cor.test(data$intercept_resid, data$CWCses)
```
Looks good!
And finally level-2 slope residuals:
```{r}
data %>%
  ggplot(mapping = aes(x = slope_resid, y = CWCses)) +
  geom_point()
cor.test(data$slope_resid, data$CWCses)
```
Looks great!
## Conclusion
In this chapter, we reviewed the assumptions underlying multilevel models and examined some methods to assess them.
This is the last chapter of the book! We've covered the fundamentals: when and why you would use a multilevel model, models of varying complexity and completeness, and considerations as you build your models like estimation, various data structures, and interpreting effect sizes. Thank you for reading. If you have any suggestions or feedback, feel free to get in touch on Twitter ([Mairead](www.twitter.com/maireadkshaw), [Jessica](www.twitter.com/jkayflake)) or on [GitHub](https://github.com/mkshaw/learn-mlms).
## Further Reading
Pek, J., Wong, O., & Wong, A. C. M. (2018). How to Address Non-normality: A Taxonomy of Approaches, Reviewed, and Illustrated. Frontiers in Psychology, 9. doi:10.3389/fpsyg.2018.02104
Snijders, T. A. B., & Bosker, R. J. (2011). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling: SAGE Publications.