-
Notifications
You must be signed in to change notification settings - Fork 230
/
model-basics.Rmd
594 lines (482 loc) · 17.5 KB
/
model-basics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
---
output: html_document
editor_options:
chunk_output_type: console
---
# Model basics {#model-basics .r4ds-section}
## Introduction {#introduction-15 .r4ds-section}
```{r setup,message=FALSE,cache=FALSE}
library("tidyverse")
library("modelr")
```
The option `na.action` determines how missing values are handled.
It is a function.
`na.warn` sets it so that there is a warning if there are any missing values.
If it is not set (the default), R will silently drop them.
```{r}
options(na.action = na.warn)
```
## A simple model {#a-simple-model .r4ds-section}
### Exercise 23.2.1 {.unnumbered .exercise data-number="23.2.1"}
<div class="question">
One downside of the linear model is that it is sensitive to unusual values because the distance incorporates a squared term. Fit a linear model to the simulated data below, and visualize the results. Rerun a few times to generate different simulated datasets. What do you notice about the model?
</div>
<div class="answer">
```{r}
sim1a <- tibble(
x = rep(1:10, each = 3),
y = x * 1.5 + 6 + rt(length(x), df = 2)
)
```
Let's run it once and plot the results:
```{r}
ggplot(sim1a, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
```
We can also do this more systematically, by generating several simulations
and plotting the line.
```{r}
simt <- function(i) {
tibble(
x = rep(1:10, each = 3),
y = x * 1.5 + 6 + rt(length(x), df = 2),
.id = i
)
}
sims <- map_df(1:12, simt)
ggplot(sims, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", colour = "red") +
facet_wrap(~.id, ncol = 4)
```
What if we did the same things with normal distributions?
```{r}
sim_norm <- function(i) {
tibble(
x = rep(1:10, each = 3),
y = x * 1.5 + 6 + rnorm(length(x)),
.id = i
)
}
simdf_norm <- map_df(1:12, sim_norm)
ggplot(simdf_norm, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", colour = "red") +
facet_wrap(~.id, ncol = 4)
```
There are not large outliers, and the slopes are more similar.
The reason for this is that the Student's $t$-distribution, from which we sample with `rt()` has heavier tails than the normal distribution (`rnorm()`). This means that the Student's t-distribution
assigns a larger probability to values further from the center of the distribution.
```{r}
tibble(
x = seq(-5, 5, length.out = 100),
normal = dnorm(x),
student_t = dt(x, df = 2)
) %>%
pivot_longer(-x, names_to="distribution", values_to="density") %>%
ggplot(aes(x = x, y = density, colour = distribution)) +
geom_line()
```
For a normal distribution with mean zero and standard deviation one, the probability of being greater than 2 is,
```{r}
pnorm(2, lower.tail = FALSE)
```
For a Student's $t$ distribution with degrees of freedom = 2, it is more than 3 times higher,
```{r}
pt(2, df = 2, lower.tail = FALSE)
```
</div>
### Exercise 23.2.2 {.unnumbered .exercise data-number="23.2.2"}
<div class="question">
One way to make linear models more robust is to use a different distance measure. For example, instead of root-mean-squared distance, you could use mean-absolute distance:
</div>
<div class="answer">
```{r}
measure_distance <- function(mod, data) {
diff <- data$y - make_prediction(mod, data)
mean(abs(diff))
}
```
For the above function to work, we need to define a function, `make_prediction()`, that
takes a numeric vector of length two (the intercept and slope) and returns the predictions,
```{r}
make_prediction <- function(mod, data) {
mod[1] + mod[2] * data$x
}
```
Using the `sim1a` data, the best parameters of the least absolute deviation are:
```{r}
best <- optim(c(0, 0), measure_distance, data = sim1a)
best$par
```
Using the `sim1a` data, while the parameters the minimize the least squares objective function are:
```{r}
measure_distance_ls <- function(mod, data) {
diff <- data$y - (mod[1] + mod[2] * data$x)
sqrt(mean(diff^2))
}
best <- optim(c(0, 0), measure_distance_ls, data = sim1a)
best$par
```
In practice, I suggest not using `optim()` to fit this model, and instead using an existing implementation.
The `rlm()` and `lqs()` functions in the [MASS](https://CRAN.R-project.org/package=MASS) fit robust and resistant linear models.
</div>
### Exercise 23.2.3 {.unnumbered .exercise data-number="23.2.3"}
<div class="question">
One challenge with performing numerical optimization is that it’s only guaranteed to find a local optimum. What’s the problem with optimizing a three parameter model like this?
</div>
<div class="answer">
```{r}
model3 <- function(a, data) {
a[1] + data$x * a[2] + a[3]
}
```
The problem is that you for any values `a[1] = a1` and `a[3] = a3`, any other values of `a[1]` and `a[3]` where `a[1] + a[3] == (a1 + a3)` will have the same fit.
```{r}
measure_distance_3 <- function(a, data) {
diff <- data$y - model3(a, data)
sqrt(mean(diff^2))
}
```
Depending on our starting points, we can find different optimal values:
```{r}
best3a <- optim(c(0, 0, 0), measure_distance_3, data = sim1)
best3a$par
```
```{r}
best3b <- optim(c(0, 0, 1), measure_distance_3, data = sim1)
best3b$par
```
```{r}
best3c <- optim(c(0, 0, 5), measure_distance_3, data = sim1)
best3c$par
```
In fact there are an infinite number of optimal values for this model.
<!-- How to discuss this better ?
Problem is that due to finite iterations, numerically these converge:
> sum(best3a$par[c(1, 3)])
[1] 4.220074
> sum(best3b$par[c(1, 3)])
[1] 4.220404
> sum(best3c$par[c(1, 3)])
[1] 4.22117
-->
</div>
## Visualising models {#visualising-models .r4ds-section}
### Exercise 23.3.1 {.unnumbered .exercise data-number="23.3.1"}
<div class="question">
Instead of using `lm()` to fit a straight line, you can use `loess()` to fit a smooth curve. Repeat the process of model fitting, grid generation, predictions, and visualization on `sim1` using `loess()` instead of `lm()`. How does the result compare to `geom_smooth()`?
</div>
<div class="answer">
I'll use `add_predictions()` and `add_residuals()` to add the predictions and residuals from a loess regression to the `sim1` data.
```{r}
sim1_loess <- loess(y ~ x, data = sim1)
sim1_lm <- lm(y ~ x, data = sim1)
grid_loess <- sim1 %>%
add_predictions(sim1_loess)
sim1 <- sim1 %>%
add_residuals(sim1_lm) %>%
add_predictions(sim1_lm) %>%
add_residuals(sim1_loess, var = "resid_loess") %>%
add_predictions(sim1_loess, var = "pred_loess")
```
This plots the loess predictions.
The loess produces a nonlinear, smooth line through the data.
```{r}
plot_sim1_loess <-
ggplot(sim1, aes(x = x, y = y)) +
geom_point() +
geom_line(aes(x = x, y = pred), data = grid_loess, colour = "red")
plot_sim1_loess
```
The predictions of loess are the same as the default method for `geom_smooth()` because `geom_smooth()` uses `loess()` by default; the message even tells us that.
```{r message=TRUE}
plot_sim1_loess +
geom_smooth(method = "loess", colour = "blue", se = FALSE, alpha = 0.20)
```
We can plot the residuals (red), and compare them to the residuals from `lm()` (black).
In general, the loess model has smaller residuals within the sample (out of sample is a different issue, and we haven't considered the uncertainty of these estimates).
```{r}
ggplot(sim1, aes(x = x)) +
geom_ref_line(h = 0) +
geom_point(aes(y = resid)) +
geom_point(aes(y = resid_loess), colour = "red")
```
</div>
### Exercise 23.3.2 {.unnumbered .exercise data-number="23.3.2"}
<div class="question">
`add_predictions()` is paired with `gather_predictions()` and `spread_predictions()`.
How do these three functions differ?
</div>
<div class="answer">
The functions `gather_predictions()` and `spread_predictions()` allow for adding predictions from multiple models at once.
Taking the `sim1_mod` example,
```{r}
sim1_mod <- lm(y ~ x, data = sim1)
grid <- sim1 %>%
data_grid(x)
```
The function `add_predictions()` adds only a single model at a time.
To add two models:
```{r}
grid %>%
add_predictions(sim1_mod, var = "pred_lm") %>%
add_predictions(sim1_loess, var = "pred_loess")
```
The function `gather_predictions()` adds predictions from multiple models by
stacking the results and adding a column with the model name,
```{r}
grid %>%
gather_predictions(sim1_mod, sim1_loess)
```
The function `spread_predictions()` adds predictions from multiple models by
adding multiple columns (postfixed with the model name) with predictions from each model.
```{r}
grid %>%
spread_predictions(sim1_mod, sim1_loess)
```
The function `spread_predictions()` is similar to the example which runs `add_predictions()` for each model, and is equivalent to running `spread()` after
running `gather_predictions()`:
```{r}
grid %>%
gather_predictions(sim1_mod, sim1_loess) %>%
spread(model, pred)
```
</div>
### Exercise 23.3.3 {.unnumbered .exercise data-number="23.3.3"}
<div class="question">
What does `geom_ref_line()` do? What package does it come from?
Why is displaying a reference line in plots showing residuals useful and important?
</div>
<div class="answer">
The geom `geom_ref_line()` adds as reference line to a plot.
It is equivalent to running `geom_hline()` or `geom_vline()` with default settings that are useful for visualizing models.
Putting a reference line at zero for residuals is important because good models (generally) should have residuals centered at zero, with approximately the same variance (or distribution) over the support of x, and no correlation.
A zero reference line makes it easier to judge these characteristics visually.
</div>
### Exercise 23.3.4 {.unnumbered .exercise data-number="23.3.4"}
<div class="question">
Why might you want to look at a frequency polygon of absolute residuals?
What are the pros and cons compared to looking at the raw residuals?
</div>
<div class="answer">
Showing the absolute values of the residuals makes it easier to view the spread of the residuals.
The model assumes that the residuals have mean zero, and using the absolute values of the residuals effectively doubles the number of residuals.
```{r}
sim1_mod <- lm(y ~ x, data = sim1)
sim1 <- sim1 %>%
add_residuals(sim1_mod)
ggplot(sim1, aes(x = abs(resid))) +
geom_freqpoly(binwidth = 0.5)
```
However, using the absolute values of residuals throws away information about the sign, meaning that the
frequency polygon cannot show whether the model systematically over- or under-estimates the residuals.
</div>
## Formulas and model families {#formulas-and-model-families .r4ds-section}
### Exercise 23.4.1 {.unnumbered .exercise data-number="23.4.1"}
<div class="question">
What happens if you repeat the analysis of `sim2` using a model without an intercept. What happens to the model equation?
What happens to the predictions?
</div>
<div class="answer">
To run a model without an intercept, add `- 1` or `+ 0` to the right-hand-side o f the formula:
```{r}
mod2a <- lm(y ~ x - 1, data = sim2)
```
```{r}
mod2 <- lm(y ~ x, data = sim2)
```
The predictions are exactly the same in the models with and without an intercept:
```{r}
grid <- sim2 %>%
data_grid(x) %>%
spread_predictions(mod2, mod2a)
grid
```
</div>
### Exercise 23.4.2 {.unnumbered .exercise data-number="23.4.2"}
<div class="question">
Use `model_matrix()` to explore the equations generated for the models I fit to `sim3` and `sim4`.
Why is `*` a good shorthand for interaction?
</div>
<div class="answer">
For `x1 * x2` when `x2` is a categorical variable produces indicator variables `x2b`, `x2c`, `x2d` and
variables `x1:x2b`, `x1:x2c`, and `x1:x2d` which are the products of `x1` and `x2*` variables:
```{r}
x3 <- model_matrix(y ~ x1 * x2, data = sim3)
x3
```
We can confirm that the variables `x1:x2b` is the product of `x1` and `x2b`,
```{r}
all(x3[["x1:x2b"]] == (x3[["x1"]] * x3[["x2b"]]))
```
and similarly for `x1:x2c` and `x2c`, and `x1:x2d` and `x2d`:
```{r}
all(x3[["x1:x2c"]] == (x3[["x1"]] * x3[["x2c"]]))
all(x3[["x1:x2d"]] == (x3[["x1"]] * x3[["x2d"]]))
```
For `x1 * x2` where both `x1` and `x2` are continuous variables, `model_matrix()` creates variables
`x1`, `x2`, and `x1:x2`:
```{r}
x4 <- model_matrix(y ~ x1 * x2, data = sim4)
x4
```
Confirm that `x1:x2` is the product of the `x1` and `x2`,
```{r}
all(x4[["x1"]] * x4[["x2"]] == x4[["x1:x2"]])
```
The asterisk `*` is good shorthand for an interaction since an interaction between `x1` and `x2` includes
terms for `x1`, `x2`, and the product of `x1` and `x2`.
</div>
### Exercise 23.4.3 {.unnumbered .exercise data-number="23.4.3"}
<div class="question">
Using the basic principles, convert the formulas in the following two models into functions.
(Hint: start by converting the categorical variable into 0-1 variables.)
</div>
```{r}
mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
```
<div class="answer">
The problem is to convert the formulas in the models into functions.
I will assume that the function is only handling the conversion of the right hand side of the formula into a model matrix.
The functions will take one argument, a data frame with `x1` and `x2` columns,
and it will return a data frame.
In other words, the functions will be special cases of the `model_matrix()` function.
Consider the right hand side of the first formula, `~ x1 + x2`.
In the `sim3` data frame, the column `x1` is an integer, and the variable `x2` is a factor with four levels.
```{r}
levels(sim3$x2)
```
Since `x1` is numeric it is unchanged.
Since `x2` is a factor it is replaced with columns of indicator variables for all but one of its levels.
I will first consider the special case in which `x2` only takes the levels of `x2` in `sim3`.
In this case, "a" is considered the reference level and omitted, and new columns are made for "b", "c", and "d".
```{r}
model_matrix_mod1 <- function(.data) {
mutate(.data,
x2b = as.numeric(x2 == "b"),
x2c = as.numeric(x2 == "c"),
x2d = as.numeric(x2 == "d"),
`(Intercept)` = 1
) %>%
select(`(Intercept)`, x1, x2b, x2c, x2d)
}
```
```{r}
model_matrix_mod1(sim3)
```
A more general function for `~ x1 + x2` would not hard-code the specific levels in `x2`.
```{r}
model_matrix_mod1b <- function(.data) {
# the levels of x2
lvls <- levels(.data$x2)
# drop the first level
# this assumes that there are at least two levels
lvls <- lvls[2:length(lvls)]
# create an indicator variable for each level of x2
for (lvl in lvls) {
# new column name x2 + level name
varname <- str_c("x2", lvl)
# add indicator variable for lvl
.data[[varname]] <- as.numeric(.data$x2 == lvl)
}
# generate the list of variables to keep
x2_variables <- str_c("x2", lvls)
# Add an intercept
.data[["(Intercept)"]] <- 1
# keep x1 and x2 indicator variables
select(.data, `(Intercept)`, x1, all_of(x2_variables))
}
```
```{r}
model_matrix_mod1b(sim3)
```
Consider the right hand side of the first formula, `~ x1 * x2`.
The output data frame will consist of `x1`, columns with indicator variables for each level (except the reference level) of `x2`,
and columns with the `x2` indicator variables multiplied by `x1`.
As with the previous formula, first I'll write a function that hard-codes the levels of `x2`.
```{r}
model_matrix_mod2 <- function(.data) {
mutate(.data,
`(Intercept)` = 1,
x2b = as.numeric(x2 == "b"),
x2c = as.numeric(x2 == "c"),
x2d = as.numeric(x2 == "d"),
`x1:x2b` = x1 * x2b,
`x1:x2c` = x1 * x2c,
`x1:x2d` = x1 * x2d
) %>%
select(`(Intercept)`, x1, x2b, x2c, x2d, `x1:x2b`, `x1:x2c`, `x1:x2d`)
}
```
```{r}
model_matrix_mod2(sim3)
```
For a more general function which will handle arbitrary levels in `x2`, I will
extend the `model_matrix_mod1b()` function that I wrote earlier.
```{r}
model_matrix_mod2b <- function(.data) {
# get dataset with x1 and x2 indicator variables
out <- model_matrix_mod1b(.data)
# get names of the x2 indicator columns
x2cols <- str_subset(colnames(out), "^x2")
# create interactions between x1 and the x2 indicator columns
for (varname in x2cols) {
# name of the interaction variable
newvar <- str_c("x1:", varname)
out[[newvar]] <- out$x1 * out[[varname]]
}
out
}
```
```{r}
model_matrix_mod2b(sim3)
```
These functions could be further generalized to allow for `x1` and `x2` to
be either numeric or factors. However, generalizing much more than that and
we will soon start reimplementing all of the `matrix_model()` function.
</div>
### Exercise 23.4.4 {.unnumbered .exercise data-number="23.4.4"}
<div class="question">
For `sim4`, which of `mod1` and `mod2` is better?
I think `mod2` does a slightly better job at removing patterns, but it’s pretty subtle.
Can you come up with a plot to support my claim?
</div>
<div class="answer">
Estimate models `mod1` and `mod2` on `sim4`,
```{r}
mod1 <- lm(y ~ x1 + x2, data = sim4)
mod2 <- lm(y ~ x1 * x2, data = sim4)
```
and add the residuals from these models to the `sim4` data,
```{r}
sim4_mods <- gather_residuals(sim4, mod1, mod2)
```
Frequency plots of both the residuals,
```{r}
ggplot(sim4_mods, aes(x = resid, colour = model)) +
geom_freqpoly(binwidth = 0.5) +
geom_rug()
```
and the absolute values of the residuals,
```{r}
ggplot(sim4_mods, aes(x = abs(resid), colour = model)) +
geom_freqpoly(binwidth = 0.5) +
geom_rug()
```
does not show much difference in the residuals between the models.
However, `mod2` appears to have fewer residuals in the tails of the distribution between 2.5 and 5 (although the most extreme residuals are from `mod2`.
This is confirmed by checking the standard deviation of the residuals of these models,
```{r}
sim4_mods %>%
group_by(model) %>%
summarise(resid = sd(resid))
```
The standard deviation of the residuals of `mod2` is smaller than that of `mod1`.
</div>
## Missing values {#missing-values-5 .r4ds-section}
`r no_exercises()`
## Other model families {#other-model-families .r4ds-section}
`r no_exercises()`