---
title: "Linear Regression Functions in R"
output: html_document
---
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { extensions: ["autoload-all.js"] }
});
</script>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This tutorial uses the following libraries,
```{r message = FALSE}
library("ggplot2")
library("dplyr")
library("car")
library("broom")
```
## Regression
### lm
This example makes use of the [Duncan Occupational Prestige](http://www.rdocumentation.org/packages/car/functions/Duncan) dataset included in the [car](https://cran.r-project.org/web/packages/car/index.html) package.
The data come from a classic sociology paper and record the prestige and other characteristics of 45 U.S. occupations in 1950.
```{r}
data("Duncan", package = "car")
```
The dataset `Duncan` contains four variables: `type`, `income`, `education`, and `prestige`,
```{r}
glimpse(Duncan)
```
You run a regression in R using the function `lm`.
This runs a linear regression of occupational prestige on income,
```{r}
lm(prestige ~ income, data = Duncan)
```
This estimates the linear regression
$$
\mathtt{prestige} = \beta_0 + \beta_1 \mathtt{income} + \epsilon
$$
In R, $\beta_0$ is named `(Intercept)`, and the other coefficients are named after the associated predictor.
The function `lm` returns an `lm` object that can be used in future computations.
Instead of printing the regression result to the screen, save it to the variable `mod1`,
```{r}
mod1 <- lm(prestige ~ income, data = Duncan)
```
We can print this object
```{r}
print(mod1)
```
Somewhat counterintuitively, the `summary` function returns **more** information about a regression,
```{r}
summary(mod1)
```
The `summary` function also returns an object that we can use later,
```{r}
summary_mod1 <- summary(mod1)
summary_mod1
```
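The summary object is a list; its components are documented in `help("summary.lm")`. For example, we can extract the coefficient table, $R^2$, and the residual standard error directly,
```{r}
summary_mod1$coefficients  # estimates, std. errors, t-values, p-values
summary_mod1$r.squared     # R-squared
summary_mod1$sigma         # residual standard error
```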
Now let's estimate a multiple linear regression,
```{r}
mod2 <- lm(prestige ~ income + education + type, data = Duncan)
mod2
```
The first argument of `lm` is a *formula*: the response goes on the left of the `~` and the predictors, separated by `+`, go on the right; an intercept is included automatically. Other useful operators inside formulas: `:` specifies an interaction, `*` crosses variables (main effects plus their interactions), `-` removes terms, and `I()` protects arithmetic so that it is interpreted literally rather than as formula syntax.
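For example, a few common patterns with the Duncan data,
```{r}
# Main effects only
lm(prestige ~ income + education, data = Duncan)
# Main effects of income and type, plus their interaction
lm(prestige ~ income * type, data = Duncan)
# A quadratic term: inside I(), ^ is arithmetic squaring, not formula crossing
lm(prestige ~ income + I(income^2), data = Duncan)
```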
### Coefficients, Standard errors
Coefficients, $\hat{\boldsymbol{\beta}}$:
```{r}
coef(mod2)
```
Variance-covariance matrix of the coefficients, $\operatorname{Var}(\hat{\boldsymbol{\beta}})$:
```{r}
vcov(mod2)
```
The standard errors of the coefficients, $\operatorname{se}(\hat{\boldsymbol{\beta}})$, are the square roots of the diagonal of the `vcov` matrix,
```{r}
sqrt(diag(vcov(mod2)))
```
This can be confirmed by comparing their values to those in the summary table,
```{r}
summary(mod2)
```
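We can also check the match programmatically: `coef(summary(mod2))` returns the coefficient table as a matrix, whose `"Std. Error"` column should equal the square roots of the diagonal of `vcov`,
```{r}
all.equal(sqrt(diag(vcov(mod2))),
          coef(summary(mod2))[, "Std. Error"])
```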
### Residuals and Fitted Values
To get the fitted or predicted values ($\hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol\beta}$) from a regression,
```{r}
mod1_fitted <- fitted(mod1)
head(mod1_fitted)
```
or
```{r}
mod1_predict <- predict(mod1)
head(mod1_predict)
```
The difference between `fitted` and `predict` is that `fitted` only returns fitted values for the data used to estimate the model, while `predict` can also compute predicted values for new data supplied through its `newdata` argument.
For example, create a data frame with each category of `type`, and in which `income` and `education` are set to their mean values.
```{r}
Duncan_at_means <-
data.frame(type = unique(Duncan$type),
income = mean(Duncan$income),
education = mean(Duncan$education))
Duncan_at_means
```
Now use this with the `newdata` argument,
```{r}
predict(mod2, newdata = Duncan_at_means)
```
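`predict` can also return standard errors or confidence intervals for the predictions; see `help("predict.lm")` for all the options,
```{r}
# Standard errors of the predicted values
predict(mod2, newdata = Duncan_at_means, se.fit = TRUE)
# 95% confidence intervals for the expected values
predict(mod2, newdata = Duncan_at_means, interval = "confidence")
```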
To get the residuals ($\hat{\boldsymbol{\epsilon}} = \mathbf{y} - \hat{\mathbf{y}}$),
```{r}
mod1_resid <- residuals(mod1)
head(mod1_resid)
```
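As a sanity check, the residuals are the observed responses minus the fitted values,
```{r}
all.equal(unname(residuals(mod1)),
          unname(Duncan$prestige - fitted(mod1)))
```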
### Broom
The **broom** package has functions that reformat the results of statistical modeling functions (`t.test`, `lm`, etc.) into data frames that work more smoothly with **ggplot2**, **dplyr**, and friends.
The **broom** package has three main functions:
- `glance`: Information about the model.
- `tidy`: Information about the estimated parameters.
- `augment`: The original data with estimates from the model added.

`glance`: always returns a one-row data frame that is a summary of the model: e.g. $R^2$, adjusted $R^2$, etc.
```{r}
glance(mod2)
```
`tidy`: transforms the coefficients, standard errors, test statistics, and p-values (and confidence intervals, if requested) into a ready-to-use data frame,
```{r}
tidy(mod2)
```
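Because `tidy` returns an ordinary data frame, its output is easy to plot with **ggplot2**. For example, a minimal sketch of a coefficient plot, using the rough approximation of the estimate plus or minus two standard errors for 95% confidence intervals,
```{r}
tidy(mod2) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = term, y = estimate,
             ymin = estimate - 2 * std.error,
             ymax = estimate + 2 * std.error)) +
  geom_pointrange() +
  coord_flip() +
  theme_minimal()
```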
`augment`: adds columns to the original data that was modeled, including predictions, standard errors of the predictions, residuals, and others,
```{r}
augment(mod2) %>% head()
```
- `.fitted`: the model predictions for all observations
- `.se.fit`: the standard errors of the predictions
- `.resid`: the residuals of the predictions (actual minus predicted values)
- `.sigma`: the estimate of the residual standard deviation when the corresponding observation is dropped from the model

The other columns, `.hat`, `.cooksd`, and `.std.resid`, are used in regression diagnostics.
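For example, we can sort observations by Cook's distance to flag potentially influential ones,
```{r}
augment(mod2) %>%
  arrange(desc(.cooksd)) %>%
  head(3)
```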
### Plotting Fitted Regression Results
Consider the regression of prestige on income,
```{r}
mod3 <- lm(prestige ~ income, data = Duncan)
```
The following creates a new dataset with a single column, `income`, containing 100 evenly spaced values between the minimum and maximum incomes observed in the Duncan data.
```{r}
mod3_newdata <- tibble(income = seq(min(Duncan$income), max(Duncan$income),
                                    length.out = 100))
```
We then calculate fitted values for all these values of `income` and plot the fitted line over the original data,
```{r}
ggplot() +
geom_point(data = Duncan,
mapping = aes(x = income, y = prestige), colour = "gray75") +
geom_line(data = augment(mod3, newdata = mod3_newdata),
mapping = aes(x = income, y = .fitted)) +
ylab("Prestige") +
xlab("Income") +
theme_minimal()
```
Now plot something similar, but for a regression with `income` interacted with `type`,
```{r}
mod4 <- lm(prestige ~ income * type, data = Duncan)
```
We want to create a dataset that has (1) each value of `type` in the Duncan data, and (2) values spanning the range of `income` in the Duncan data.
The function `expand.grid` creates a data frame with all combinations of vectors given to it (Cartesian product).
```{r}
mod4_newdata <- expand.grid(income = seq(min(Duncan$income), max(Duncan$income),
                                         length.out = 100),
                            type = unique(Duncan$type))
```
Now plot the fitted values evaluated at each of these values along with the original values in the data,
```{r}
ggplot() +
geom_point(data = Duncan,
mapping = aes(x = income, y = prestige, color = type)) +
geom_line(data = augment(mod4, newdata = mod4_newdata),
mapping = aes(x = income, y = .fitted, color = type)) +
ylab("Prestige") +
xlab("Income") +
theme_minimal()
```
Running `geom_smooth` with `method = "lm"` gives similar results.
However, note that `geom_smooth` will run a **separate** regression for each group, rather than a single model with interaction terms.
```{r}
ggplot(data = Duncan, aes(x = income, y = prestige, color = type)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ylab("Prestige") +
xlab("Income") +
theme_minimal()
```
### Plotting Residuals
A common diagnostic is to plot the residuals against the fitted values: under the classical assumptions, the residuals should scatter around zero with no visible pattern. Here is a minimal sketch using `augment`,
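```{r}
ggplot(data = augment(mod2),
       mapping = aes(x = .fitted, y = .resid)) +
  geom_hline(yintercept = 0, colour = "gray75") +
  geom_point() +
  ylab("Residual") +
  xlab("Fitted value") +
  theme_minimal()
```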