-
Notifications
You must be signed in to change notification settings - Fork 0
/
hw4-solutions.Rmd
319 lines (256 loc) · 11.8 KB
/
hw4-solutions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
---
title: "POLS 503: Problem Set 4 Solutions"
author: "Jeffrey B. Arnold"
date: "May 15, 2015"
---
# Setup
```{r load, message = FALSE}
library("ggplot2")
library("dplyr")
library("broom")
library("tidyr")
source("http://uw-pols503.github.io/pols_503_sp15/hw/hw4-functions.R")
```
# Problems {#problems}
## Showing Confidence
We will revisit the sprinters data we considered in Problem Set 2.
```{r}
sprinters <- read.csv("http://uw-pols503.github.io/pols_503_sp15/data/sprinters.csv")
```
**a.** Estimate a linear regression of `finish` on `year` and `women`:
```{r}
mod1 <- lm(finish ~ year + women, data = sprinters)
```
We can create a plot in several ways.
Using the observed values and `augment()`:
This gets the critical value of the Student's $t$ distribution for the degrees of freedom in the regression and a 95% confidence interval (using 2 should be good enough, but I'm feeling pedantic today).
```{r}
tstar <- - qt((1 - 0.95)/ 2, df = mod1$df.residual)
```
```{r}
ggplot(mutate(augment(mod1),
sex = ifelse(as.logical(women), "women", "men")),
aes(x = year, y = .fitted,
ymax = .fitted + .se.fit * tstar,
ymin = .fitted - .se.fit * tstar,
colour = sex, fill = sex)) +
geom_ribbon(alpha = 0.2, colour = NA) +
geom_line() +
geom_point(mapping = aes(y = finish))
```
Using `predict()`:
```{r}
mod1_newdata <-
expand.grid(year = seq(min(sprinters$year), max(sprinters$year)),
women = c(0, 1))
mod1_yhat <-
bind_cols(mod1_newdata,
as.data.frame(predict(mod1, mod1_newdata, interval = "confidence"))) %>%
mutate(sex = ifelse(as.logical(women), "women", "men"))
ggplot() +
geom_ribbon(data = mod1_yhat,
mapping = aes(x = year, ymin = lwr, ymax = upr, fill = sex), alpha = 0.2) +
geom_line(data = mod1_yhat,
mapping = aes(x = year, y = fit, colour = sex)) +
geom_point(data = sprinters %>%
mutate(sex = ifelse(as.logical(women), "women", "men")),
mapping = aes(y = finish, x = year, colour = sex))
```
**b.** In this problem, I'll estimate a linear regression of `finish` on `year` and `women` and the interaction thereof:
```{r}
mod2 <- lm(finish ~ year * women, data = sprinters)
```
Although the plot of predicted values can be created in several ways (see the previous), but this time I'll use augment.
```{r}
ggplot(mutate(augment(mod2),
sex = ifelse(as.logical(women), "women", "men")),
aes(x = year, y = .fitted,
ymax = .fitted + .se.fit * tstar,
ymin = .fitted - .se.fit * tstar,
colour = sex, fill = sex)) +
geom_ribbon(alpha = 0.2, colour = NA) +
geom_line() +
geom_point(mapping = aes(y = finish))
```
**c.** In this problem, I'll estimate a linear regression of `finish` on `year`, `year` squared, and `women` and the interaction thereof:
```{r}
mod3 <- lm(finish ~ year * women + I(year ^ 2) * women, data = sprinters)
```
Although the plot of predicted values can be created in several ways (see the previous), but this time I'll use `augment()`.
```{r}
ggplot(mutate(augment(mod3),
sex = ifelse(as.logical(women), "women", "men")),
aes(x = year, y = .fitted,
ymax = .fitted + .se.fit * tstar,
ymin = .fitted - .se.fit * tstar,
colour = sex, fill = sex)) +
geom_ribbon(alpha = 0.2, colour = NA) +
geom_line() +
geom_point(mapping = aes(y = finish))
```
**d.** Compare the visual fit of these models to the data within the observed period. Which do you find plausible fits?
**e.** They have different predictions in 2156.
```{r}
predict(mod1, newdata = expand.grid(women = c(0, 1), year = 2156), interval = "confidence")
predict(mod2, newdata = expand.grid(women = c(0, 1), year = 2156), interval = "confidence")
predict(mod3, newdata = expand.grid(women = c(0, 1), year = 2156), interval = "confidence")
```
`mod1` predicts women will be faster than men.
`mod2` predicts that in 2156 the times of men and women will be equal.
`mod3` predicts that in 2156 women will be 4 seconds faster than men, and even slower in 2001.
**f.** Now create a new variable, the ratio of men’s time to women’s time in each
year.
Logit-transform this variable and regress it on year. Plot the results,with confidence intervals, on the scale of the ratio men’s time to women’s time (i.e., transform it back from logit).
Does this approach make any assumptions about men’s times or women’s times that might be problematic?
```{r}
sprinters2 <-
sprinters %>%
mutate(sex = ifelse(as.logical(women), "women", "men")) %>%
select(-women) %>%
spread(sex, finish) %>%
mutate(rat = men / women,
logit_rat = log(rat / (1 - rat))) %>%
na.omit()
logit <- function(x) log(x / (1 - x))
inv_logit <- function(x) exp(x) / (exp(x) + 1)
mod4 <- lm(logit(rat) ~ year, data = sprinters2)
ggplot(augment(mod4),
aes(x = year,
y = inv_logit(.fitted),
ymax = inv_logit(.fitted + .se.fit * 2),
ymin = inv_logit(.fitted - .se.fit * 2))) +
geom_ribbon(alpha = 0.2, colour = NA) +
geom_line() +
geom_point(mapping = aes(y = inv_logit(logit.rat.))) +
scale_y_continuous("logit(men finish / women finish)")
```
## Model Selection: Oil & Democracy
```{r}
ross95 <- read.csv("http://uw-pols503.github.io/pols_503_sp15/data/rossoildata.csv") %>%
group_by(id) %>%
mutate(oilL5 = lag(oil, 5),
metalL5 = lag(metal, 5),
GDPcapL5 = lag(GDPcap, 5)) %>%
filter(year == 1995) %>%
select(regime1, oilL5, metalL5, GDPcapL5, islam, oecd, cty_name, id, id1) %>%
na.omit() %>%
ungroup()
```
**a.**
```{r}
oil_mod1 <- lm(regime1 ~ oilL5 + metalL5 + GDPcapL5 + islam + oecd, data = ross95)
```
The standard error of the regression is
```{r}
glance(oil_mod1)$sigma
```
The change in `regime1` associated with a change from the 50th percentile to the 95th percentile of oil, holding all other variables at their means (including binary) is:
```{r}
oil_mod1_oil_predict_lo <-
predict(oil_mod1,
newdata = summarise(ross95,
oilL5 = median(oilL5),
metalL5 = mean(metalL5),
islam = mean(islam),
oecd = mean(oecd),
GDPcapL5 = mean(GDPcapL5)))
oil_mod1_oil_predict_hi <-
predict(oil_mod1,
newdata = summarise(ross95,
oilL5 = quantile(oilL5, 0.95),
metalL5 = mean(metalL5),
islam = mean(islam),
oecd = mean(oecd),
GDPcapL5 = mean(GDPcapL5)))
oil_mod1_oil_predict_hi -
oil_mod1_oil_predict_lo
```
**b.** Using the residuals from the regression in part a., create the following diagnostic plots:
```{r}
oil_mod1_aug <- augment(oil_mod1)
```
Plot predicted `regime1` versus `regime1`:
```{r}
ggplot(oil_mod1_aug, aes(x = regime1, y = .fitted)) +
geom_point() +
scale_y_continuous("predicted regime1")
```
Description of residuals against the fitted values
```{r}
ggplot(oil_mod1_aug, aes(x = regime1, y = .fitted)) +
geom_point() +
scale_y_continuous("predicted regime1")
```
To plot all the variables against their residuals in a single plot, I use the following
**dplyr** and **tidyr** code to make a new dataset with three variable: `varible` (variable name),
`x` (variable value), and `.fitted`.
```{r}
oil_mod1_resid_vs_x <-
oil_mod1_aug %>%
select(regime1, oilL5, metalL5, GDPcapL5, islam, oecd, .resid) %>%
gather(variable, x, -.resid)
```
Then it is easy to plot them on one plot,
```{r warning = FALSE, message = FALSE}
ggplot(oil_mod1_resid_vs_x, aes(x = x, y = .resid)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x + I(x^2)) +
facet_wrap(~ variable, scales = "free_x")
```
Alternatively, plot $x$ versus the $\sqrt{\epsilon_i^2}$
```{r, warning = FALSE, message = FALSE}
ggplot(oil_mod1_resid_vs_x, aes(x = x, y = sqrt(.resid^2))) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x + I(x^2)) +
facet_wrap(~ variable, scales = "free_x")
```
```{r}
bind_cols(tidy(oil_mod1),
data.frame(std.error.robust = sqrt(diag(car::hccm(oil_mod1))))) %>%
select(term, std.error, std.error.robust) %>%
mutate(std.error = round(std.error, 4),
std.error.robust = round(std.error.robust, 4))
```
1. Plot the residuals against the fitted values
2. Plot the residuals against each covariate
3. Plot the studentized residuals against the standardized hat values.
4. Calculate the heteroskedastic consistent standard errors and compare to the classical standard errors. Use the **car** function `hccm`.
**c.**
```{r}
oil_mod2<- lm(regime1 ~ logitBound(oilL5) + logitBound(metalL5) + log(GDPcapL5)
+ logitBound(islam) + oecd, data = ross95)
oil_mod3 <- lm(regime1 ~ logitBound(oilL5) + logitBound(metalL5) + log(GDPcapL5)
+ islam + oecd, data = ross95)
oil_mod4 <- lm(regime1 ~ oilL5 + metalL5 + log(GDPcapL5) + logitBound(islam)
+ oecd, data = ross95)
```
We can compare these models by $\sigma$:
```{r results='asis'}
model_comp <-
bind_rows(mutate(glance(mod1), model = "mod1"),
mutate(glance(mod2), model = "mod2"),
mutate(glance(mod3), model = "mod3"),
mutate(glance(mod4), model = "mod4"))
knitr::kable(model_comp %>% select(model, sigma))
```
c. Rerun the regression using either log or logit transformations on any covariates you see fit.
You will likely run several specifications.
In each run, record the standard error of the regression, and the expected change in `regime1` given a change in `oilL5` from the 50th percentile to the 95th percentile of the fully observed data.
See the appendix for some tips and warnings about transforming these data, though.
f. How much substantive difference does finding the best model make?
Be specific and concrete; i.e., show what each model does.
I’m asking for a more detailed answer than you usually see in articles.
How much substantive doubt is there in the result if we are not sure which of the models you fit is the "right" one?
g. Which model of those you have estimated do you trust most, and why?
What other problems in the specification or estimation method remain unaddressed by our efforts?
## Interpretation of results
Take special care in interpreting models in models with `logBound(x)` or `logitBound(x)` in the model formula.
In setting up a hypothetical scenario for post-estimation prediction, make sure both the dummy term and the log term are set consistent with each other.
For example, if the dummy is set to 0, the log must also be zero.
And if the log is set to something other than 0, the dummy must be set to 1.
Otherwise, you are asking the model to predict a logically impossible scenario; e.g., asking what happens when someone both smokes zero cigarettes and smokes twenty cigarettes in the same day.
I recommend either calculating the predicted values of `regime1` "by hand", or using the **simcf** package, as illustrated below.
Our old friend `predict()` is very unlikely to return results for models including these terms, though if it does return an answer it will agree with other methods.
* * *
Derived from of Christopher Adolph, "Problem Set 4", *POLS/CSSS 503*, University of Washington, Spring 2014. <http://faculty.washington.edu/cadolph/503/503hw4.pdf>. Used with permission.
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.