Empirical example: returns to education \hfill ARE212: Section 05 \\ \hrule \bigskip
We’ll start section with some brief comments on the problem set and a plug for knitr. Then, we’ll look at the efficiency gains from a special case of generalized least squares using simulated data. Next we’ll take one more crack at dummy variable intuition and talk briefly about interaction terms. With remaining time, we’ll work through an empirical example to demonstrate the use of categorical dummies. Along the way, we’ll apply some more advanced tools of the trade, in particular using ggplot2 and exporting R tables in \LaTeX{} format.
In general, you guys did really well on the problem set. I’ve put the answer key on bSpace and will be handing your problem sets back today at the end of section. Many students found it useful to use \LaTeX{} and knitr (in RStudio) to format their assignments. This is great! The basic idea of knitr is that you can answer math or non-code questions using text or \LaTeX{} and then “knit” in your R code and/or results[fn:: sweave is similar — in fact, knitr is an update of sorts to sweave.] in the same file; this means that you don’t have to maintain separate files for your notes, results, and code. This process is a form of “Literate Programming”[fn:: A term coined originally by the computer scientist Donald Knuth.], and it can be very useful for developing research ideas, collaborating on projects, and, of course, writing problem sets. If you’re interested in learning how to use knitr, have a look at ps1_answers.rnw on bSpace — it’s the code Max and I used to generate the answer key.
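To give a flavor of what this looks like, a knitr (.rnw) file intersperses \LaTeX{} text with code chunks; the chunk below is a minimal, made-up example (the chunk name and options are arbitrary, not taken from the answer key):
<<simple-example, echo=TRUE>>=
x <- rnorm(100)   # simulate some data
mean(x)           # this result is knit directly into the compiled PDF
@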
This section will briefly outline two general concepts, GLS and ggplot. We will examine the characteristics of generalized least squares (GLS), and specifically the efficiency gains from a special case of GLS, weighted least squares (WLS). We will then roughly recreate the graphs from Figures 2.6 and 2.7 in the notes using ggplot2, a very popular graphing package in R.
Let $C$ be the weighting matrix with diagonal elements $1/\sqrt{w_i}$.[fn:: Weighting matrices that differ only by a constant multiple will also work. Since these are weighting matrices, this will not change our result.] The underlying data generating process in (2.102), which we simulate in the code below for a population of one million observations, is
\[ y_i = 0.5 + 1.5\,x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\ \sigma w_i), \qquad w_i = \frac{x_i^2}{100}, \qquad \sigma = 400. \]
set.seed(42)
OLS <- function(y,X) {
n <- nrow(X); k <- ncol(X)
b <- solve(t(X) %*% X) %*% t(X) %*% y; names(b) <- "Estimate" # b = (X'X)^{-1} X'y
e <- y - X %*% b; s2 <- as.numeric(t(e) %*% e / (n - k)); XpXinv <- solve(t(X) %*% X)
se <- sqrt(s2 * diag(XpXinv)); names(se) <- "Std. Error" # homoskedastic standard errors
return(data.frame(b,se))
}
pop.n <- 1000000
sigma <- 400
pop.x <- runif(pop.n, min = 0, max = 2000)
pop.w <- (1/100) * pop.x^2
pop.eps <- rnorm(pop.n, mean = 0, sd = sqrt(sigma * pop.w))
pop.y <- 0.5 + pop.x*1.5 + pop.eps
Note that rnorm() takes the standard deviation as its third argument, so we pass it the square root of the variance we want to generate. Now we’ll pull 1000 observations at random from our population — this is our sample. With these, we can calculate the standard OLS parameter vector $[\ah \hspace{6pt} \bh]′$ by noting that $[\ah \hspace{6pt} \bh]′ = (\Xp\X)^{-1}\Xp y$:
n <- 1000
indices <- sample(1:pop.n,n,replace=F)
x <- pop.x[indices]
y <- pop.y[indices]
X <- cbind(1, x) # add an intercept
b <- OLS(y,X)[ , 1]
print(b[2])
[1] 1.557292
The sample() function samples randomly from a vector, here without replacement. We create x by requesting a randomly sampled set of indices from pop.x. Let’s package this into a function, called rnd.beta, so that we can collect the OLS slope parameter for an arbitrary number of random samples.
rnd.beta <- function(i) {
indices <- sample(1:pop.n,n,replace=F)
x <- pop.x[indices]; y <- pop.y[indices]
X <- cbind(1, x) # add an intercept
b <- OLS(y,X)[ , 1]
return(b[2])
}
Since the supplied argument is ignored inside the function, each call simply draws a fresh random sample and returns an estimated slope coefficient:
rnd.beta()
rnd.beta()
[1] 1.638114
[1] 1.438083
We could do this in a for loop (like last time), but for pedagogical purposes and to be more R-ish, this time we’ll use sapply to apply the function to a list of indices (a for-loop version is sketched below for comparison). Now, replicating the process for B = 1000 random samples:
B <- 1000
beta.vec <- sapply(1:B, rnd.beta)
head(beta.vec)
mean(beta.vec)
[1] 1.574734 1.536203 1.594345 1.437183 1.399252 1.348865
[1] 1.498432
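For comparison, the same replication could be written with an explicit for loop; this is just an illustrative sketch that produces an equivalent vector of estimates:
beta.vec.loop <- rep(NA, B)        # pre-allocate storage
for (i in 1:B) {
beta.vec.loop[i] <- rnd.beta(i)    # one OLS slope per random sample
}
mean(beta.vec.loop)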
All right. Looking good. The average of the simulated estimates is much closer to the true slope of 1.5, suggesting that the distribution of the simulated parameters is centered on the true value; that is, OLS appears unbiased. Now, let’s create another, similar function that returns the WLS estimates. Note that we follow the notes exactly in creating C, which is a diagonal matrix with diagonal elements $1/\sqrt{w_i}$:
rnd.wls.beta <- function(i) {
indices <- sample(1:pop.n,n,replace=F)
x <- pop.x[indices]
y <- pop.y[indices]
w <- pop.w[indices]
C <- diag(1 / sqrt(w)) # equivalent to: C <- diag(10 / x)
y.wt <- C %*% y
X.wt <- C %*% cbind(1, x)
b.wt <- OLS(y.wt,X.wt)[ , 1]
return(b.wt[2])
}
wls.beta.vec <- sapply(1:B, rnd.wls.beta)
head(wls.beta.vec)
[1] 1.175240 1.569361 1.340202 2.207822 0.821469 1.822597
\newpage
We now have two collections of parameter estimates, one based on OLS and another based on WLS. It is straightforward to plot two separate histograms using R’s core histogram plotting function hist(). However, we can use this as a chance to introduce a more flexible, powerful graphing package called ggplot2.
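For reference, the base-R approach would be a pair of calls like the following (a minimal sketch; we use ggplot2 below instead):
hist(beta.vec, main = "OLS slope estimates", xlab = "beta")
hist(wls.beta.vec, main = "WLS slope estimates", xlab = "beta")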
library(ggplot2)
labels <- c(rep("ols", B), rep("wls", B))
data <- data.frame(beta=c(beta.vec, wls.beta.vec), method=labels)
ggplot(data, aes(x=beta, fill=method)) + geom_density(alpha=0.2)
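We can also quantify the efficiency gain by comparing the spread of the two collections of estimates; this quick check is not in the notes, but it makes the comparison concrete:
sd(beta.vec)      # spread of the OLS slope estimates
sd(wls.beta.vec)  # spread of the WLS slope estimates; should be noticeably smaller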
As in the notes, WLS is clearly the more efficient estimator. \newpage
Simple example: suppose you’re trying to figure out the wage effect of being married, being left-handed, and being married and left-handed. You would want to run the following model:
\[ W_i = α + β_1 M_i + β_2 L_i + β_3 (M_i \times L_i) + ε_i \]
where $W_i$ is the wage, $M_i$ is a dummy equal to one if individual $i$ is married, and $L_i$ is a dummy equal to one if individual $i$ is left-handed. For example, suppose you were interested in the average wage for unmarried ($M_i = 0$), right-handed ($L_i = 0$) individuals; setting both dummies to zero leaves just $α$. The average wage for each group is:

Group | $M_i$ | $L_i$ | Average wage |
---|---|---|---|
Unmarried, right-handed | 0 | 0 | $α$ |
Married, right-handed | 1 | 0 | $α + β_1$ |
Unmarried, left-handed | 0 | 1 | $α + β_2$ |
Married, left-handed | 1 | 1 | $α + β_1 + β_2 + β_3$ |
Interpretation
But what about interpreting the coefficients themselves?

- $α$ : The average wage for the omitted category (unmarried, right-handed).
- $β_1$ : The wage “effect” of being married for right-handed people.
- $β_2$ : The wage “effect” of being left-handed for unmarried people.
- $β_3$ : The additional wage “effect” of being both married and left-handed, over and above $β_1 + β_2$.
This is where my brain usually starts to hurt. If you just care about the wage “effect” of being married, then you’re running the wrong regression, since this specification is designed to examine the heterogeneous “effects”. You could back out the average “effect” using the coefficients you have and the share of each group in your data, but it would be much easier to just run the specification without the interaction term (see the sketch below).
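To make the two specifications concrete, here is a minimal sketch with simulated data; the variable names (married, lefty) and the parameter values are made up purely for illustration:
set.seed(7)
n.sim <- 5000
married <- rbinom(n.sim, 1, 0.5)   # marriage dummy
lefty <- rbinom(n.sim, 1, 0.1)     # left-handedness dummy
wage.sim <- 10 + 2*married + 1*lefty + 0.5*married*lefty + rnorm(n.sim)
lm(wage.sim ~ married * lefty)     # heterogeneous specification: alpha, beta1, beta2, beta3
lm(wage.sim ~ married + lefty)     # no interaction: a single marriage coefficient
In an R formula, married * lefty expands to married + lefty + married:lefty, so the first call estimates exactly the model above.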
Why did you put “effect” in quotes?
Unless our data is from a randomized control trial in which we randomly assigned marriage and left-handedness to our subject population[fn:: Sounds like a fun experiment.], the coefficients we estimate aren’t true effects. With a run-of-the-mill observational data set, the coefficients just represent the correlation between wage and the dummy variables.
The purpose of this part of section is to estimate the returns to education using R. There is nothing causal about the results found here, but the empirical application gives us a chance to explore categorical dummies and to use the ggplot2 package again. First, we load the required libraries.
library(foreign)
library(xtable)
We can then read the wage data directly from the online repository of supplementary data sets for the Wooldridge (2002) text; you will need an internet connection. We only need the wage, educ, and age variables, and we drop all observations with missing values using na.omit().
f <- "http://fmwww.bc.edu/ec-p/data/wooldridge/wage2.dta"
data <- read.dta(f)
data <- data[ , c("wage", "educ", "age")]
data <- na.omit(data)
A quick visualization reveals the distribution of wages in the data set:
ggplot(data, aes(x=wage)) + geom_histogram(colour="black", fill="#FF6666", alpha=0.6)
Before we continue, I’ll introduce the %in% operator, since we’ll be using it shortly. The %in% operator generates a boolean vector[fn:: A boolean vector is a vector composed entirely of TRUE and/or FALSE values.] indicating whether each element of the variable that precedes it is contained within the variable that follows it. That’s a confusing explanation, so to make it more concrete, here’s an example:
4:12 %in% 1:30
c(5:7) %in% c(1,1,2,3,5,8,13)
c("green","blue","red") %in% c("blue","green")
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[1] TRUE FALSE FALSE
[1] TRUE TRUE FALSE
We can use %in% along with the ifelse() command to easily create dummy variables from variables that take on more than two values, like educ. We can now create a rough measure of educational attainment from the educ variable.
e1 <- ifelse(data$educ %in% 1:12, 1, 0)
e2 <- ifelse(data$educ %in% 13:14, 1, 0)
e3 <- ifelse(data$educ %in% 15:16, 1, 0)
e4 <- ifelse(data$educ %in% 17:999, 1, 0)
The categorical education variables sum to one for every observation, so together they are collinear with the intercept, and the lm() function will force-drop one of the variables (a quick check of the dummies follows).
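A quick sanity check, added here, confirms that the four dummies are exhaustive and mutually exclusive:
all(e1 + e2 + e3 + e4 == 1)   # should be TRUE: every observation falls in exactly one category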
xtable(coef(summary(lm(wage ~ 1 + e1 + e2 + e3 + e4, data = data))))
I snuck in xtable() to format the output from lm(). xtable() outputs R data in a beautified \LaTeX{} table format, and it’s sufficiently powerful to produce publication-quality tables[fn:: stargazer is an even more powerful package designed to create \LaTeX{} output.]. It works quite well for output from lm(), and we’ll see soon that with some tweaks it will work for the output from our own OLS() function as well.
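For more control over the table, xtable() accepts arguments such as caption and digits, and print() takes additional formatting options; here is a minimal sketch with argument values chosen arbitrarily for illustration:
tab <- xtable(coef(summary(lm(wage ~ 1 + e1 + e2 + e3 + e4, data = data))),
caption = "Mean wage by educational attainment", digits = 2)
print(tab, include.rownames = TRUE)   # see ?xtable and ?print.xtable for more options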
Note that the intercept in this regression reflects the mean wage of the e4 class. The other coefficients reflect the wages of the other three classes relative to e4. What if we run the same model using OLS()?
X <- cbind(1,e1,e2,e3,e4)
y <- data$wage
b <- OLS(y,X)[ , 1]
Error in solve.default(t(X) %*% X) (from #3) : system is computationally singular: reciprocal condition number = 1.53843e-18
Uh oh! What happened? OLS(), sadly, is not as smart as lm(), so it didn’t automatically drop any of our dummy variables. Since the dummies sum to a column vector of ones, we violated A2: X does not have full column rank (we can confirm this directly below). We have a couple of options here: first, we can try dropping the intercept.
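To see the rank deficiency directly, we can compare the rank of the design matrix to its number of columns (a quick check added here):
qr(X)$rank   # numerical column rank of X
ncol(X)      # number of columns; rank < ncol confirms that A2 fails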
\newpage
xtable(b_dropint <- OLS(y,X[ , 2:5]))
We can show that the coefficients are just the average wage within each dummy group. Think of each dummy here as forming its own intercept. Since there are no other covariates, each captures the average wage for all of the observations in that group. We can also choose to omit a different group than group 4. Here, we drop dummy group 3 and keep the intercept. What do the intercept and coefficients represent now?
xtable(b_drop3 <- OLS(y,X[ , c(1,2,3,5)]))
We could similarly compute the differences represented here by calculating the difference in mean wage across groups. For example, compare the mean wages of groups 3 and 2:
mean3 <- mean(data[e3 == 1, c("wage")]); mean2 <- mean(data[e2 == 1, c("wage")]);
cbind(mean3, mean2, mean3 - mean2)
     mean3    mean2
[1,] 1106.451 966.9877 139.4636
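More generally, each coefficient from the no-intercept regression above is just the mean wage within its dummy group, which we can verify directly (a quick check added here):
c(mean(y[e1 == 1]), mean(y[e2 == 1]), mean(y[e3 == 1]), mean(y[e4 == 1]))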
This equality only holds because there are no other covariates in the regression. If we condition on age, for example, then the simple addition no longer yields an average wage. For illustration, consider the previous regression with age and squared age as additional covariates.
wage <- data$wage; age <- data$age; age2 <- age^2; names(age2) <- "age^2"
xtable(OLS(wage,cbind(1,e2,e3,e4,age,age2)))
It looks like age has a positive but diminishing effect on wage. This makes sense, maybe, but the coefficients are not significantly different from zero. Why might this be the case? This is where some non-parametric graphing comes in handy.
\clearpage
(g <- ggplot(data, aes(x=age, y=wage)) + geom_smooth(method="loess", size=1.5))
We use the ggplot2 package instead of the base R plotting facilities. The plot reveals a reasonable relationship between wage and age, but there is a substantial amount of variation in wage relative to the narrow range of ages in the sample. We can see this by adding the actual data to the graph.
(g <- g + geom_point())
We will replicate the results of a sort of silly study that examines
the causes of McCarthyism, using a path model. The study was
published in the American Political Science Review by Gibson (1988)
and reprinted in Freedman (2009) to illustrate path models, which are
essentially a simple graphical framework to keep track of direct and
indirect causation. The path model is recreated in Figure
\ref{fig:path}, and shows that the tolerance scores of both the masses
and the elites directly impact repression. This is a theoretical
framework. The unlabeled connection between the tolerance nodes
indicates association rather than causation. The repression and
tolerance scores have been standardized, so that they have mean equal
to zero and standard deviation equal to one.
The purpose of this exercise is to build up an intuition of the relationship between the OLS estimates and covariate correlations.
\begin{figure}[t] \centering
\begin{picture}(150,150)(0,0)
\put(0,18){$\var{mass tolerance}$} \put(0,127){$\var{elite tolerance}$} \put(110,72){$\var{repression}$}
\put(10,30){\circle*{5}} \put(10,120){\circle*{5}} \put(100,75){\circle*{5}}
\thicklines
\put(10,30){\vector(2,1){87}} \put(10,120){\vector(2,-1){87}}
\thinlines
\qbezier(8,36)(0,75)(8,114)
\end{picture}
\caption{Path model, causes of McCarthyism, reproduced from Freedman (2009)}
\label{fig:path} \end{figure}
Gibson reports the correlation matrix for the path diagram:
 | Masses | Elite | Repress |
---|---|---|---|
Masses | 1.00 | 0.52 | -0.26 |
Elite | 0.52 | 1.00 | -0.42 |
Repress | -0.26 | -0.42 | 1.00 |
His model for political repression is \[ R_i = β_1 M_i + β_2 E_i + δ_i, \] where $R_i$ is the repression score and $M_i$ and $E_i$ are the mass and elite tolerance scores, respectively.
Here is the kicker. Since all variables were standardized, we know that \[ \frac{1}{n} \sum_{i=1}^n E_i = 0 \qquad \mbox{and} \qquad \frac{1}{n}\sum_{i=1}^n E_i^2 = 1. \]
This is true, also, for $M_i$ and $R_i$; and, for standardized variables, $\frac{1}{n}\sumi M_i E_i$ is just the correlation $r_{ME}$. Collecting the standardized tolerance scores into $\X = [\,M \;\; E\,]$, we then have
\begin{equation}
\Xp\X = \left[
\begin{array}{cc}
\sumi M_i^2 & \sumi M_i E_i \\
\sumi M_i E_i & \sumi E_i^2
\end{array}\right] =
n\left[
\begin{array}{cc}
1 & r_{ME} \\
r_{ME} & 1
\end{array}
\right]=
n\left[
\begin{array}{cc}
1 & 0.52 \\
0.52 & 1
\end{array}
\right]
\end{equation}
\begin{equation}
\Xp R = \left[
\begin{array}{c}
\sumi M_i R_i \\
\sumi E_i R_i
\end{array}\right] =
n\left[
\begin{array}{c}
r_{MR} \\
r_{ER}
\end{array}
\right]=
n\left[
\begin{array}{c}
-0.26 \\
-0.42
\end{array}
\right]
\end{equation}
We don’t even need the actual data to compute the OLS coefficients! Specifically, $β = (\Xp\X)^{-1}\Xp R$:
n <- 36
xtx <- n * matrix(c(1, 0.52, 0.52, 1), ncol = 2)
xtr <- n * matrix(c(-0.26, -0.42), ncol = 1)
(b <- solve(xtx) %*% xtr)
            [,1]
[1,] -0.05701754
[2,] -0.39035088
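Note that the factors of $n$ cancel in $(\Xp\X)^{-1}\Xp R$, so the correlations alone pin down the coefficients; as a quick check, dropping $n$ gives the same answer:
solve(matrix(c(1, 0.52, 0.52, 1), ncol = 2)) %*% matrix(c(-0.26, -0.42), ncol = 1)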
Given standardized tolerance and repression scores, we can use the following formula from page 85 in Freedman to calculate the model variance: $\sigsh = 1 - \hat{β}_1^2 - \hat{β}_2^2 - 2\hat{β}_1\hat{β}_2 r_{ME}$. The code below also applies a degrees-of-freedom correction of $n/(n-p)$.
p <- 3
(sigma.hat.sq <- (n / (n - p)) * (1 - b[1]^2 - b[2]^2 - 2 * b[1] * b[2] * 0.52))
[1] 0.8958852
There is an implicit intercept, since the scores are standardized, so that $p = 3$ (two slope coefficients plus the implicit intercept) when we compute the degrees of freedom for the standard errors and $t$ statistics:
vcov.mat <- sigma.hat.sq * solve(xtx)
se1 <- sqrt(vcov.mat[1,1])
se2 <- sqrt(vcov.mat[2,2])
pt(b[1]/se1, n - p)
pt(b[2]/se2, n - p)
[1] 0.3797346
[1] 0.02109715
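Note that pt() returns a lower-tail probability, so with negative $t$ statistics these are one-sided p-values; the corresponding two-sided p-values (a small addition, not in the original code) would be:
2 * pt(-abs(b[1] / se1), n - p)   # two-sided p-value, mass tolerance
2 * pt(-abs(b[2] / se2), n - p)   # two-sided p-value, elite tolerance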
The coefficient on elite tolerance is statistically distinguishable from zero at conventional levels, while the coefficient on mass tolerance is not. We can also test whether the two coefficients are equal. The relevant $F$ statistic is
\begin{equation} \label{eq:F} F = \frac{(\Rb \hat{β} - \r)′[\Rb(\Xp\X)^{-1}\Rbp]^{-1}(\Rb \hat{β} - \r)}{\sigsh} \end{equation}
R <- t(matrix(c(-1, 1))); r <- 0
G <- R %*% b - r
(F <- (t(G) %*% solve(R %*% solve(xtx) %*% t(R)) %*% G) / sigma.hat.sq)
         [,1]
[1,] 1.071566
The test statistic follows the $F$ distribution with $1$ and $n - p$ degrees of freedom under the null hypothesis. The statistic falls well below the 5% critical value, so we fail to reject the null that the coefficients on mass and elite tolerance are equal.
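To make the comparison explicit, we can compute the p-value of the test and the 5% critical value directly (a small addition, not output from the original notes):
pf(F, df1 = 1, df2 = n - p, lower.tail = FALSE)   # p-value of the F test
qf(0.95, df1 = 1, df2 = n - p)                    # 5% critical value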