Empirical example: returns to education \hfill ARE212: Section 05 \\ \hrule \bigskip
We’ll start section with some brief comments on the problem set and a plug for knitr. Then, we’ll look at the efficiency gains from a special case of generalized least squares using simulated data. Next we’ll take one more crack at dummy variable intuition and talk briefly about interaction terms. With remaining time, we’ll work through an empirical example to demonstrate the use of categorical dummies. Along the way, we’ll apply some more advanced tools of the trade, in particular using ggplot2 and exporting R tables in \LaTeX{} format.
In general, you guys did really well on the problem set. I’ve put the answer key on bSpace and will be handing your problem sets back today at the end of section. Many students found it useful to use \LaTeX{} and knitr (in RStudio) to format their assignments. This is great! The basic idea of knitr is that you can answer math or non-code questions using text or \LaTeX{} and then “knit” in your R code and/or results[fn:: sweave is similar — in fact, knitr is an update of sorts to sweave.] in the same file; this means that you don’t have to maintain separate files for your notes, results, and code. This process is a form of “Literate Programming”[fn:: A term coined originally by the computer scientist Donald Knuth.], and it can be very useful for developing research ideas, collaborating on projects, and, of course, writing problem sets. If you’re interested in learning how to use knitr, have a look at ps1_answers.rnw on bSpace — it’s the code Max and I used to generate the answer key.
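To give a flavor of what this looks like, a knitr (.rnw) file intersperses \LaTeX{} text with code chunks; the chunk below is a minimal, made-up example (the chunk name and options are arbitrary, not taken from the answer key):
<<simple-example, echo=TRUE>>=
x <- rnorm(100)   # simulate some data
mean(x)           # this result is knit directly into the compiled PDF
@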
This section will briefly outline two general concepts, GLS and ggplot. We will examine the characteristics of generalized least squares (GLS), and specifically the efficiency gains from a special case of GLS, weighted least squares (WLS). We will then roughly recreate the graphs from Figures 2.6 and 2.7 in the notes using ggplot2, a very popular graphing package in R.
Let $C$ be the weighting matrix with diagonal elements $1/\sqrt{w_i}$.[fn:: Weighting matrices that differ only by a constant multiple will also work. Since these are weighting matrices, this will not change our result.] The underlying data generating process in (2.102), which we simulate in the code below for a population of one million observations, is
\[ y_i = 0.5 + 1.5\,x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\ \sigma w_i), \qquad w_i = \frac{x_i^2}{100}, \qquad \sigma = 400. \]
set.seed(42)
OLS <- function(y,X) {
n <- nrow(X); k <- ncol(X)
b <- solve(t(X) %*% X) %*% t(X) %*% y; names(b) <- "Estimate" # b = (X'X)^{-1} X'y
e <- y - X %*% b; s2 <- as.numeric(t(e) %*% e / (n - k)); XpXinv <- solve(t(X) %*% X)
se <- sqrt(s2 * diag(XpXinv)); names(se) <- "Std. Error" # homoskedastic standard errors
return(data.frame(b,se))
}
pop.n <- 1000000
sigma <- 400
pop.x <- runif(pop.n, min = 0, max = 2000)
pop.w <- (1/100) * pop.x^2
pop.eps <- rnorm(pop.n, mean = 0, sd = sqrt(sigma * pop.w))
pop.y <- 0.5 + pop.x*1.5 + pop.eps
Note that rnorm() takes the standard deviation as its third argument, so we pass it the square root of the variance we want to generate. Now we’ll pull 1000 observations at random from our population — this is our sample. With these, we can calculate the standard OLS parameter vector $[\ah \hspace{6pt} \bh]′$ by noting that $[\ah \hspace{6pt} \bh]′ = (\Xp\X)^{-1}\Xp y$:
n <- 1000
indices <- sample(1:pop.n,n,replace=F)
x <- pop.x[indices]
y <- pop.y[indices]
X <- cbind(1, x) # add an intercept
b <- OLS(y,X)[ , 1]
print(b[2])
[1] 1.557292
The sample() function samples randomly from a vector, here without replacement. We create x by requesting a randomly sampled set of indices from pop.x. Let’s package this into a function, called rnd.beta, so that we can collect the OLS slope parameter for an arbitrary number of random samples.
rnd.beta <- function(i) {
indices <- sample(1:pop.n,n,replace=F)
x <- pop.x[indices]; y <- pop.y[indices]
X <- cbind(1, x) # add an intercept
b <- OLS(y,X)[ , 1]
return(b[2])
}
Since the supplied argument is ignored inside the function, each call simply draws a fresh random sample and returns an estimated slope coefficient:
rnd.beta()
rnd.beta()
[1] 1.638114
[1] 1.438083
We could do this in a for loop (like last time), but for pedagogical purposes and to be more R-ish, this time we’ll use sapply to apply the function to a list of indices (a for-loop version is sketched below for comparison). Now, replicating the process for B = 1000 random samples:
B <- 1000
beta.vec <- sapply(1:B, rnd.beta)
head(beta.vec)
mean(beta.vec)
[1] 1.574734 1.536203 1.594345 1.437183 1.399252 1.348865
[1] 1.498432
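For comparison, the same replication could be written with an explicit for loop; this is just an illustrative sketch that produces an equivalent vector of estimates:
beta.vec.loop <- rep(NA, B)        # pre-allocate storage
for (i in 1:B) {
beta.vec.loop[i] <- rnd.beta(i)    # one OLS slope per random sample
}
mean(beta.vec.loop)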
All right. Looking good. The average of the simulated estimates is much closer to the true slope of 1.5, suggesting that the distribution of the simulated parameters is centered on the true value; that is, OLS appears unbiased. Now, let’s create another, similar function that returns the WLS estimates. Note that we follow the notes exactly in creating C, which is a diagonal matrix with diagonal elements $1/\sqrt{w_i}$:
rnd.wls.beta <- function(i) {
indices <- sample(1:pop.n,n,replace=F)
x <- pop.x[indices]
y <- pop.y[indices]
w <- pop.w[indices]
C <- diag(1 / sqrt(w)) # equivalent to: C <- diag(10 / x)
y.wt <- C %*% y
X.wt <- C %*% cbind(1, x)
b.wt <- OLS(y.wt,X.wt)[ , 1]
return(b.wt[2])
}
wls.beta.vec <- sapply(1:B, rnd.wls.beta)
head(wls.beta.vec)
[1] 1.175240 1.569361 1.340202 2.207822 0.821469 1.822597
\newpage
We now have two collections of parameter estimates, one based on OLS and another based on WLS. It is straightforward to plot two separate histograms using R’s core histogram plotting function hist(). However, we can use this as a chance to introduce a more flexible, powerful graphing package called ggplot2.
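For reference, the base-R approach would be a pair of calls like the following (a minimal sketch; we use ggplot2 below instead):
hist(beta.vec, main = "OLS slope estimates", xlab = "beta")
hist(wls.beta.vec, main = "WLS slope estimates", xlab = "beta")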
library(ggplot2)
labels <- c(rep("ols", B), rep("wls", B))
data <- data.frame(beta=c(beta.vec, wls.beta.vec), method=labels)
ggplot(data, aes(x=beta, fill=method)) + geom_density(alpha=0.2)
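We can also quantify the efficiency gain by comparing the spread of the two collections of estimates; this quick check is not in the notes, but it makes the comparison concrete:
sd(beta.vec)      # spread of the OLS slope estimates
sd(wls.beta.vec)  # spread of the WLS slope estimates; should be noticeably smaller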
As in the notes, WLS is clearly the more efficient estimator. \newpage
Simple example: suppose you’re trying to figure out the wage effect of being married, being left-handed, and being married and left-handed. You would want to run the following model:
\[ W_i = α + β_1 M_i + β_2 L_i + β_3 (M_i \times L_i) + ε_i \]
where $W_i$ is the wage, $M_i$ is a dummy equal to one if individual $i$ is married, and $L_i$ is a dummy equal to one if individual $i$ is left-handed. For example, suppose you were interested in the average wage for unmarried ($M_i = 0$), right-handed ($L_i = 0$) individuals; setting both dummies to zero leaves just $α$. The average wage for each group is:

Group | $M_i$ | $L_i$ | Average wage |
---|---|---|---|
Unmarried, right-handed | 0 | 0 | $α$ |
Married, right-handed | 1 | 0 | $α + β_1$ |
Unmarried, left-handed | 0 | 1 | $α + β_2$ |
Married, left-handed | 1 | 1 | $α + β_1 + β_2 + β_3$ |
Interpretation
But what about interpreting the coefficients themselves?

- $α$ : The average wage for the omitted category (unmarried, right-handed).
- $β_1$ : The wage “effect” of being married for right-handed people.
- $β_2$ : The wage “effect” of being left-handed for unmarried people.
- $β_3$ : The additional wage “effect” of being both married and left-handed, over and above $β_1 + β_2$.
This is where my brain usually starts to hurt. If you just care about the wage “effect” of being married, then you’re running the wrong regression, since this specification is designed to examine the heterogeneous “effects”. You could back out the average “effect” using the coefficients you have and the share of each group in your data, but it would be much easier to just run the specification without the interaction term (see the sketch below).
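To make the two specifications concrete, here is a minimal sketch with simulated data; the variable names (married, lefty) and the parameter values are made up purely for illustration:
set.seed(7)
n.sim <- 5000
married <- rbinom(n.sim, 1, 0.5)   # marriage dummy
lefty <- rbinom(n.sim, 1, 0.1)     # left-handedness dummy
wage.sim <- 10 + 2*married + 1*lefty + 0.5*married*lefty + rnorm(n.sim)
lm(wage.sim ~ married * lefty)     # heterogeneous specification: alpha, beta1, beta2, beta3
lm(wage.sim ~ married + lefty)     # no interaction: a single marriage coefficient
In an R formula, married * lefty expands to married + lefty + married:lefty, so the first call estimates exactly the model above.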
Why did you put “effect” in quotes?
Unless our data is from a randomized control trial in which we randomly assigned marriage and left-handedness to our subject population[fn:: Sounds like a fun experiment.], the coefficients we estimate aren’t true effects. With a run-of-the-mill observational data set, the coefficients just represent the correlation between wage and the dummy variables.
The purpose of this part of section is to estimate the returns to education using R. There is nothing causal about the results found here, but the empirical application gives us a chance to explore categorical dummies and to use the ggplot2 package again. First, we load the required libraries.
library(foreign)
library(xtable)
We can then read the wage data directly from the online repository of supplementary data sets for the Wooldridge (2002) text; you will need an internet connection. We only need the wage, educ, and age variables, and we drop all observations with missing values using na.omit().
f <- "http://fmwww.bc.edu/ec-p/data/wooldridge/wage2.dta"
data <- read.dta(f)
data <- data[ , c("wage", "educ", "age")]
data <- na.omit(data)
A quick visualization reveals the distribution of wages in the data set:
ggplot(data, aes(x=wage)) + geom_histogram(colour="black", fill="#FF6666", alpha=0.6)
Before we continue, I’ll introduce the %in% operator, since we’ll be using it shortly. The %in% operator generates a boolean vector[fn:: A boolean vector is a vector composed entirely of TRUE and/or FALSE values.] indicating whether each element of the variable that precedes it is contained within the variable that follows it. That’s a confusing explanation, so to make it more concrete, here’s an example:
4:12 %in% 1:30
c(5:7) %in% c(1,1,2,3,5,8,13)
c("green","blue","red") %in% c("blue","green")
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[1] TRUE FALSE FALSE
[1] TRUE TRUE FALSE
We can use %in% along with the ifelse() command to easily create dummy variables from variables that take on more than two values, like educ. We can now create a rough measure of educational attainment from the educ variable.
e1 <- ifelse(data$educ %in% 1:12, 1, 0)
e2 <- ifelse(data$educ %in% 13:14, 1, 0)
e3 <- ifelse(data$educ %in% 15:16, 1, 0)
e4 <- ifelse(data$educ %in% 17:999, 1, 0)
The categorical education variables sum to one for every observation, so together they are collinear with the intercept, and the lm() function will force-drop one of the variables (a quick check of the dummies follows).
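A quick sanity check, added here, confirms that the four dummies are exhaustive and mutually exclusive:
all(e1 + e2 + e3 + e4 == 1)   # should be TRUE: every observation falls in exactly one category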
xtable(coef(summary(lm(wage ~ 1 + e1 + e2 + e3 + e4, data = data))))
I snuck in xtable() to format the output from lm(). xtable() outputs R data in a beautified \LaTeX{} table format, and it’s sufficiently powerful to produce publication-quality tables[fn:: stargazer is an even more powerful package designed to create \LaTeX{} output.]. It works quite well for output from lm(), and we’ll see soon that with some tweaks it will work for the output from our own OLS() function as well.
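For more control over the table, xtable() accepts arguments such as caption and digits, and print() takes additional formatting options; here is a minimal sketch with argument values chosen arbitrarily for illustration:
tab <- xtable(coef(summary(lm(wage ~ 1 + e1 + e2 + e3 + e4, data = data))),
caption = "Mean wage by educational attainment", digits = 2)
print(tab, include.rownames = TRUE)   # see ?xtable and ?print.xtable for more options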
Note that the intercept in this regression reflects the mean wage of the e4 class. The other coefficients reflect the wages of the other three classes relative to e4. What if we run the same model using OLS()?
X <- cbind(1,e1,e2,e3,e4)
y <- data$wage
b <- OLS(y,X)[ , 1]
Error in solve.default(t(X) %*% X) (from #3) : system is computationally singular: reciprocal condition number = 1.53843e-18
Uh oh! What happened? OLS(), sadly, is not as smart as lm(), so it didn’t automatically drop any of our dummy variables. Since the dummies sum to a column vector of ones, we violated A2: X does not have full column rank (we can confirm this directly below). We have a couple of options here: first, we can try dropping the intercept.
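To see the rank deficiency directly, we can compare the rank of the design matrix to its number of columns (a quick check added here):
qr(X)$rank   # numerical column rank of X
ncol(X)      # number of columns; rank < ncol confirms that A2 fails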
\newpage
xtable(b_dropint <- OLS(y,X[ , 2:5]))
We can show that the coefficients are just the average wage within each dummy group. Think of each dummy here as forming its own intercept. Since there are no other covariates, each captures the average wage for all of the observations in that group. We can also choose to omit a different group than group 4. Here, we drop dummy group 3 and keep the intercept. What do the intercept and coefficients represent now?
xtable(b_drop3 <- OLS(y,X[ , c(1,2,3,5)]))
We could similarly compute the differences represented here by calculating the difference in mean wage across groups. For example, compare the mean wages of groups 3 and 2:
mean3 <- mean(data[e3 == 1, c("wage")]); mean2 <- mean(data[e2 == 1, c("wage")]);
cbind(mean3, mean2, mean3 - mean2)
     mean3    mean2
[1,] 1106.451 966.9877 139.4636
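More generally, each coefficient from the no-intercept regression above is just the mean wage within its dummy group, which we can verify directly (a quick check added here):
c(mean(y[e1 == 1]), mean(y[e2 == 1]), mean(y[e3 == 1]), mean(y[e4 == 1]))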
This equality only holds because there are no other covariates in the regression. If we condition on age, for example, then the simple addition no longer yields an average wage. For illustration, consider the previous regression with age and squared age as additional covariates.
wage <- data$wage; age <- data$age; age2 <- age^2; names(age2) <- "age^2"
xtable(OLS(wage,cbind(1,e2,e3,e4,age,age2)))
It looks like age has a positive but diminishing effect on wage. This makes sense, maybe, but the coefficients are not significantly different from zero. Why might this be the case? This is where some non-parametric graphing comes in handy.
\clearpage
(g <- ggplot(data, aes(x=age, y=wage)) + geom_smooth(method="loess", size=1.5))
We use the ggplot2 package instead of the base R plotting facilities. The plot reveals a reasonable relationship between wage and age, but there is a substantial amount of variation in wage relative to the narrow range of ages in the sample. We can see this by adding the actual data to the graph.
(g <- g + geom_point())
We will replicate the results of a sort of silly study that examines
the causes of McCarthyism, using a path model. The study was
published in the American Political Science Review by Gibson (1988)
and reprinted in Freedman (2009) to illustrate path models, which are
essentially a simple graphical framework to keep track of direct and
indirect causation. The path model is recreated in Figure
\ref{fig:path}, and shows that the tolerance scores of both the masses
and the elites directly impact repression. This is a theoretical
framework. The unlabeled connection between the tolerance nodes
indicates association rather than causation. The repression and
tolerance scores have been standardized, so that they have mean equal
to zero and standard deviation equal to one.
The purpose of this exercise is to build up an intuition of the relationship between the OLS estimates and covariate correlations.
\begin{figure}[t] \centering
\begin{picture}(150,150)(0,0)
\put(0,18){$\var{mass tolerance}$} \put(0,127){$\var{elite tolerance}$} \put(110,72){$\var{repression}$}
\put(10,30){\circle*{5}} \put(10,120){\circle*{5}} \put(100,75){\circle*{5}}
\thicklines
\put(10,30){\vector(2,1){87}} \put(10,120){\vector(2,-1){87}}
\thinlines
\qbezier(8,36)(0,75)(8,114)
\end{picture}
\caption{Path model, causes of McCarthyism, reproduced from Freedman (2009)}
\label{fig:path} \end{figure}
Gibson reports the correlation matrix for the path diagram:
 | Masses | Elite | Repress |
---|---|---|---|
Masses | 1.00 | 0.52 | -0.26 |
Elite | 0.52 | 1.00 | -0.42 |
Repress | -0.26 | -0.42 | 1.00 |
His model for political repression is \[ R_i = β_1 M_i + β_2 E_i + δ_i, \] where $R_i$ is the repression score and $M_i$ and $E_i$ are the mass and elite tolerance scores, respectively.
Here is the kicker. Since all variables were standardized, we know that \[ \frac{1}{n} \sum_{i=1}^n E_i = 0 \qquad \mbox{and} \qquad \frac{1}{n}\sum_{i=1}^n E_i^2 = 1. \]
This is true, also, for $M_i$ and $R_i$; and, for standardized variables, $\frac{1}{n}\sumi M_i E_i$ is just the correlation $r_{ME}$. Collecting the standardized tolerance scores into $\X = [\,M \;\; E\,]$, we then have
\begin{equation}
\Xp\X = \left[
\begin{array}{cc}
\sumi M_i^2 & \sumi M_i E_i \\
\sumi M_i E_i & \sumi E_i^2
\end{array}\right] =
n\left[
\begin{array}{cc}
1 & r_{ME} \\
r_{ME} & 1
\end{array}
\right]=
n\left[
\begin{array}{cc}
1 & 0.52 \\
0.52 & 1
\end{array}
\right]
\end{equation}
\begin{equation}
\Xp R = \left[
\begin{array}{c}
\sumi M_i R_i \\
\sumi E_i R_i
\end{array}\right] =
n\left[
\begin{array}{c}
r_{MR} \\
r_{ER}
\end{array}
\right]=
n\left[
\begin{array}{c}
-0.26 \\
-0.42
\end{array}
\right]
\end{equation}
We don’t even need the actual data to compute the OLS coefficients! Specifically, $β = (\Xp\X)^{-1}\Xp R$:
n <- 36
xtx <- n * matrix(c(1, 0.52, 0.52, 1), ncol = 2)
xtr <- n * matrix(c(-0.26, -0.42), ncol = 1)
(b <- solve(xtx) %*% xtr)
            [,1]
[1,] -0.05701754
[2,] -0.39035088
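Note that the factors of $n$ cancel in $(\Xp\X)^{-1}\Xp R$, so the correlations alone pin down the coefficients; as a quick check, dropping $n$ gives the same answer:
solve(matrix(c(1, 0.52, 0.52, 1), ncol = 2)) %*% matrix(c(-0.26, -0.42), ncol = 1)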
Given standardized tolerance and repression scores, we can use the following formula from page 85 in Freedman to calculate the model variance: $\sigsh = 1 - \hat{β}_1^2 - \hat{β}_2^2 - 2\hat{β}_1\hat{β}_2 r_{ME}$. The code below also applies a degrees-of-freedom correction of $n/(n-p)$.
p <- 3
(sigma.hat.sq <- (n / (n - p)) * (1 - b[1]^2 - b[2]^2 - 2 * b[1] * b[2] * 0.52))
[1] 0.8958852
There is an implicit intercept, since the scores are standardized, so that $p = 3$ (two slope coefficients plus the implicit intercept) when we compute the degrees of freedom for the standard errors and $t$ statistics:
vcov.mat <- sigma.hat.sq * solve(xtx)
se1 <- sqrt(vcov.mat[1,1])
se2 <- sqrt(vcov.mat[2,2])
pt(b[1]/se1, n - p)
pt(b[2]/se2, n - p)
[1] 0.3797346
[1] 0.02109715
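Note that pt() returns a lower-tail probability, so with negative $t$ statistics these are one-sided p-values; the corresponding two-sided p-values (a small addition, not in the original code) would be:
2 * pt(-abs(b[1] / se1), n - p)   # two-sided p-value, mass tolerance
2 * pt(-abs(b[2] / se2), n - p)   # two-sided p-value, elite tolerance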
The coefficient on elite tolerance is statistically distinguishable from zero at conventional levels, while the coefficient on mass tolerance is not. We can also test whether the two coefficients are equal. The relevant $F$ statistic is
\begin{equation} \label{eq:F} F = \frac{(\Rb \hat{β} - \r)′[\Rb(\Xp\X)^{-1}\Rbp]^{-1}(\Rb \hat{β} - \r)}{\sigsh} \end{equation}
R <- t(matrix(c(-1, 1))); r <- 0
G <- R %*% b - r
(F <- (t(G) %*% solve(R %*% solve(xtx) %*% t(R)) %*% G) / sigma.hat.sq)
         [,1]
[1,] 1.071566
The test statistic follows the $F$ distribution with $1$ and $n - p$ degrees of freedom under the null hypothesis. The statistic falls well below the 5% critical value, so we fail to reject the null that the coefficients on mass and elite tolerance are equal.
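To make the comparison explicit, we can compute the p-value of the test and the 5% critical value directly (a small addition, not output from the original notes):
pf(F, df1 = 1, df2 = n - p, lower.tail = FALSE)   # p-value of the F test
qf(0.95, df1 = 1, df2 = n - p)                    # 5% critical value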