Skip to content

Latest commit

 

History

History
133 lines (92 loc) · 4.82 KB

salary.md

File metadata and controls

133 lines (92 loc) · 4.82 KB

Example: predictors of white-collar salaries

In this walk-through, we'll look at whether there seems to be a "wage gap" at a tech firm between male and female employees with similar qualifications. We will use multiple regression to adjust for the effect of education and experience in evaluating the correlation between an employee's sex and his or her annual salary.

Learning goals:

  • fit a multiple regression model
  • correctly interpret the estimated coefficients
  • quantify uncertainty about parameters in a multiple-regression model using bootstrapping

Data files:

  • salary.csv: human-resources data on employees at a tech firm.

First load the mosaic library and read in the data.

library(mosaic)

The variables we'll use from this data set are:

  • Salary: annual salary in dollars
  • Experience: months of experience at the particular company
  • Months: total months of work experience, including all previous jobs
  • Sex: whether the employee is male or female

Let's first Look at the distibution of salary by sex.

mean(Salary~Sex,data=salary)

##        0        1 
## 62610.45 59381.90

boxplot(Salary~Sex,data=salary, names=c("Female", "Male"))

Upon first glance, it looks as though women are paid more at this company than men, on average.

Statistical adjustment for experience.

However, does the story change if we adjust for work experience?

plot(Salary~Experience, data=salary)

lm1 = lm(Salary~Experience, data=salary)
coef(lm1)

## (Intercept)  Experience 
##  52516.6821    361.5327

We expect experienced workers to be paid more, all else being equal. How do these residuals---that is, salary adjusted for experience---look when we stratify them by sex?

boxplot(resid(lm1)~salary$Sex)

Now it looks like men are being paid more than women for an equivalent amount of work experience, since men have a positive residual, on average. The story is similar if we look at overall work experience, including jobs prior to the one with this particular company:

plot(Salary~Months, data=salary)

lm2 = lm(Salary~Months, data=salary)
coef(lm2)

## (Intercept)      Months 
##  44807.1515    277.8743

The story in the residuals is similar: the distribution of adjusted salaries for men is shifted upward compared to that for women.

boxplot(resid(lm2)~salary$Sex)

Fitting a multiple regression model by least squares

To get at the partial relationship between gender and salary, we must fit multiple-regression model that accounts for experience with the company and total number of months of professional work. We will also adjust for a third variable: years of post-secondary education. It is straightforward to fit such a model by least squares in R.

lm3 = lm(Salary ~ Experience + Months + Education + Sex, data=salary)
coef(lm3)

## (Intercept)  Experience      Months   Education         Sex 
##  39305.7117    122.2467    263.5782    591.0780   2320.5438

According to this model, men are paid $2320 more per year than women with similar levels of education and work experience, both overall and with this particular company.

Bootstrapping a multiple regression model

We can quantify our uncertainty about this effect via bootstrapping:

boot3 = do(5000)*{
  lm(Salary~Experience+Months+Education+Sex, data=resample(salary))
}
hist(boot3$Sex)

confint(boot3)

##         name         lower        upper level     method     estimate
## 1  Intercept 35027.9930011 4.478963e+04  0.95 percentile 3.930571e+04
## 2 Experience    42.7356224 1.949082e+02  0.95 percentile 1.222467e+02
## 3     Months   237.4831449 2.908904e+02  0.95 percentile 2.635782e+02
## 4  Education  -725.1925067 1.462953e+03  0.95 percentile 5.910780e+02
## 5        Sex   162.2853888 4.357776e+03  0.95 percentile 2.320544e+03
## 6      sigma  1969.3240366 2.955173e+03  0.95 percentile 2.672043e+03
## 7  r.squared     0.9164583 9.668175e-01  0.95 percentile 9.380802e-01
## 8          F   104.2157427 2.767955e+02  0.95 percentile 1.439242e+02

In this case, the bootstrapped confidence interval runs from about $200 to about $4300. (You'll get slightly different confidence intervals than shown here, because of the Monte Carlo variability inherent to bootstrapping.) This is quite a wide range: we cannot rule out that the wage gap is quite small, but nor can we rule out that it might run into the thousands of dollars.