---
title: "Linear Regression Functions in R"
output: html_document
---
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { extensions: ["autoload-all.js"] }
});
</script>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This tutorial uses the following libraries,
```{r message = FALSE}
library("ggplot2")
library("dplyr")
library("car")
library("broom")
```
## Regression
### lm
This example makes use of the [Duncan Occupational Prestige](http://www.rdocumentation.org/packages/car/functions/Duncan) dataset included in the [car](https://cran.r-project.org/web/packages/car/index.html) package.
The data come from a classic sociology paper and record the prestige and other characteristics of 45 U.S. occupations in 1950.
```{r}
data("Duncan", package = "car")
```
The dataset `Duncan` contains four variables: `type`, `income`, `education`, and `prestige`,
```{r}
glimpse(Duncan)
```
You run a regression in R using the function `lm`.
This runs a linear regression of occupational prestige on income,
```{r}
lm(prestige ~ income, data = Duncan)
```
This estimates the linear regression
$$
\mathtt{prestige} = \beta_0 + \beta_1 \mathtt{income} + \epsilon
$$
In R, $\beta_0$ is named `(Intercept)`, and the other coefficients are named after the associated predictor.
The function `lm` returns an `lm` object that can be used in future computations.
Instead of printing the regression result to the screen, save it to the variable `mod1`,
```{r}
mod1 <- lm(prestige ~ income, data = Duncan)
```
We can print this object
```{r}
print(mod1)
```
Somewhat counterintuitively, the `summary` function returns **more** information about a regression,
```{r}
summary(mod1)
```
The `summary` function also returns an object that we can use later,
```{r}
summary_mod1 <- summary(mod1)
summary_mod1
```
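The summary object is a list; its components are documented in `help("summary.lm")`. For example, we can extract the coefficient table, $R^2$, and the residual standard error directly,
```{r}
summary_mod1$coefficients  # estimates, std. errors, t-values, p-values
summary_mod1$r.squared     # R-squared
summary_mod1$sigma         # residual standard error
```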
Now let's estimate a multiple linear regression,
```{r}
mod2 <- lm(prestige ~ income + education + type, data = Duncan)
mod2
```
The first argument of `lm` is a *formula*: the response goes on the left of the `~` and the predictors, separated by `+`, go on the right; an intercept is included automatically. Other useful operators inside formulas: `:` specifies an interaction, `*` crosses variables (main effects plus their interactions), `-` removes terms, and `I()` protects arithmetic so that it is interpreted literally rather than as formula syntax.
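For example, a few common patterns with the Duncan data,
```{r}
# Main effects only
lm(prestige ~ income + education, data = Duncan)
# Main effects of income and type, plus their interaction
lm(prestige ~ income * type, data = Duncan)
# A quadratic term: inside I(), ^ is arithmetic squaring, not formula crossing
lm(prestige ~ income + I(income^2), data = Duncan)
```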
### Coefficients, Standard errors
Coefficients, $\hat{\boldsymbol{\beta}}$:
```{r}
coef(mod2)
```
Variance-covariance matrix of the coefficients, $\operatorname{Var}(\hat{\boldsymbol{\beta}})$:
```{r}
vcov(mod2)
```
The standard errors of the coefficients, $\operatorname{se}(\hat{\boldsymbol{\beta}})$, are the square roots of the diagonal of the `vcov` matrix,
```{r}
sqrt(diag(vcov(mod2)))
```
This can be confirmed by comparing their values to those in the summary table,
```{r}
summary(mod2)
```
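We can also check the match programmatically: `coef(summary(mod2))` returns the coefficient table as a matrix, whose `"Std. Error"` column should equal the square roots of the diagonal of `vcov`,
```{r}
all.equal(sqrt(diag(vcov(mod2))),
          coef(summary(mod2))[, "Std. Error"])
```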
### Residuals and Fitted Values
To get the fitted or predicted values ($\hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol\beta}$) from a regression,
```{r}
mod1_fitted <- fitted(mod1)
head(mod1_fitted)
```
or
```{r}
mod1_predict <- predict(mod1)
head(mod1_predict)
```
The difference between `fitted` and `predict` is that `fitted` only returns fitted values for the data used to estimate the model, while `predict` can also compute predicted values for new data supplied through its `newdata` argument.
For example, create a data frame with each category of `type`, and in which `income` and `education` are set to their mean values.
```{r}
Duncan_at_means <-
data.frame(type = unique(Duncan$type),
income = mean(Duncan$income),
education = mean(Duncan$education))
Duncan_at_means
```
Now use this with the `newdata` argument,
```{r}
predict(mod2, newdata = Duncan_at_means)
```
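`predict` can also return standard errors or confidence intervals for the predictions; see `help("predict.lm")` for all the options,
```{r}
# Standard errors of the predicted values
predict(mod2, newdata = Duncan_at_means, se.fit = TRUE)
# 95% confidence intervals for the expected values
predict(mod2, newdata = Duncan_at_means, interval = "confidence")
```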
To get the residuals ($\hat{\boldsymbol{\epsilon}} = \mathbf{y} - \hat{\mathbf{y}}$),
```{r}
mod1_resid <- residuals(mod1)
head(mod1_resid)
```
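As a sanity check, the residuals are the observed responses minus the fitted values,
```{r}
all.equal(unname(residuals(mod1)),
          unname(Duncan$prestige - fitted(mod1)))
```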
### Broom
The **broom** package has functions that reformat the results of statistical modeling functions (`t.test`, `lm`, etc.) into data frames that work more smoothly with **ggplot2**, **dplyr**, and friends.
The **broom** package has three main functions:
- `glance`: Information about the model.
- `tidy`: Information about the estimated parameters.
- `augment`: The original data with estimates from the model added.

`glance`: always returns a one-row data frame that is a summary of the model: e.g. $R^2$, adjusted $R^2$, etc.
```{r}
glance(mod2)
```
`tidy`: transforms the coefficients, standard errors, test statistics, and p-values (and confidence intervals, if requested) into a ready-to-use data frame,
```{r}
tidy(mod2)
```
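Because `tidy` returns an ordinary data frame, its output is easy to plot with **ggplot2**. For example, a minimal sketch of a coefficient plot, using the rough approximation of the estimate plus or minus two standard errors for 95% confidence intervals,
```{r}
tidy(mod2) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = term, y = estimate,
             ymin = estimate - 2 * std.error,
             ymax = estimate + 2 * std.error)) +
  geom_pointrange() +
  coord_flip() +
  theme_minimal()
```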
`augment`: adds columns to the original data that was modeled, including predictions, standard errors of the predictions, residuals, and others,
```{r}
augment(mod2) %>% head()
```
- `.fitted`: the model predictions for all observations
- `.se.fit`: the standard errors of the predictions
- `.resid`: the residuals of the predictions (actual minus predicted values)
- `.sigma`: the estimate of the residual standard deviation when the corresponding observation is dropped from the model

The other columns, `.hat`, `.cooksd`, and `.std.resid`, are used in regression diagnostics.
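For example, we can sort observations by Cook's distance to flag potentially influential ones,
```{r}
augment(mod2) %>%
  arrange(desc(.cooksd)) %>%
  head(3)
```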
### Plotting Fitted Regression Results
Consider the regression of prestige on income,
```{r}
mod3 <- lm(prestige ~ income, data = Duncan)
```
The following creates a new dataset with a single column, `income`, containing 100 evenly spaced values between the minimum and maximum incomes observed in the Duncan data.
```{r}
mod3_newdata <- tibble(income = seq(min(Duncan$income), max(Duncan$income),
                                    length.out = 100))
```
We then calculate fitted values for all these values of `income` and plot the fitted line over the original data,
```{r}
ggplot() +
geom_point(data = Duncan,
mapping = aes(x = income, y = prestige), colour = "gray75") +
geom_line(data = augment(mod3, newdata = mod3_newdata),
mapping = aes(x = income, y = .fitted)) +
ylab("Prestige") +
xlab("Income") +
theme_minimal()
```
Now plot something similar, but for a regression with `income` interacted with `type`,
```{r}
mod4 <- lm(prestige ~ income * type, data = Duncan)
```
We want to create a dataset that has (1) each value of `type` in the Duncan data, and (2) values spanning the range of `income` in the Duncan data.
The function `expand.grid` creates a data frame with all combinations of vectors given to it (Cartesian product).
```{r}
mod4_newdata <- expand.grid(income = seq(min(Duncan$income), max(Duncan$income),
                                         length.out = 100),
                            type = unique(Duncan$type))
```
Now plot the fitted values evaluated at each of these values along with the original values in the data,
```{r}
ggplot() +
geom_point(data = Duncan,
mapping = aes(x = income, y = prestige, color = type)) +
geom_line(data = augment(mod4, newdata = mod4_newdata),
mapping = aes(x = income, y = .fitted, color = type)) +
ylab("Prestige") +
xlab("Income") +
theme_minimal()
```
Running `geom_smooth` with `method = "lm"` gives similar results.
However, note that `geom_smooth` will run a **separate** regression for each group, rather than a single model with interaction terms.
```{r}
ggplot(data = Duncan, aes(x = income, y = prestige, color = type)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ylab("Prestige") +
xlab("Income") +
theme_minimal()
```
### Plotting Residuals
A common diagnostic is to plot the residuals against the fitted values: under the classical assumptions, the residuals should scatter around zero with no visible pattern. Here is a minimal sketch using `augment`,
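```{r}
ggplot(data = augment(mod2),
       mapping = aes(x = .fitted, y = .resid)) +
  geom_hline(yintercept = 0, colour = "gray75") +
  geom_point() +
  ylab("Residual") +
  xlab("Fitted value") +
  theme_minimal()
```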