Logistic regression and gradient descent
gpleiss committed Oct 17, 2024
1 parent ea1b16d commit 2d5e51d
Showing 17 changed files with 2,701 additions and 2,259 deletions.

2 changes: 1 addition & 1 deletion _freeze/site_libs/revealjs/dist/theme/quarto.css


228 changes: 140 additions & 88 deletions schedule/slides/00-gradient-descent.qmd
@@ -7,12 +7,35 @@ metadata-files:

{{< include _titleslide.qmd >}}

## Motivation: maximum likelihood estimation as optimization

By the principle of maximum likelihood, we have that

$$
\begin{align*}
\hat \beta
&= \argmax_{\beta} \prod_{i=1}^n \P(Y_i \mid X_i)
\\
&= \argmin_{\beta} \sum_{i=1}^n -\log\P(Y_i \mid X_i)
\end{align*}
$$

Under the model we use for logistic regression...
$$
\begin{gathered}
\P(Y=1 \mid X=x) = h(\beta^\top x), \qquad \P(Y=0 \mid X=x) = h(-\beta^\top x),
\\
h(z) = \tfrac{1}{1+e^{-z}}
\end{gathered}
$$

... we can't simply find the argmin with algebra.
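
For a concrete feel, here is a minimal sketch of that objective in R. It assumes the single-coefficient model $\P(Y=1 \mid X=x) = h(\beta x)$ that we simulate from later in these slides; the names `neg_loglik`, `x_demo`, and `y_demo` are purely illustrative. A numerical optimizer finds the argmin even though algebra can't:

```{r nll-sketch}
# Sketch: the negative log likelihood has no closed-form argmin,
# but it is easy to evaluate and to minimize numerically.
logit <- function(z) 1 / (1 + exp(-z)) # h(z) = 1 / (1 + e^{-z})
set.seed(1)
x_demo <- runif(50, -5, 5)
y_demo <- rbinom(50, 1, logit(2 * x_demo)) # true coefficient = 2
neg_loglik <- function(beta) { # sum_i -log P_beta(Y_i | X_i)
  p <- logit(beta * x_demo) # P_beta(Y_i = 1 | X_i)
  -sum(y_demo * log(p) + (1 - y_demo) * log(1 - p))
}
optimize(neg_loglik, interval = c(-5, 5))$minimum
```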

## Gradient descent: the workhorse optimization algorithm

We'll see "gradient descent" a few times:

1. solves logistic regression
1. gradient boosting
1. neural networks

@@ -278,18 +301,19 @@ Check any / all of

## Stochastic gradient descent

If we're solving
$\argmin_\beta \sum_{i=1}^n -\log P_\beta(Y_i \mid X_i)$,
then the gradient also has an additive form:
$$
\sum_{i=1}^n \frac{\partial}{\partial \beta} \left[-\log P_\beta(Y_i \mid X_i) \right]
$$

If $n$ is really big, it may take a long time to compute this gradient

So, just sample a subset of our data $\mathcal{M} \subset \{ (X_i, Y_i)\}_{i=1}^n$
and approximate:
$$\sum_{i=1}^n \frac{\partial}{\partial \beta} \left[-\log P_\beta(Y_i \mid X_i) \right] \approx \frac{n}{\vert \mathcal M \vert}\sum_{i\in\mathcal{M}} \frac{\partial}{\partial \beta}\left[-\log P_\beta(Y_i \mid X_i) \right]$$
[(Justification: Law of Large Numbers. Use a sample to approximate the population.)]{.secondary}
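
For concreteness, a rough sketch of a minibatch SGD loop for this objective, assuming the same single-coefficient logistic model used in these slides; the function name `sgd_logistic`, the batch size, and the step size are illustrative choices, and the minibatch mean simply rescales the sum above by a constant that can be absorbed into $\gamma$:

```{r sgd-sketch}
# Each step estimates the full-data gradient from a random minibatch M,
# then takes a gradient step with that estimate.
sgd_logistic <- function(x, y, beta0 = 0, gamma = 0.1,
                         batch_size = 10, n_steps = 200) {
  beta <- beta0
  n <- length(y)
  for (step in seq_len(n_steps)) {
    M <- sample(n, batch_size) # indices of this step's minibatch
    p <- 1 / (1 + exp(-beta * x[M])) # P_beta(Y = 1 | X) on the minibatch
    beta <- beta - gamma * mean(-x[M] * (y[M] - p)) # approximate gradient step
  }
  beta
}
```

Once the data `x` and `y` are simulated later in the deck, something like `sgd_logistic(x, y)` runs this loop.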

## SGD

@@ -316,12 +340,19 @@ This is the workhorse for neural network optimization

## Gradient descent for logistic regression

$$
\begin{gathered}
\P(Y=1 \mid X=x) = h(\beta^\top x), \qquad \P(Y=0 \mid X=x) = h(-\beta^\top x),
\\
\\
h(z) = \tfrac{1}{1+e^{-z}}
\end{gathered}
$$

<!-- I want to model $P(Y=1| X=x)$. -->

<!-- I'll assume that $\log\left(\frac{p(x)}{1-p(x)}\right) = ax$ for some scalar $a$. This means that -->
<!-- $p(x) = \frac{\exp(ax)}{1+\exp(ax)} = \frac{1}{1+\exp(-ax)}$ -->

. . .

@@ -330,107 +361,133 @@
#| fig-width: 6
#| fig-height: 4
n <- 100
beta <- 2
x <- runif(n, -5, 5)
logit <- function(x) 1 / (1 + exp(-x))
p <- logit(beta * x)
y <- rbinom(n, 1, p)
df <- tibble(x, y)
ggplot(df, aes(x, y)) +
  geom_point(colour = "cornflowerblue") +
  stat_function(fun = ~ logit(beta * .x))
```



---

$$
\P(Y=1 \mid X=x) = h(\beta^\top x), \qquad \P(Y=0 \mid X=x) = h(-\beta^\top x)
$$

Under maximum likelihood

$$
\hat \beta = \argmin_{\beta} \underbrace{
\textstyle{\sum_{i=1}^n - \log( \P_\beta(Y_i=y_i \mid X_i=x_i) )}
}_{:= -\ell(\beta)}
$$

---

$$
\begin{align*}
\P_\beta(Y_i=y_i \mid X_i=x_i) &= h\left( [-1]^{1 - y_i} \beta^\top x_i \right)
\\
\\
-\ell(\beta) &= \sum_{i=1}^n -\log\left( \P_\beta(Y_i=y_i \mid X_i=x_i) \right)
\\
&= \sum_{i=1}^n \log\left( 1 + \exp\left( [-1]^{y_i} \beta^\top x_i \right) \right)
\\
\\
-\frac{\partial \ell}{\partial \beta}
&= \sum_{i=1}^n [-1]^{y_i} x_i \frac{\exp\left( [-1]^{y_i} \beta^\top x_i \right)}{1 + \exp\left( [-1]^{y_i} \beta^\top x_i \right)}
\\
&= -\sum_{i=1}^n x_i \left( y_i - \P_\beta(Y_i=1 \mid X_i=x_i) \right)
\end{align*}
$$
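
A quick numerical sanity check of that last line, as a sketch with made-up data; `nll`, `analytic`, and `num_grad` are illustrative helpers, and the finite-difference step is arbitrary:

```{r grad-check}
# Compare the derived gradient, -sum_i x_i (y_i - P_beta(Y_i = 1 | X_i = x_i)),
# with a central finite-difference derivative of the negative log likelihood.
set.seed(42)
x_chk <- runif(20, -2, 2)
y_chk <- rbinom(20, 1, 1 / (1 + exp(-2 * x_chk)))
nll <- function(b) sum(log(1 + exp((-1)^y_chk * b * x_chk)))
analytic <- function(b) -sum(x_chk * (y_chk - 1 / (1 + exp(-b * x_chk))))
num_grad <- function(b, h = 1e-6) (nll(b + h) - nll(b - h)) / (2 * h)
c(analytic = analytic(0.7), numeric = num_grad(0.7))
```

The two numbers should agree to several decimal places.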

<!-- $$ -->
<!-- L(y | a, x) = \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and } -->
<!-- p(x) = \frac{1}{1+\exp(-ax)} -->
<!-- $$ -->


<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \ell(y | a, x) &= \log \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i} -->
<!-- = \sum_{i=1}^n y_i\log p(x_i) + (1-y_i)\log(1-p(x_i))\\ -->
<!-- &= \sum_{i=1}^n\log(1-p(x_i)) + y_i\log\left(\frac{p(x_i)}{1-p(x_i)}\right)\\ -->
<!-- &=\sum_{i=1}^n ax_i y_i + \log\left(1-p(x_i)\right)\\ -->
<!-- &=\sum_{i=1}^n ax_i y_i + \log\left(\frac{1}{1+\exp(ax_i)}\right) -->
<!-- \end{aligned} -->
<!-- $$ -->

<!-- ## Reminder: the likelihood -->

<!-- $$ -->
<!-- L(y | a, x) = \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and } -->
<!-- p(x) = \frac{1}{1+\exp(-ax)} -->
<!-- $$ -->

<!-- Now, we want the negative of this. Why? -->

<!-- We would maximize the likelihood/log-likelihood, so we minimize the negative likelihood/log-likelihood (and scale by $1/n$) -->

<!-- $$-\ell(y | a, x) = \frac{1}{n}\sum_{i=1}^n -ax_i y_i - \log\left(\frac{1}{1+\exp(ax_i)}\right)$$ -->

<!-- ## Reminder: the likelihood -->

<!-- $$ -->
<!-- \frac{1}{n}L(y | a, x) = \frac{1}{n}\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and } -->
<!-- p(x) = \frac{1}{1+\exp(-ax)} -->
<!-- $$ -->

<!-- This is, in the notation of our slides $f(a)$. -->

<!-- We want to minimize it in $a$ by gradient descent. -->

<!-- So we need the derivative with respect to $a$: $f'(a)$. -->

<!-- Now, conveniently, this simplifies a lot. -->


<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \frac{d}{d a} f(a) &= \frac{1}{n}\sum_{i=1}^n -x_i y_i - \left(-\frac{x_i \exp(ax_i)}{1+\exp(ax_i)}\right)\\ -->
<!-- &=\frac{1}{n}\sum_{i=1}^n -x_i y_i + p(x_i)x_i = \frac{1}{n}\sum_{i=1}^n -x_i(y_i-p(x_i)). -->
<!-- \end{aligned} -->
<!-- $$ -->

<!-- ## Reminder: the likelihood -->

<!-- $$ -->
<!-- \frac{1}{n}L(y | a, x) = \frac{1}{n}\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and } -->
<!-- p(x) = \frac{1}{1+\exp(-ax)} -->
<!-- $$ -->

---

## Finding $\hat\beta = \argmin_{\beta} -\ell(\beta)$ with gradient descent:

1. Input $\beta_0,\ \gamma>0,\ j_\max,\ \epsilon>0,\ \tfrac{d}{d\beta}(-\ell)$.
2. For $j=1,\ 2,\ \ldots,\ j_\max$,
$$\beta_j = \beta_{j-1} - \gamma\, \tfrac{d}{d\beta}\left(-\ell(\beta_{j-1})\right)$$
3. Stop if $\epsilon > |\beta_j - \beta_{j-1}|$ or $|d\ell / d\beta| < \epsilon$.


```{r betamle}
beta.mle <- function(x, y, beta0, gamma = 0.5, jmax = 50, eps = 1e-6) {
  beta <- double(jmax) # place to hold stuff (always preallocate space)
  beta[1] <- beta0 # starting value
  for (j in 2:jmax) { # avoid possibly infinite while loops
    px <- logit(beta[j - 1] * x)
    grad <- mean(-x * (y - px))
    beta[j] <- beta[j - 1] - gamma * grad
    if (abs(grad) < eps || abs(beta[j] - beta[j - 1]) < eps) break
  }
  beta[1:j]
}
```

@@ -439,23 +496,18 @@
## Try it:

```{r ourmle1}
too_big <- beta.mle(x, y, beta0 = 5, gamma = 50)
too_small <- beta.mle(x, y, beta0 = 5, gamma = 1)
just_right <- beta.mle(x, y, beta0 = 5, gamma = 10)
```
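
As a sanity check, the final iterate should roughly match the maximum likelihood estimate from R's built-in `glm()` for the same intercept-free logistic model:

```{r glm-check}
# glm() with family = binomial fits the same model by maximum likelihood;
# `y ~ x - 1` drops the intercept to match P(Y = 1 | X = x) = h(beta * x).
c(
  gradient_descent = unname(tail(just_right, 1)),
  glm = unname(coef(glm(y ~ x - 1, family = binomial)))
)
```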


## Visual



```{r}
#| output-location: column
#| fig-width: 7
#| fig-height: 7
negll <- function(beta) {
  -beta * mean(x * y) -
    rowMeans(log(1 / (1 + exp(outer(beta, x)))))
}
blah <- list_rbind(
map(
Expand All @@ -470,7 +522,7 @@ ggplot(blah, aes(value, negll)) +
  facet_wrap(~gamma, ncol = 1) +
  stat_function(fun = negll, xlim = c(-2.5, 5)) +
  scale_y_log10() +
  xlab("beta") +
  ylab("negative log likelihood") +
  geom_vline(xintercept = tail(just_right, 1)) +
  scale_colour_brewer(palette = "Set1") +