Logistic regression and gradient descent
gpleiss committed Oct 17, 2024
1 parent ea1b16d commit 2d5e51d
Showing 17 changed files with 2,701 additions and 2,259 deletions.

2 changes: 1 addition & 1 deletion _freeze/site_libs/revealjs/dist/theme/quarto.css


228 changes: 140 additions & 88 deletions schedule/slides/00-gradient-descent.qmd
@@ -7,12 +7,35 @@ metadata-files:

{{< include _titleslide.qmd >}}

## Motivation: maximum likelihood estimation as optimization

By the principle of maximum likelihood, we have that

$$
\begin{align*}
\hat \beta
&= \argmax_{\beta} \prod_{i=1}^n \P(Y_i \mid X_i)
\\
&= \argmin_{\beta} \sum_{i=1}^n -\log\P(Y_i \mid X_i)
\end{align*}
$$

Under the model we use for logistic regression...
$$
\begin{gathered}
\P(Y=1 \mid X=x) = h(\beta^\top x), \qquad \P(Y=0 \mid X=x) = h(-\beta^\top x),
\\
h(z) = \tfrac{1}{1+e^{-z}}
\end{gathered}
$$

... we can't simply find the argmin with algebra.
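
For a concrete feel, here is a minimal sketch of that objective in R. It assumes the single-coefficient model $\P(Y=1 \mid X=x) = h(\beta x)$ that we simulate from later in these slides; the names `neg_loglik`, `x_demo`, and `y_demo` are purely illustrative. A numerical optimizer finds the argmin even though algebra can't:

```{r nll-sketch}
# Sketch: the negative log likelihood has no closed-form argmin,
# but it is easy to evaluate and to minimize numerically.
logit <- function(z) 1 / (1 + exp(-z)) # h(z) = 1 / (1 + e^{-z})
set.seed(1)
x_demo <- runif(50, -5, 5)
y_demo <- rbinom(50, 1, logit(2 * x_demo)) # true coefficient = 2
neg_loglik <- function(beta) { # sum_i -log P_beta(Y_i | X_i)
  p <- logit(beta * x_demo) # P_beta(Y_i = 1 | X_i)
  -sum(y_demo * log(p) + (1 - y_demo) * log(1 - p))
}
optimize(neg_loglik, interval = c(-5, 5))$minimum
```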

## Gradient descent: the workhorse optimization algorithm

We'll see "gradient descent" a few times:

1. solves logistic regression
1. gradient boosting
1. neural networks

@@ -278,18 +301,19 @@ Check any / all of

## Stochastic gradient descent

If we're solving
$\argmin_\beta \sum_{i=1}^n -\log P_\beta(Y_i \mid X_i)$,
then the gradient also has an additive form:
$$
\sum_{i=1}^n \frac{\partial}{\partial \beta} \left[-\log P_\beta(Y_i \mid X_i) \right]
$$

If $n$ is really big, it may take a long time to compute this gradient

So, just sample a subset of our data $\mathcal{M} \subset \{ (X_i, Y_i)\}_{i=1}^n$
and approximate:
$$\sum_{i=1}^n \frac{\partial}{\partial \beta} \left[-\log P_\beta(Y_i \mid X_i) \right] \approx \frac{n}{\vert \mathcal M \vert}\sum_{i\in\mathcal{M}} \frac{\partial}{\partial \beta}\left[-\log P_\beta(Y_i \mid X_i) \right]$$
[(Justification: Law of Large Numbers. Use a sample to approximate the population.)]{.secondary}
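
For concreteness, a rough sketch of a minibatch SGD loop for this objective, assuming the same single-coefficient logistic model used in these slides; the function name `sgd_logistic`, the batch size, and the step size are illustrative choices, and the minibatch mean simply rescales the sum above by a constant that can be absorbed into $\gamma$:

```{r sgd-sketch}
# Each step estimates the full-data gradient from a random minibatch M,
# then takes a gradient step with that estimate.
sgd_logistic <- function(x, y, beta0 = 0, gamma = 0.1,
                         batch_size = 10, n_steps = 200) {
  beta <- beta0
  n <- length(y)
  for (step in seq_len(n_steps)) {
    M <- sample(n, batch_size) # indices of this step's minibatch
    p <- 1 / (1 + exp(-beta * x[M])) # P_beta(Y = 1 | X) on the minibatch
    beta <- beta - gamma * mean(-x[M] * (y[M] - p)) # approximate gradient step
  }
  beta
}
```

Once the data `x` and `y` are simulated later in the deck, something like `sgd_logistic(x, y)` runs this loop.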

## SGD

@@ -316,12 +340,19 @@ This is the workhorse for neural network optimization

## Gradient descent for logistic regression

$$
\begin{gathered}
\P(Y=1 \mid X=x) = h(\beta^\top x), \qquad \P(Y=0 \mid X=x) = h(-\beta^\top x),
\\
\\
h(z) = \tfrac{1}{1+e^{-z}}
\end{gathered}
$$

<!-- I want to model $P(Y=1| X=x)$. -->

<!-- I'll assume that $\log\left(\frac{p(x)}{1-p(x)}\right) = ax$ for some scalar $a$. This means that -->
<!-- $p(x) = \frac{\exp(ax)}{1+\exp(ax)} = \frac{1}{1+\exp(-ax)}$ -->

. . .

@@ -330,107 +361,133 @@
#| fig-width: 6
#| fig-height: 4
n <- 100
beta <- 2
x <- runif(n, -5, 5)
logit <- function(x) 1 / (1 + exp(-x))
p <- logit(beta * x)
y <- rbinom(n, 1, p)
df <- tibble(x, y)
ggplot(df, aes(x, y)) +
  geom_point(colour = "cornflowerblue") +
  stat_function(fun = ~ logit(beta * .x))
```



---

$$
\P(Y=1 \mid X=x) = h(\beta^\top x), \qquad \P(Y=0 \mid X=x) = h(-\beta^\top x)
$$

Under maximum likelihood

$$
\hat \beta = \argmin_{\beta} \underbrace{
\textstyle{\sum_{i=1}^n - \log( \P_\beta(Y_i=y_i \mid X_i=x_i) )}
}_{:= -\ell(\beta)}
$$

---

$$
\begin{align*}
\P_\beta(Y_i=y_i \mid X_i=x_i) &= h\left( [-1]^{1 - y_i} \beta^\top x_i \right)
\\
\\
-\ell(\beta) &= \sum_{i=1}^n -\log\left( \P_\beta(Y_i=y_i \mid X_i=x_i) \right)
\\
&= \sum_{i=1}^n \log\left( 1 + \exp\left( [-1]^{y_i} \beta^\top x_i \right) \right)
\\
\\
-\frac{\partial \ell}{\partial \beta}
&= \sum_{i=1}^n [-1]^{y_i} x_i \frac{\exp\left( [-1]^{y_i} \beta^\top x_i \right)}{1 + \exp\left( [-1]^{y_i} \beta^\top x_i \right)}
\\
&= -\sum_{i=1}^n x_i \left( y_i - \P_\beta(Y_i=1 \mid X_i=x_i) \right)
\end{align*}
$$
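
A quick numerical sanity check of that last line, as a sketch with made-up data; `nll`, `analytic`, and `num_grad` are illustrative helpers, and the finite-difference step is arbitrary:

```{r grad-check}
# Compare the derived gradient, -sum_i x_i (y_i - P_beta(Y_i = 1 | X_i = x_i)),
# with a central finite-difference derivative of the negative log likelihood.
set.seed(42)
x_chk <- runif(20, -2, 2)
y_chk <- rbinom(20, 1, 1 / (1 + exp(-2 * x_chk)))
nll <- function(b) sum(log(1 + exp((-1)^y_chk * b * x_chk)))
analytic <- function(b) -sum(x_chk * (y_chk - 1 / (1 + exp(-b * x_chk))))
num_grad <- function(b, h = 1e-6) (nll(b + h) - nll(b - h)) / (2 * h)
c(analytic = analytic(0.7), numeric = num_grad(0.7))
```

The two numbers should agree to several decimal places.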

<!-- $$ -->
<!-- L(y | a, x) = \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and } -->
<!-- p(x) = \frac{1}{1+\exp(-ax)} -->
<!-- $$ -->


<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \ell(y | a, x) &= \log \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i} -->
<!-- = \sum_{i=1}^n y_i\log p(x_i) + (1-y_i)\log(1-p(x_i))\\ -->
<!-- &= \sum_{i=1}^n\log(1-p(x_i)) + y_i\log\left(\frac{p(x_i)}{1-p(x_i)}\right)\\ -->
<!-- &=\sum_{i=1}^n ax_i y_i + \log\left(1-p(x_i)\right)\\ -->
<!-- &=\sum_{i=1}^n ax_i y_i + \log\left(\frac{1}{1+\exp(ax_i)}\right) -->
<!-- \end{aligned} -->
<!-- $$ -->

<!-- ## Reminder: the likelihood -->

<!-- $$ -->
<!-- L(y | a, x) = \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and } -->
<!-- p(x) = \frac{1}{1+\exp(-ax)} -->
<!-- $$ -->

<!-- Now, we want the negative of this. Why? -->

<!-- We would maximize the likelihood/log-likelihood, so we minimize the negative likelihood/log-likelihood (and scale by $1/n$) -->

<!-- $$-\ell(y | a, x) = \frac{1}{n}\sum_{i=1}^n -ax_i y_i - \log\left(\frac{1}{1+\exp(ax_i)}\right)$$ -->

<!-- ## Reminder: the likelihood -->

<!-- $$ -->
<!-- \frac{1}{n}L(y | a, x) = \frac{1}{n}\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and } -->
<!-- p(x) = \frac{1}{1+\exp(-ax)} -->
<!-- $$ -->

<!-- This is, in the notation of our slides $f(a)$. -->

<!-- We want to minimize it in $a$ by gradient descent. -->

<!-- So we need the derivative with respect to $a$: $f'(a)$. -->

<!-- Now, conveniently, this simplifies a lot. -->


<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \frac{d}{d a} f(a) &= \frac{1}{n}\sum_{i=1}^n -x_i y_i - \left(-\frac{x_i \exp(ax_i)}{1+\exp(ax_i)}\right)\\ -->
<!-- &=\frac{1}{n}\sum_{i=1}^n -x_i y_i + p(x_i)x_i = \frac{1}{n}\sum_{i=1}^n -x_i(y_i-p(x_i)). -->
<!-- \end{aligned} -->
<!-- $$ -->

<!-- ## Reminder: the likelihood -->

<!-- $$ -->
<!-- \frac{1}{n}L(y | a, x) = \frac{1}{n}\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and } -->
<!-- p(x) = \frac{1}{1+\exp(-ax)} -->
<!-- $$ -->

---

## Finding $\hat\beta = \argmin_{\beta} -\ell(\beta)$ with gradient descent:

1. Input $\beta_0,\ \gamma>0,\ j_\max,\ \epsilon>0,\ \tfrac{d}{d\beta}(-\ell)$.
2. For $j=1,\ 2,\ \ldots,\ j_\max$,
$$\beta_j = \beta_{j-1} - \gamma\, \tfrac{d}{d\beta}\left(-\ell(\beta_{j-1})\right)$$
3. Stop if $\epsilon > |\beta_j - \beta_{j-1}|$ or $|d\ell / d\beta| < \epsilon$.


```{r betamle}
beta.mle <- function(x, y, beta0, gamma = 0.5, jmax = 50, eps = 1e-6) {
  beta <- double(jmax) # place to hold stuff (always preallocate space)
  beta[1] <- beta0 # starting value
  for (j in 2:jmax) { # avoid possibly infinite while loops
    px <- logit(beta[j - 1] * x)
    grad <- mean(-x * (y - px))
    beta[j] <- beta[j - 1] - gamma * grad
    if (abs(grad) < eps || abs(beta[j] - beta[j - 1]) < eps) break
  }
  beta[1:j]
}
```

@@ -439,23 +496,18 @@
## Try it:

```{r ourmle1}
too_big <- beta.mle(x, y, beta0 = 5, gamma = 50)
too_small <- beta.mle(x, y, beta0 = 5, gamma = 1)
just_right <- beta.mle(x, y, beta0 = 5, gamma = 10)
```
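
As a sanity check, the final iterate should roughly match the maximum likelihood estimate from R's built-in `glm()` for the same intercept-free logistic model:

```{r glm-check}
# glm() with family = binomial fits the same model by maximum likelihood;
# `y ~ x - 1` drops the intercept to match P(Y = 1 | X = x) = h(beta * x).
c(
  gradient_descent = unname(tail(just_right, 1)),
  glm = unname(coef(glm(y ~ x - 1, family = binomial)))
)
```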


## Visual



```{r}
#| output-location: column
#| fig-width: 7
#| fig-height: 7
negll <- function(beta) {
  -beta * mean(x * y) -
    rowMeans(log(1 / (1 + exp(outer(beta, x)))))
}
blah <- list_rbind(
map(
Expand All @@ -470,7 +522,7 @@ ggplot(blah, aes(value, negll)) +
  facet_wrap(~gamma, ncol = 1) +
  stat_function(fun = negll, xlim = c(-2.5, 5)) +
  scale_y_log10() +
  xlab("beta") +
  ylab("negative log likelihood") +
  geom_vline(xintercept = tail(just_right, 1)) +
  scale_colour_brewer(palette = "Set1") +