Commit

revise 09 lasso
dajmcdon committed Sep 19, 2023
1 parent d48628b commit 6e104e0
Showing 18 changed files with 5,958 additions and 950 deletions.

20 changes: 20 additions & 0 deletions _freeze/schedule/slides/09-l1-penalties/execute-results/html.json
@@ -0,0 +1,20 @@
{
"hash": "ce68719ed0f65c19464f8652dc6457be",
"result": {
"markdown": "---\nlecture: \"09 L1 penalties\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 18 September 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n## Last time\n\n\nRidge regression\n: $\\min \\frac{1}{n}\\snorm{\\y-\\X\\beta}_2^2 \\st \\snorm{\\beta}_2^2 \\leq s$ \n\nBest (in sample) linear regression model of size $s$\n: $\\min \\frac 1n \\snorm{\\y-\\X\\beta}_2^2 \\st \\snorm{\\beta}_0 \\leq s$\n\n$\\snorm{\\beta}_0$ is the number of nonzero elements in $\\beta$\n\nFinding the \"best\" linear model (of size $s$, among these predictors, in sample) is a nonconvex optimization problem (In fact, it is NP-hard)\n\nRidge regression is convex (easy to solve), but doesn't do variable selection\n\nCan we somehow \"interpolate\" to get both?\n\n\n\n## Geometry of convexity\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(mvtnorm)\nnormBall <- function(q = 1, len = 1000) {\n tg <- seq(0, 2 * pi, length = len)\n out <- data.frame(x = cos(tg)) %>%\n mutate(b = (1 - abs(x)^q)^(1 / q), bm = -b) %>%\n gather(key = \"lab\", value = \"y\", -x)\n out$lab <- paste0('\"||\" * beta * \"||\"', \"[\", signif(q, 2), \"]\")\n return(out)\n}\n\nellipseData <- function(n = 100, xlim = c(-2, 3), ylim = c(-2, 3),\n mean = c(1, 1), Sigma = matrix(c(1, 0, 0, .5), 2)) {\n df <- expand.grid(\n x = seq(xlim[1], xlim[2], length.out = n),\n y = seq(ylim[1], ylim[2], length.out = n)\n )\n df$z <- dmvnorm(df, mean, Sigma)\n df\n}\n\nlballmax <- function(ed, q = 1, tol = 1e-6) {\n ed <- filter(ed, x > 0, y > 0)\n for (i in 1:20) {\n ff <- abs((ed$x^q + ed$y^q)^(1 / q) - 1) < tol\n if (sum(ff) > 0) break\n tol <- 2 * tol\n }\n best <- ed[ff, ]\n best[which.max(best$z), ]\n}\n\nnbs <- list()\nnbs[[1]] <- normBall(0, 1)\nqs <- c(.5, .75, 1, 1.5, 2)\nfor (ii in 2:6) nbs[[ii]] <- normBall(qs[ii - 1])\nnbs <- bind_rows(nbs)\nnbs$lab <- factor(nbs$lab, levels = unique(nbs$lab))\nseg <- data.frame(\n lab = levels(nbs$lab)[1],\n x0 = c(-1, 0), x1 = c(1, 0), y0 = c(0, -1), y1 = c(0, 1)\n)\nlevels(seg$lab) <- levels(nbs$lab)\nggplot(nbs, aes(x, y)) +\n geom_path(size = 1.2) +\n facet_wrap(~lab, labeller = label_parsed) +\n geom_segment(data = seg, aes(x = x0, xend = x1, y = y0, yend = y1), size = 1.2) +\n theme_bw(base_family = \"\", base_size = 24) +\n coord_equal() +\n 
scale_x_continuous(breaks = c(-1, 0, 1)) +\n scale_y_continuous(breaks = c(-1, 0, 1)) +\n geom_vline(xintercept = 0, size = .5) +\n geom_hline(yintercept = 0, size = .5) +\n xlab(bquote(beta[1])) +\n ylab(bquote(beta[2]))\n```\n\n::: {.cell-output-display}\n![](09-l1-penalties_files/figure-revealjs/plotting-functions-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## The best of both worlds\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nnb <- normBall(1)\ned <- ellipseData()\nbols <- data.frame(x = 1, y = 1)\nbhat <- lballmax(ed, 1)\nggplot(nb, aes(x, y)) +\n geom_path(color = red) +\n geom_contour(mapping = aes(z = z), color = blue, data = ed, bins = 7) +\n geom_vline(xintercept = 0) +\n geom_hline(yintercept = 0) +\n geom_point(data = bols) +\n coord_equal(xlim = c(-2, 2), ylim = c(-2, 2)) +\n theme_bw(base_family = \"\", base_size = 24) +\n geom_label(\n data = bols, mapping = aes(label = bquote(\"hat(beta)[ols]\")), parse = TRUE,\n nudge_x = .3, nudge_y = .3\n ) +\n geom_point(data = bhat) +\n xlab(bquote(beta[1])) +\n ylab(bquote(beta[2])) +\n geom_label(\n data = bhat, mapping = aes(label = bquote(\"hat(beta)[s]^L\")), parse = TRUE,\n nudge_x = -.4, nudge_y = -.4\n )\n```\n\n::: {.cell-output-display}\n![](09-l1-penalties_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\nThis regularization set...\n\n* ... is convex (computationally efficient)\n* ... has corners (performs variable selection)\n\n\n## $\\ell_1$-regularized regression\n\nKnown as \n\n* \"lasso\"\n* \"basis pursuit\"\n\nThe estimator satisfies\n\n$$\\blt = \\argmin_{ \\snorm{\\beta}_1 \\leq s} \\frac{1}{n}\\snorm{\\y-\\X\\beta}_2^2$$\n\n\nIn its corresponding Lagrangian dual form:\n\n$$\\bll = \\argmin_{\\beta} \\frac{1}{n}\\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_1$$\n\n\n## Lasso\n\nWhile the ridge solution can be easily computed \n\n$$\\brl = \\argmin_{\\beta} \\frac 1n \\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_2^2 = (\\X^{\\top}\\X + \\lambda \\mathbf{I})^{-1} \\X^{\\top}\\y$$\n\n\nthe lasso solution\n\n\n$$\\bll = \\argmin_{\\beta} \\frac 1n\\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_1 = \\; ??$$\n\ndoesn't have a closed form solution.\n\n\nHowever, because the optimization problem is convex, there exist efficient algorithms for computing it\n\n(The best are Iterative Soft Thresholding or Coordinate Descent. 
Gradient Descent doesn't work very well in practice.)\n\n\n## Coefficient path: ridge vs lasso\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(glmnet)\ndata(prostate, package = \"ElemStatLearn\")\nX <- prostate %>% dplyr::select(-train,-lpsa) %>% as.matrix()\nY <- prostate$lpsa\nlasso <- glmnet(x = X, y = Y) # alpha = 1 by default\nridge <- glmnet(x = X, y = Y, alpha = 0)\nop <- par()\npar(mfrow = c(1, 2), mar = c(5, 3, 3, .1))\nplot(lasso, main = \"Lasso\")\nplot(ridge, main = \"Ridge\")\n```\n\n::: {.cell-output-display}\n![](09-l1-penalties_files/figure-revealjs/ridge-v-lasso-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n\n\n\n## Same, but against $\\lambda$\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npar(mfrow = c(1, 2), mar = c(5, 3, 3, .1))\nplot(lasso, main = \"Lasso\", xvar = \"lambda\")\nplot(ridge, main = \"Ridge\", xvar = \"lambda\")\n```\n\n::: {.cell-output-display}\n![](09-l1-penalties_files/figure-revealjs/ridge-v-lasso-again-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n## Why does Lasso select variables?\n\nSuppose I have solutions $\\widehat{\\beta}_k$ for $k = 1,\\ldots,j-1, j+1,\\ldots,p$ (that is, every coefficient except the $j$th).\n\nLet $\\widehat{\\y}_{-j} = \\X_{-j}\\widehat{\\beta}_{-j}$.\n\nOne can show that:\n\n$$\n\\widehat{\\beta}_j = S\\left(\\mathbf{X}^\\top_j(\\y - \\widehat{\\y}_{-j}),\\ \\lambda\\right).\n$$\n\n$$\nS(z, \\gamma) = \\textrm{sign}(z)(|z| - \\gamma)_+ = \\begin{cases} z - \\gamma & z > \\gamma\\\\\nz + \\gamma & z < -\\gamma \\\\ 0 & |z| \\leq \\gamma \\end{cases}\n$$\n\nIterating over this is called [coordinate descent]{.secondary} and gives the solution.\n\n::: aside\nSee, for example, <https://doi.org/10.18637/jss.v033.i01>\n:::\n\n\n::: notes\n* If I were told all the other coefficient estimates.\n* Then to find this one, I'd shrink when the gradient is big, or set to 0 if it\ngets too small.\n:::\n\n## Packages\n\nThere are two main `R` implementations for fitting the lasso.\n\n\n`{glmnet}`: `lasso.out = glmnet(X, Y, alpha=1)`.\n\n* Setting `alpha = 0` gives ridge regression (as does `lm.ridge` in the `MASS` package)\n* Setting `alpha` $\\in (0,1)$ gives a method called the \"elastic net\", which combines ridge regression and lasso; more on that next lecture.\n* If you don't specify `alpha`, it does lasso.\n\n`{lars}`: `lars.out = lars(X, Y)`\n\n* `lars` also does other things called \"Least angle\" and \"forward stagewise\" in addition to \"forward stepwise\" regression\n\n\n## (lots of others, but these are the biggies)\n\n1. `lars` (this one came first)\n\n2. `glmnet` (this one is faster)\n\nThey use different algorithms, but both compute the solution for a range of $\\lambda$.\n\n`lars` starts with an empty model and adds coefficients until saturated. The sequence of $\\lambda$'s comes from the nature of the optimization problem.\n\n`glmnet` starts with an empty model and examines each value of $\\lambda$ using previous values as \"warm starts\". It is generally much faster than `lars` and uses lots of other tricks (as well as compiled code) for extra speed.\n\nThe path returned by `lars` is more useful than that returned by `glmnet`.\n\n. . .
\n\nBut you should use `glmnet`.\n\n\n\n\n\n\n\n## Choosing the lambda\n\nYou have to choose $\\lambda$ in lasso or in ridge regression.\n\nLasso selects variables (by setting coefficients to zero), but the value of $\\lambda$ determines how many/which.\n\nAll of these packages come with CV built in.\n\nHowever, the way to do it differs from package to package.\n\n. . .\n\n<p align=\"center\"><iframe src=\"https://giphy.com/embed/fYfeQAOD8pSjN7M0jY\" width=\"480\" height=\"270\" frameBorder=\"0\" class=\"giphy-embed\"></iframe></p>\n\n\n\n## `glmnet` version (lasso or ridge)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# 1. Estimate cv and model at once, no formula version\nlasso.glmnet <- cv.glmnet(X, Y) \n# no good reason to call glmnet() itself\n# 2. Look at the CV curve\n# 3. If the dashed lines are at the boundaries, redo with better lambda\nbest.lam <- lasso.glmnet$lambda.min \n# the value, not the location (or use lasso.glmnet$lambda.1se)\n# 4. Return the coefs/predictions for the best model\ncoefs.glmnet <- coefficients(lasso.glmnet, s = \"lambda.min\")\npreds.glmnet <- predict(lasso.glmnet, newx = X, s = \"lambda.1se\") \n# must supply `newx`\n```\n:::\n\n\n* $\\widehat{R}_{CV}$ is an estimator of $R_n$; it has bias and variance.\n* Because we did CV, we actually have 10 $\\widehat{R}$ values, 1 per split.\n* Calculate the mean (that's what we've been using), but what about SE?\n\n## `glmnet` version (lasso or ridge)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npar(mfrow = c(1, 2), mar = c(5, 3, 3, 0))\nplot(lasso.glmnet) # a plot method for the cv fit\nplot(lasso.glmnet$glmnet.fit) # the glmnet.fit == glmnet(X,Y)\n```\n\n::: {.cell-output-display}\n![](09-l1-penalties_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Paths with chosen lambda\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nridge.glmnet <- cv.glmnet(X, Y, alpha = 0, lambda.min.ratio = 1e-10) # added to get a minimum\npar(mfrow = c(1, 4))\nplot(ridge.glmnet, main = \"Ridge\")\nplot(lasso.glmnet, main = \"Lasso\")\nplot(ridge.glmnet$glmnet.fit, main = \"Ridge\")\nabline(v = sum(abs(coef(ridge.glmnet)))) # defaults to `lambda.1se`\nplot(lasso.glmnet$glmnet.fit, main = \"Lasso\")\nabline(v = sum(abs(coef(lasso.glmnet)))) # again, `lambda.1se` unless told otherwise\n```\n\n::: {.cell-output-display}\n![](09-l1-penalties_files/figure-revealjs/unnamed-chunk-7-1.svg){fig-align='center'}\n:::\n:::\n\n\n---\n\n## Degrees of freedom\n\nRemember: Lasso is **not** a linear smoother. There is no matrix $S$ such that $\\widehat{y} = Sy$ for the predicted values from lasso.\n\n* We can't use `cv_nice()`.\n\n* We don't have $\\tr{S} = \\textrm{df}$ because there is no $S$.\n\nHowever,\n\n* One can show that $\\textrm{df}_\\lambda = \\E[\\#(\\widehat{\\beta}_\\lambda \\neq 0)] = \\E[||\\widehat{\\beta}_\\lambda||_0]$\n\n* The proof is PhD-level material.\n\nNote that $\\widehat{\\textrm{df}}_\\lambda$ is plotted on the CV plot and that `lasso.glmnet$glmnet.fit$df` contains this value for all $\\lambda$.\n\n## Other flavours\n\nThe elastic net\n: combines the ridge and lasso penalties; generally used when the variables are correlated. Included in `{glmnet}` (0 < `alpha` < 1).\n\nGrouped lasso\n: where variables are included or excluded in groups. Required for factors (1-hot encoding).\n\nRelaxed lasso\n: Takes the estimated model from lasso and fits the full least squares solution on the selected covariates (less bias, more variance).
Included in `{glmnet}` with `relax = TRUE`.\n\nDantzig selector\n: a slightly modified version of the lasso\n\n## Lasso cinematic universe\n\n::: flex\n::: w-60\n\nSCAD\n: a non-convex version of lasso that adds a more severe variable selection penalty\n\n$\\sqrt{\\textrm{lasso}}$\n: claims to be tuning parameter free (but isn't). Uses $||\\cdot||_2$\ninstead of $||\\cdot||_2^2$ for the loss.\n\nGeneralized lasso\n: adds various additional matrices to the penalty term (e.g., $||D\\beta||_1$).\n\n:::\n\n::: w-40\n\n![](https://sportshub.cbsistatic.com/i/2022/08/10/d348f903-585f-4aa6-aebc-d05173761065/brett-goldstein-hercules.jpg)\n\n:::\n:::\n\n\n# Next time...\n\nWhat happens when we're tired of all this linearity.\n",
"supporting": [
"09-l1-penalties_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {
"include-after-body": [
"\n<script>\n // htmlwidgets need to know to resize themselves when slides are shown/hidden.\n // Fire the \"slideenter\" event (handled by htmlwidgets.js) when the current\n // slide changes (different for each slide format).\n (function () {\n // dispatch for htmlwidgets\n function fireSlideEnter() {\n const event = window.document.createEvent(\"Event\");\n event.initEvent(\"slideenter\", true, true);\n window.document.dispatchEvent(event);\n }\n\n function fireSlideChanged(previousSlide, currentSlide) {\n fireSlideEnter();\n\n // dispatch for shiny\n if (window.jQuery) {\n if (previousSlide) {\n window.jQuery(previousSlide).trigger(\"hidden\");\n }\n if (currentSlide) {\n window.jQuery(currentSlide).trigger(\"shown\");\n }\n }\n }\n\n // hookup for slidy\n if (window.w3c_slidy) {\n window.w3c_slidy.add_observer(function (slide_num) {\n // slide_num starts at position 1\n fireSlideChanged(null, w3c_slidy.slides[slide_num - 1]);\n });\n }\n\n })();\n</script>\n\n"
]
},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
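
The "Why does Lasso select variables?" slide in the frozen markdown above quotes the soft-thresholding update beta_j = S(X_j'(y - yhat_{-j}), lambda) and notes that cycling it over coordinates gives the lasso solution. Below is a minimal R sketch of that idea, assuming the objective is normalized as (1/(2n))||y - X beta||^2 + lambda ||beta||_1; the helpers `soft_threshold()` and `lasso_cd()`, the simulated data, and the 1/n factor (which makes the lambda scale differ slightly from the slide's display) are illustrative assumptions, not part of the lecture code. In practice you would call `glmnet()` as the slides recommend.

```r
# Illustrative cyclic coordinate descent for the lasso (a sketch, not the slides' code),
# assuming the objective (1/(2n)) * ||y - X %*% beta||^2 + lambda * ||beta||_1.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  p <- ncol(X)
  beta <- rep(0, p)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residual, leaving X_j out
      z_j <- drop(crossprod(X[, j], r_j)) / n         # (1/n) X_j' (y - yhat_{-j})
      beta[j] <- soft_threshold(z_j, lambda) / mean(X[, j]^2)
    }
  }
  beta
}

# Toy data: only the first two coefficients are truly nonzero
set.seed(406)
n <- 100
p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(2, -1, 0, 0, 0) + rnorm(n))

lasso_cd(X, y, lambda = 0.1)
# should be close (up to convergence tolerance and scaling conventions) to
# coef(glmnet::glmnet(X, y, lambda = 0.1, standardize = FALSE, intercept = FALSE))
```

Whenever the inner product between X_j and the partial residual is small relative to the threshold, the update returns exactly zero, which is the variable-selection behaviour the slide describes.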
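Similarly, the "Degrees of freedom" slide states that df_lambda for the lasso is the expected number of nonzero coefficients and that `lasso.glmnet$glmnet.fit$df` stores this quantity. A quick verification sketch of that correspondence, reusing the slides' prostate example (this assumes `{glmnet}` and `{ElemStatLearn}` are installed and is not part of the lecture code):

```r
library(glmnet)
data(prostate, package = "ElemStatLearn")
X <- as.matrix(prostate[, setdiff(names(prostate), c("train", "lpsa"))])
Y <- prostate$lpsa

lasso <- glmnet(X, Y)                                  # full lasso solution path
nonzero <- colSums(as.matrix(coef(lasso))[-1, ] != 0)  # nonzero betas at each lambda (intercept dropped)
all(nonzero == lasso$df)                               # TRUE: reported df = count of nonzero coefficients
```

These are the same counts plotted along the top axis of the CV plot produced by `plot(lasso.glmnet)` in the deck.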