Commit

NN slide update
gpleiss committed Nov 7, 2024
1 parent 5305062 commit 2b2b222
Showing 4 changed files with 168 additions and 175 deletions.
@@ -1,7 +1,8 @@
{
"hash": "1764e2e9b2555daeb42521f17b1bf76f",
"hash": "e8f3066c53064833ee8e08cf1ccfd9ce",
"result": {
"markdown": "---\nlecture: \"21 Neural nets\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 02 November 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n## Overview\n\nNeural networks are models for supervised\nlearning\n\n \nLinear combinations of features are passed\nthrough a non-linear transformation in successive layers\n\n \nAt the top layer, the resulting latent\nfactors are fed into an algorithm for\npredictions\n\n(Most commonly via least squares or logistic loss)\n\n \n\n\n\n## Background\n\n::: flex\n::: w-50\n\nNeural networks have come about in 3 \"waves\" \n\nThe first was an attempt in the 1950s to model the mechanics of the human brain\n\nIt appeared the brain worked by\n\n- taking atomic units known as [neurons]{.tertiary},\n which can be \"on\" or \"off\"\n- putting them in [networks]{.tertiary} \n\nA neuron itself interprets the status of other neurons\n\nThere weren't really computers, so we couldn't estimate these things\n:::\n\n::: w-50\n\n\n![](https://miro.medium.com/v2/resize:fit:870/0*j0gW8xn8GkL7MrOs.gif){fig-align=\"center\" width=600}\n\n:::\n:::\n\n## Background\n\nAfter the development of parallel, distributed computation in the 1980s,\nthis \"artificial intelligence\" view was diminished\n\nAnd neural networks gained popularity \n\nBut, the growing popularity of SVMs and boosting/bagging in the late\n1990s, neural networks again fell out of favor\n\nThis was due to many of the problems we'll discuss (non-convexity being\nthe main one)\n\n. . .\n\n \nState-of-the-art performance on various classification\ntasks has been accomplished via neural networks\n\nToday, Neural Networks/Deep Learning are the hottest...\n\n\n\n\n\n## High level overview\n\n\n![](gfx/single-layer-net.svg){fig-align=\"center\" height=500}\n\n\n\n\n## Recall nonparametric regression\n\nSuppose $Y \\in \\mathbb{R}$ and we are trying estimate\nthe regression function $$\\Expect{Y\\given X} = f_*(X)$$\n\n \nIn Module 2, we discussed basis expansion, \n\n\n\n1. We know $f_*(x) =\\sum_{k=1}^\\infty \\beta_k \\phi_k(x)$ some basis \n$\\phi_1,\\phi_2,\\ldots$\n\n2. Truncate this expansion at $K$: \n$f_*^K(x) \\approx \\sum_{k=1}^K \\beta_k \\phi_k(x)$\n\n3. 
Estimate $\\beta_k$ with least squares\n\n\n## Recall nonparametric regression\n\nThe weaknesses of this approach are:\n\n- The basis is fixed and independent of the data\n- If $p$ is large, then nonparametrics doesn't work well at all (recall the Curse of Dimensionality)\n- If the basis doesn't \"agree\" with $f_*$, then $K$ will have to be\n large to capture the structure\n- What if parts of $f_*$ have substantially different structure? Say $f_*(x)$ really wiggly for $x \\in [-1,3]$ but smooth elsewhere\n\nAn alternative would be to have the data\n[tell]{.secondary} us what kind of basis to use (Module 5)\n\n\n## 1-layer for Regression\n\n::: flex\n::: w-50\n\nA single layer neural network model is\n$$\n\\begin{aligned}\n&f(x) = \\sum_{k=1}^K \\beta_k h_k(x) \\\\\n&= \\sum_{k=1}^K \\beta_k \\ g(w_k^{\\top}x)\\\\\n&= \\sum_{k=1}^K \\beta_k \\ A_k\\\\\n\\end{aligned}\n$$\n\n[Compare:]{.secondary} A nonparametric regression\n$$f(x) = \\sum_{k=1}^K \\beta_k {\\phi_k(x)}$$\n\n:::\n\n::: w-50\n\n![](gfx/single-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\n\n\n\n## Terminology\n\n$$f(x) = \\sum_{k=1}^{{K}} {\\beta_k} {g( w_k^{\\top}x)}$$\nThe main components are\n\n- The derived features ${A_k = g(w_k^{\\top}x)}$ and are called the [hidden units]{.secondary} or [activations]{.secondary}\n- The function $g$ is called the [activation function]{.secondary} (more on this later)\n- The parameters\n${\\beta_k},{w_k}$ are estimated from the data for all $k = 1,\\ldots, K$.\n- The number of hidden units ${K}$ is a tuning\n parameter\n \n$$f(x) = \\sum_{k=1}^{{K}} \\beta_0 + {\\beta_k} {g(w_{k0} + w_k^{\\top}x)}$$\n\n- Could add $\\beta_0$ and $w_{k0}$. Called [biases]{.secondary} \n(I'm going to ignore them. It's just an intercept) \n\n\n## Terminology\n\n$$f(x) = \\sum_{k=1}^{{K}} {\\beta_k} {g(w_k^{\\top}x)}$$\n\n\nNotes (no biases):\n\n<br/>\n\n$\\beta \\in \\R^k$ \n\n$w_k \\in \\R^p,\\ k = 1,\\ldots,K$ \n\n$\\mathbf{W} \\in \\R^{K\\times p}$\n\n\n## What about classification (10 classes, 2 layers)\n\n\n::: flex\n::: w-40\n\n$$\n\\begin{aligned}\nA_k^{(1)} &= g\\left(\\sum_{j=1}^p w^{(1)}_{k,j} x_j\\right)\\\\\nA_\\ell^{(2)} &= g\\left(\\sum_{k=1}^{K_1} w^{(2)}_{\\ell,k} A_k^{(1)} \\right)\\\\\nz_m &= \\sum_{\\ell=1}^{K_2} \\beta_{m,\\ell} A_\\ell^{(2)}\\\\\nf_m(x) &= \\frac{1}{1 + \\exp(-z_m)}\\\\\n\\end{aligned}\n$$\n\n:::\n::: w-60\n\n![](gfx/two-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\nPredict class with largest probability \n$\\longrightarrow\\ \\widehat{Y} = \\argmax_{m} f_m(x)$\n\n## What about classification (10 classes, 2 layers)\n\n::: flex\n::: w-40\n\nNotes:\n\n$B \\in \\R^{M\\times K_2}$ (here $M=10$). \n\n$\\mathbf{W}_2 \\in \\R^{K_2\\times K_1}$ \n\n$\\mathbf{W}_1 \\in \\R^{K_1\\times p}$\n\n:::\n::: w-60\n\n![](gfx/two-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\n\n## Two observations\n\n\n1. 
The $g$ function generates a [feature map]{.secondary}\n\nWe start with $p$ covariates and we generate $K$ features (1-layer)\n\n::: flex\n\n::: w-50\n\n[Logistic / Least-squares with a polynomial transformation]{.tertiary}\n\n$$\n\\begin{aligned}\n&\\Phi(x) \\\\\n& = \n(1, x_1, \\ldots, x_p, x_1^2,\\ldots,x_p^2,\\ldots\\\\\n& \\quad \\ldots x_1x_2, \\ldots, x_{p-1}x_p) \\\\\n& =\n(\\phi_1(x),\\ldots,\\phi_{K_2}(x))\\\\\nf(x) &= \\sum_{k=1}^{K_2} \\beta_k \\phi_k(x) = \\beta^\\top \\Phi(x)\n\\end{aligned}\n$$\n\n:::\n\n::: w-50\n[Neural network]{.secondary}\n\n\n\n$$\\begin{aligned}\nA_k &= g\\left( \\sum_{j=1}^p w_{kj}x_j\\right) = g\\left( w_{k}^{\\top}x\\right)\\\\\n\\Phi(x) &= (A_1,\\ldots, A_K)^\\top \\in \\mathbb{R}^{K}\\\\\nf(x) &=\\beta^{\\top} \\Phi(x)=\\beta^\\top A\\\\ \n&= \\sum_{k=1}^K \\beta_k g\\left( \\sum_{j=1}^p w_{kj}x_j\\right)\\end{aligned}$$\n\n:::\n:::\n\n## Two observations\n\n2. If $g(u) = u$, (or $=3u$) then neural networks reduce to (massively underdetermined) ordinary least squares (try to show this)\n\n* ReLU is the current fashion (used to be tanh or logistic)\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](21-nnets-intro_files/figure-revealjs/sigmoid-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n# Next time...\n\nHow do we estimate these monsters?\n",
"engine": "knitr",
"markdown": "---\nlecture: \"21 Neural nets\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 November 2024\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n## Overview\n\nNeural networks are models for supervised\nlearning\n\n \nLinear combinations of features are passed\nthrough a non-linear transformation in successive layers\n\n \nAt the top layer, the resulting latent\nfactors are fed into an algorithm for\npredictions\n\n(Most commonly via least squares or logistic loss)\n\n \n\n\n\n## Background\n\n::: flex\n::: w-50\n\nNeural networks have come about in 3 \"waves\" \n\nThe first was an attempt in the 1950s to model the mechanics of the human brain\n\nIt appeared the brain worked by\n\n- taking atomic units known as [neurons]{.tertiary},\n which can be \"on\" or \"off\"\n- putting them in [networks]{.tertiary} \n\nA neuron itself interprets the status of other neurons\n\nThere weren't really computers, so we couldn't estimate these things\n:::\n\n::: w-50\n\n\n![](https://miro.medium.com/v2/resize:fit:870/0*j0gW8xn8GkL7MrOs.gif){fig-align=\"center\" width=600}\n\n:::\n:::\n\n## Background\n\nAfter the development of parallel, distributed computation in the 1980s,\nthis \"artificial intelligence\" view was diminished\n\nAnd neural networks gained popularity \n\nBut, the growing popularity of SVMs and boosting/bagging in the late\n1990s, neural networks again fell out of favor\n\nThis was due to many of the problems we'll discuss (non-convexity being\nthe main one)\n\n. . .\n\n \nState-of-the-art performance on various classification\ntasks has been accomplished via neural networks\n\nToday, Neural Networks/Deep Learning are the hottest...\n\n\n\n\n\n## High level overview\n\n\n![](gfx/single-layer-net.svg){fig-align=\"center\" height=500}\n\n\n\n\n## Recall basis regression\n\nSuppose $Y \\in \\mathbb{R}$ and we are trying estimate\nthe regression function $$\\Expect{Y\\given X} = f_*(X)$$\n\n \nIn Module 2, we discussed basis expansion, \n\n\n\n1. We know $f_*(x) =\\sum_{k=1}^\\infty \\beta_k \\phi_k(x)$ some basis \n$\\phi_1,\\phi_2,\\ldots$\n\n2. Truncate this expansion at $K$: \n$f_*^K(x) \\approx \\sum_{k=1}^K \\beta_k \\phi_k(x)$\n\n3. 
Estimate $\\beta_k$ with least squares\n\n\n## Recall basis regression\n\nThe weaknesses of this approach are:\n\n- The basis is fixed and independent of the data\n- If the basis doesn't \"agree\" with $f_*$, then $K$ will have to be\n large to capture the structure\n- What if parts of $f_*$ have substantially different structure? Say $f_*(x)$ really wiggly for $x \\in [-1,3]$ but smooth elsewhere\n\nAn alternative would be to have the data\n[tell]{.secondary} us what kind of basis to use (Module 5)\n\n\n## 1-layer for Regression\n\n::: flex\n::: w-50\n\nA single layer neural network model is\n$$\n\\begin{aligned}\n&f(x) = \\sum_{k=1}^K \\beta_k h_k(x) \\\\\n&= \\sum_{k=1}^K \\beta_k \\ g(w_k^{\\top}x)\\\\\n&= \\sum_{k=1}^K \\beta_k \\ A_k\\\\\n\\end{aligned}\n$$\n\n[Compare:]{.secondary} A nonparametric regression\n$$f(x) = \\sum_{k=1}^K \\beta_k {\\phi_k(x)}$$\n\n:::\n\n::: w-50\n\n![](gfx/single-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\n\n\n\n## Terminology\n\n$$f(x) = \\sum_{k=1}^{{K}} {\\beta_k} {g( w_k^{\\top}x)}$$\nThe main components are\n\n- The derived features ${A_k = g(w_k^{\\top}x)}$ and are called the [hidden units]{.secondary} or [activations]{.secondary}\n- The function $g$ is called the [activation function]{.secondary} (more on this later)\n- The parameters\n${\\beta_k},{w_k}$ are estimated from the data for all $k = 1,\\ldots, K$.\n- The number of hidden units ${K}$ is a tuning\n parameter\n \n$$f(x) = \\sum_{k=1}^{{K}} \\beta_0 + {\\beta_k} {g(w_{k0} + w_k^{\\top}x)}$$\n\n- Could add $\\beta_0$ and $w_{k0}$. Called [biases]{.secondary} \n(I'm going to ignore them. It's just an intercept) \n\n\n## Terminology\n\n$$f(x) = \\sum_{k=1}^{{K}} {\\beta_k} {g(w_k^{\\top}x)}$$\n\n\nNotes (no biases):\n\n<br/>\n\n$\\beta \\in \\R^k$ \n\n$w_k \\in \\R^p,\\ k = 1,\\ldots,K$ \n\n$\\mathbf{W} \\in \\R^{K\\times p}$\n\n\n## What about classification (10 classes, 2 layers)\n\n\n::: flex\n::: w-40\n\n$$\n\\begin{aligned}\nA_k^{(1)} &= g\\left(\\sum_{j=1}^p w^{(1)}_{k,j} x_j\\right)\\\\\nA_\\ell^{(2)} &= g\\left(\\sum_{k=1}^{K_1} w^{(2)}_{\\ell,k} A_k^{(1)} \\right)\\\\\nz_m &= \\sum_{\\ell=1}^{K_2} \\beta_{m,\\ell} A_\\ell^{(2)}\\\\\nf_m(x) &= \\frac{1}{1 + \\exp(-z_m)}\\\\\n\\end{aligned}\n$$\n\n:::\n::: w-60\n\n![](gfx/two-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\nPredict class with largest probability \n$\\longrightarrow\\ \\widehat{Y} = \\argmax_{m} f_m(x)$\n\n## What about classification (10 classes, 2 layers)\n\n::: flex\n::: w-40\n\nNotes:\n\n$B \\in \\R^{M\\times K_2}$ (here $M=10$). \n\n$\\mathbf{W}_2 \\in \\R^{K_2\\times K_1}$ \n\n$\\mathbf{W}_1 \\in \\R^{K_1\\times p}$\n\n:::\n::: w-60\n\n![](gfx/two-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\n## (Nonlinear) activation functions\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](21-nnets-intro_files/figure-revealjs/sigmoid-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Effect of depth\n\n![](gfx/nn_depth_features.png){fig-align=\"center\"}\n\n## Two observations\n\n\n1. 
The $g$ function generates a [feature map]{.secondary}\n\nWe start with $p$ covariates and we generate $K$ features (1-layer)\n\n::: flex\n\n::: w-50\n\n[Logistic / Least-squares with a polynomial transformation]{.tertiary}\n\n$$\n\\begin{aligned}\n&\\Phi(x) \\\\\n& = \n(1, x_1, \\ldots, x_p, x_1^2,\\ldots,x_p^2,\\ldots\\\\\n& \\quad \\ldots x_1x_2, \\ldots, x_{p-1}x_p) \\\\\n& =\n(\\phi_1(x),\\ldots,\\phi_{K_2}(x))\\\\\nf(x) &= \\sum_{k=1}^{K_2} \\beta_k \\phi_k(x) = \\beta^\\top \\Phi(x)\n\\end{aligned}\n$$\n\n:::\n\n::: w-50\n[Neural network]{.secondary}\n\n\n\n$$\\begin{aligned}\nA_k &= g\\left( \\sum_{j=1}^p w_{kj}x_j\\right) = g\\left( w_{k}^{\\top}x\\right)\\\\\n\\Phi(x) &= (A_1,\\ldots, A_K)^\\top \\in \\mathbb{R}^{K}\\\\\nf(x) &=\\beta^{\\top} \\Phi(x)=\\beta^\\top A\\\\ \n&= \\sum_{k=1}^K \\beta_k g\\left( \\sum_{j=1}^p w_{kj}x_j\\right)\\end{aligned}$$\n\n:::\n:::\n\n## Two observations\n\n2. If $g(u) = u$, (or $=3u$) then neural networks reduce to (massively underdetermined) ordinary least squares (try to show this)\n\n* ReLU is the current fashion (used to be tanh or logistic)\n\n\n# Next time...\n\nHow do we estimate these monsters?\n",
"supporting": [
"21-nnets-intro_files"
],
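The slides in this diff define a single-hidden-layer network $f(x) = \sum_{k=1}^K \beta_k\, g(w_k^{\top}x)$ and observe that a linear activation collapses it to (massively underdetermined) least squares. Below is a minimal R sketch of that forward pass. It is not part of the commit: the names `relu`, `forward`, `W`, and `beta` are illustrative, and the weights are drawn at random rather than fitted.

```r
set.seed(406)
n <- 100; p <- 4; K <- 10

X    <- matrix(rnorm(n * p), n, p)   # n observations, p covariates
W    <- matrix(rnorm(K * p), K, p)   # hidden-layer weights, K x p (as on the slides)
beta <- rnorm(K)                     # output weights, one per hidden unit

relu <- function(u) pmax(u, 0)       # the "current fashion" activation g

# Forward pass: A = g(X W^T) gives the K hidden units (activations),
# then f(x) = beta^T A combines them linearly.
forward <- function(X, W, beta, g = relu) {
  A <- g(X %*% t(W))                 # n x K matrix of activations A_k = g(w_k^T x)
  drop(A %*% beta)
}

fhat <- forward(X, W, beta)

# Observation 2 on the slides: with g(u) = u the model is linear in x,
# f(x) = x^T (W^T beta), i.e. an ordinary (underdetermined) linear model.
identity_g <- function(u) u
all.equal(forward(X, W, beta, identity_g), drop(X %*% (t(W) %*% beta)))
```

The check in the last line confirms the slides' second observation: without a nonlinear $g$, the hidden layer adds nothing beyond a reparameterized linear model. How $\beta$ and $\mathbf{W}$ are actually estimated is the subject of the next lecture in the deck.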
