diff --git a/_freeze/schedule/slides/06-information-criteria/execute-results/html.json b/_freeze/schedule/slides/06-information-criteria/execute-results/html.json index 468e96e..cf8c958 100644 --- a/_freeze/schedule/slides/06-information-criteria/execute-results/html.json +++ b/_freeze/schedule/slides/06-information-criteria/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "ca5ccdd8d498af68ad54649462b93383", + "hash": "d2c70e0828ce3bfb7884aa87372960ef", "result": { "engine": "knitr", - "markdown": "---\nlecture: \"06 Information Criteria\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 23 September 2024\n\n\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n## LOO-CV\n\n- 
Train $\\hat f_1$ on all but data point 1, calculate $\\tilde R_1 = \\ell(Y_1, \\hat f_1(X_1))$.\n- Do the same for each data point $i$, calculate $\\tilde R_i$\n- Estimate $R_n \\approx \\hat R_n = \\frac{1}{n}\\sum_{i=1}^n \\tilde R_i$\n\nHas low bias 🎉 and (probably) low variance 🎉 \n\n**Not often possible to use**: requires training $n$ models 🤮\n\n## LOO-CV: Math to the rescue!\n\nConsider models where predictions are a **linear function** of the training responses, i.e.,\n\n$$ \\hat{\\mathbf y} = {\\mathbf H} {\\mathbf y} $$\n\nwhere we collected terms into matrices and vectors (${\\mathbf h_i}$ can be any functions):\n\n- $\\hat{\\mathbf y} = \\begin{bmatrix} \\hat Y_1 & \\cdots & \\hat Y_n \\end{bmatrix}^\\top \\in \\mathbb R^{n}$\n- ${\\mathbf y} = \\begin{bmatrix} Y_1 & \\cdots & Y_n \\end{bmatrix}^\\top \\in \\mathbb R^{n}$\n- $\\mathbf H = \\begin{bmatrix} \\mathbf h_1(X_1) & \\cdots & \\mathbf h_n(X_n) \\end{bmatrix}^\\top \\in \\mathbb R^{n \\times n}$\n\n. . .\n\nFor example, OLS:\n\n$$ \\hat{\\mathbf y} = {\\mathbf X} \\hat \\beta, \\qquad \\hat\\beta = (\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X^\\top \\mathbf y $$\n\nBy inspection $\\mathbf H = \\mathbf X (\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X^\\top$\n\n\n## LOO-CV: Math to the rescue!\n\nFor models where predictions are a **linear function** of the training responses\\*,\n\n**LOO-CV has a closed-form expression!** Just need to fit *once*:\n\n$$\\mbox{LOO-CV} \\,\\, \\hat R_n = \\frac{1}{n} \\sum_{i=1}^n \\frac{(Y_i -\\widehat{Y}_i)^2}{(1-{\\boldsymbol H}_{ii})^2}.$$\n\n- Numerator is the _squared residual_ (loss) for training point $i$.\n- Denominator weights each residual by *diagonal of $H$* some factor \n- $H_{ii}$ are *leverage/hat values*: tell you what happens when moving data point $i$ a bit\n\n\\*: plus some technicalities \n\n. . .\n\n:::callout-tip\nDeriving this sucks. I wouldn't recommend doing it yourself. 
\n:::\n\n## Computing the formula\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncv_nice <- function(mdl) mean((residuals(mdl) / (1 - hatvalues(mdl)))^2)\n```\n:::\n\n\n\n\n## What happens when we can't use the formula?\n\n(And can we get a better intuition about what's going on?)\n\n$$\n\\hat{\\mathbf y} = \\mathbf H \\mathbf y,\n\\qquad\n\\mbox{LOO-CV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{(Y_i -\\widehat{Y}_i)^2}{(1-{\\boldsymbol H}_{ii})^2}\n$$\n\nLet's look at OLS again...\n$$ \\hat Y = X \\hat \\beta, \\qquad \\beta = (\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X^\\top \\mathbf y $$\n\nThis implies that $\\mathbf H = \\mathbf X (\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X^\\top$\n\n- One really nice property of $\\mathbf H$ is that $\\tr{\\mathbf H} = p$ (where $X_n \\in \\R^p$). [(Why?)]{.secondary}\n\n## Generalizing the LOO-CV formula\n\nLet's call $\\tr{\\mathbf H} = p$ the _degrees-of-freedom_ (or just _df_) of our OLS estimator.\\\n[(Intuition: we have $p$ parameters to fit, or $p$ \"degrees of freedom\")]{.secondary}\n\n\\\n**Idea:** in our LOO-CV formula, approximate each ${\\mathbf H}_{ii}$ with the average $\\frac 1 n \\sum_{i=1}^n {\\mathbf H}_{ii}$.\\\n\\\nThen...\n\n$$\n\\mbox{LOO-CV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{(y_i -\\widehat{y}_i)^2}{(1-{\\mathbf H}_{ii})^2} \\approx \\frac{\\text{MSE}}{(1-\\text{df}/n)^2} \\triangleq \\text{GCV}\n$$\n\nGCV stands for **Generalized CV** estimate\n\n\n\n\n## Generalized CV\n\n$$\\textrm{GCV} = \\frac{\\textrm{MSE}}{(1-\\textrm{df}/n)^2}$$\n\nWe can use this formula for models that aren't of the form $\\widehat{y}_i = \\boldsymbol h_i(\\mathbf{X})^\\top \\mathbf{y}$.\\\n(Assuming we have some model-specific formula for estimating $\\textrm{df}$.)\n\n. . 
.\n\n### Observations\n\n- GCV > training error (Why?)\n- What happens as $n$ increases?\n- What happens as $\\text{df}$ ($p$ in our OLS model) increases?\n\n\n## Mallows $C_p$\n\nLet's see if we can generalize risk estimators from OLS ($Y \\sim \\mathcal{N}(X^T\\beta, \\sigma^2)$) in other ways. \n\nConsider the *estimation risk* of estimating $\\mu_i = X_i^T\\beta$ with $\\hat Y_i = X_i^T\\hat\\beta$: \n\n$$R_n = E\\left[\\frac{1}{n}\\sum_{i=1}^n (\\hat Y_i - \\mu_i)^2\\right]$$\n\nUsing the usual decomposition tricks:\n\n$$\nR_n= \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} \n= \\underbrace{\\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2}}_{\\text{train MSE}} -\n\\underbrace{\\sigma^2}_{\\text{noise}} +\n\\underbrace{\\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}}_{\\text{???}}\n$$\n\n\n\n## Mallows $C_p$\n\n$$\\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} =\n\\underbrace{\\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2}}_{\\text{training error}} -\n\\underbrace{\\sigma^2}_{\\text{noise}} +\n\\underbrace{\\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}}_{\\text{???}}\n$$\n\nRecall that $\\widehat{\\mathbf{Y}} = \\mathbf H \\mathbf{Y}$ for some matrix $\\mathbf H$,\n\n$\\sum\\Cov{Y_i}{\\widehat Y_i} = \\Expect{\\mathbf{Y}^\\top \\mathbf H \\mathbf{Y}} = \\sigma^2 \\textrm{tr}(\\mathbf H)$\n\n\nThis gives _Mallow's $C_p$_ aka _Stein's Unbiased Risk Estimator_:\n\n$$ C_p = \\text{MSE} + 2\\hat{\\sigma}^2 \\: \\textrm{df}/n $$\n\n## Mallow's $C_p$\n\n$$ C_p = \\text{MSE} + 2\\hat{\\sigma}^2 \\: \\textrm{df}/n$$\n(We derived it for the OLS model, but again it can be generalized to other models.)\n\n::: callout-important\nUnfortunately, $\\text{df}$ may be difficult or impossible to calculate for complicated\nprediction methods. But one can often estimate it well. 
This idea is beyond\nthe level of this course.\n:::\n\n### Observations\n- $C_p$ > training error\n- What happens as $n$ increases?\n- What happens as $\\text{df}$ ($p$ in our OLS model) increases?\n- What happens as the irreducible noise increase?\n\n\n## AIC and BIC\n\nThese have a very similar flavor to $C_p$, but their genesis is different.\n\nWithout going into too much detail, they look like\n\n$\\textrm{AIC}/n = -2\\textrm{log-likelihood}/n + 2\\textrm{df}/n$\n\n$\\textrm{BIC}/n = -2\\textrm{log-likelihood}/n + 2\\log(n)\\textrm{df}/n$\n\n. . .\n\nIn the case of a linear model with Gaussian errors and $p$ predictors\n\n\\begin{aligned}\n\\textrm{AIC}/n &= \\log(2\\pi) + \\log(RSS/n) + 2(p+1)/n \\\\\n&\\propto \\log(RSS) + 2(p+1)/n\n\\end{aligned}\n\n( $p+1$ because of the unknown variance, intercept included in $p$ or not)\n\n. . .\n\n::: callout-important\nUnfortunately, different books/software/notes define these differently. Even different R packages. This is __super annoying__. \n\nForms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here.\n:::\n\n\n\n## Over-fitting vs. Under-fitting\n\n\n> Over-fitting means estimating a really complicated function when you don't have enough data.\n\n\nThis is likely a **low-bias / high-variance** situation.\n\n\n> Under-fitting means estimating a really simple function when you have lots of data. \n\n\nThis is likely a **high-bias / low-variance** situation.\n\nBoth of these outcomes are bad (they have high risk $=$ big $R_n$ ).\n\nThe best way to avoid them is to use a reasonable estimate of _prediction risk_ to choose how complicated your model should be.\n\n\n## Commentary\n\n- When comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV. 
\n - In some special cases, AIC = Cp = SURE $\\approx$ LOO-CV\n- CV is generic, easy, and doesn't depend on unknowns.\n - But requires refitting, and nontrivial for discrete predictors, time series, etc.\n- GCV tends to choose \"dense\" models.\n- Theory says AIC chooses \"best predicting model\" asymptotically.\n- Theory says BIC chooses \"true model\" asymptotically, tends to select fewer predictors.\n- Technical: CV (or validation set) is estimating error on \n[new data]{.secondary}, unseen $(X_0, Y_0)$; AIC / CP are estimating error on [new Y]{.secondary} at the observed $x_1,\\ldots,x_n$. This is subtle.\n\n::: aside\nFor more information: see [ESL] Chapter 7.\nThis material is more challenging than the level of this course, and is easily and often misunderstood.\n:::\n\n\n\n# My recommendation: \n\n**Use CV.**\n\n\n## A few more caveats\n\nTempting to \"just compare\" risk estimates from vastly different models. \n\nFor example, \n\n* different transformations of the predictors, \n\n* different transformations of the response, \n\n* Poisson likelihood vs. Gaussian likelihood in `glm()`\n\n\n[This is not always justified.]{.secondary}\n\n1. The \"high-level intuition\" is for \"nested\" models.\n\n1. Different likelihoods aren't comparable.\n\n1. 
Residuals / response variables on different scales aren't directly comparable.\n\n# Next time ...\n\nGreedy selection\n", + "markdown": "---\nlecture: \"06 Information Criteria\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 24 September 2024\n\n\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n## LOO-CV\n\n- Train $\\hat f_1$ on all but data point 1, calculate $\\tilde R_1 = \\ell(Y_1, \\hat f_1(X_1))$.\n- Do the same for each data point $i$, calculate $\\tilde R_i$\n- Estimate $R_n \\approx \\hat R_n = \\frac{1}{n}\\sum_{i=1}^n \\tilde R_i$\n\nHas low bias 🎉 and (probably) low variance 🎉 \n\n**Not often possible to use**: requires training $n$ models 🤮\n\n## LOO-CV: Math to 
the rescue!\n\nConsider models where predictions are a **linear function** of the training responses, i.e.,\n\n$$ \\hat{\\mathbf y} = {\\mathbf H} {\\mathbf y} $$\n\nwhere we collected terms into matrices and vectors (${\\mathbf h_i}$ can be any functions):\n\n- $\\hat{\\mathbf y} = \\begin{bmatrix} \\hat Y_1 & \\cdots & \\hat Y_n \\end{bmatrix}^\\top \\in \\mathbb R^{n}$\n- ${\\mathbf y} = \\begin{bmatrix} Y_1 & \\cdots & Y_n \\end{bmatrix}^\\top \\in \\mathbb R^{n}$\n- $\\mathbf H = \\begin{bmatrix} \\mathbf h_1(X_{1:n}) & \\cdots & \\mathbf h_n(X_{1:n}) \\end{bmatrix}^\\top \\in \\mathbb R^{n \\times n}$\n\n. . .\n\nFor example, OLS:\n\n$$ \\hat{\\mathbf y} = {\\mathbf X} \\hat \\beta, \\qquad \\hat\\beta = (\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X^\\top \\mathbf y $$\n\nBy inspection $\\mathbf H = \\mathbf X (\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X^\\top$\n\n\n## LOO-CV: Math to the rescue!\n\nFor models where predictions are a **linear function** of the training responses\\*,\n\n**LOO-CV has a closed-form expression!** Just need to fit *once*:\n\n$$\\mbox{LOO-CV} \\,\\, \\hat R_n = \\frac{1}{n} \\sum_{i=1}^n \\frac{(Y_i -\\widehat{Y}_i)^2}{(1-{\\boldsymbol H}_{ii})^2}.$$\n\n- Numerator is the _squared residual_ (loss) for training point $i$.\n- Denominator weights each residual by a factor involving the *diagonal of $\\mathbf H$* \n- $H_{ii}$ are *leverage/hat values*: tell you what happens when moving data point $i$ a bit\n\n\\*: plus some technicalities \n\n. . .\n\n:::callout-tip\nDeriving this sucks. I wouldn't recommend doing it yourself. 
\n:::\n\n## Computing the formula\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncv_nice <- function(mdl) mean((residuals(mdl) / (1 - hatvalues(mdl)))^2)\n```\n:::\n\n\n\n\n## What happens when we can't use the formula?\n\n(And can we get a better intuition about what's going on?)\n\n$$\n\\hat{\\mathbf y} = \\mathbf H \\mathbf y,\n\\qquad\n\\mbox{LOO-CV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{(Y_i -\\widehat{Y}_i)^2}{(1-{\\boldsymbol H}_{ii})^2}\n$$\n\nLet's look at OLS again...\n$$ \\hat Y = X \\hat \\beta, \\qquad \\hat \\beta = (\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X^\\top \\mathbf y $$\n\nThis implies that $\\mathbf H = \\mathbf X (\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X^\\top$\n\n- One really nice property of $\\mathbf H$ is that $\\tr{\\mathbf H} = p$ (where $X_n \\in \\R^p$). [(Why?)]{.secondary}\n\n## Generalizing the LOO-CV formula\n\nLet's call $\\tr{\\mathbf H} = p$ the _degrees-of-freedom_ (or just _df_) of our OLS estimator.\\\n[(Intuition: we have $p$ parameters to fit, or $p$ \"degrees of freedom\")]{.secondary}\n\n\\\n**Idea:** in our LOO-CV formula, approximate each ${\\mathbf H}_{ii}$ with the average $\\frac 1 n \\sum_{i=1}^n {\\mathbf H}_{ii}$.\\\n\\\nThen...\n\n$$\n\\mbox{LOO-CV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{(y_i -\\widehat{y}_i)^2}{(1-{\\mathbf H}_{ii})^2} \\approx \\frac{\\text{MSE}}{(1-\\text{df}/n)^2} \\triangleq \\text{GCV}\n$$\n\nGCV stands for **Generalized CV** estimate\n\n\n\n\n## Generalized CV\n\n$$\\textrm{GCV} = \\frac{\\textrm{MSE}}{(1-\\textrm{df}/n)^2}$$\n\nWe can use this formula for models that aren't of the form $\\widehat{y}_i = \\boldsymbol h_i(\\mathbf{X})^\\top \\mathbf{y}$.\\\n(Assuming we have some model-specific formula for estimating $\\textrm{df}$.)\n\n. . 
.\n\n### Observations\n\n- GCV > training error (Why?)\n- What happens as $n$ increases?\n- What happens as $\\text{df}$ ($p$ in our OLS model) increases?\n\n\n## Mallows $C_p$\n\nLet's see if we can generalize risk estimators from OLS ($Y \\sim \\mathcal{N}(X^T\\beta, \\sigma^2)$) in other ways. \n\nConsider the *estimation risk* of estimating $\\mu_i = X_i^T\\beta$ with $\\hat Y_i = X_i^T\\hat\\beta$: \n\n$$R_n = E\\left[\\frac{1}{n}\\sum_{i=1}^n (\\hat Y_i - \\mu_i)^2\\right]$$\n\nUsing the usual decomposition tricks:\n\n$$\nR_n= \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} \n= \\underbrace{\\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2}}_{\\text{train MSE}} -\n\\underbrace{\\sigma^2}_{\\text{noise}} +\n\\underbrace{\\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}}_{\\text{???}}\n$$\n\n\n\n## Mallows $C_p$\n\n$$\\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} =\n\\underbrace{\\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2}}_{\\text{training error}} -\n\\underbrace{\\sigma^2}_{\\text{noise}} +\n\\underbrace{\\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}}_{\\text{???}}\n$$\n\nRecall that $\\widehat{\\mathbf{Y}} = \\mathbf H \\mathbf{Y}$ for some matrix $\\mathbf H$,\n\n$\\sum\\Cov{Y_i}{\\widehat Y_i} = \\tr{\\Cov{\\mathbf{Y}}{\\mathbf H \\mathbf{Y}}} = \\sigma^2 \\textrm{tr}(\\mathbf H)$\n\n\nThis gives _Mallows $C_p$_ aka _Stein's Unbiased Risk Estimator_:\n\n$$ C_p = \\text{MSE} + 2\\hat{\\sigma}^2 \\: \\textrm{df}/n $$\n\n## Mallows $C_p$\n\n$$ C_p = \\text{MSE} + 2\\hat{\\sigma}^2 \\: \\textrm{df}/n$$\n(We derived it for the OLS model, but again it can be generalized to other models.)\n\n::: callout-important\nUnfortunately, $\\text{df}$ may be difficult or impossible to calculate for complicated\nprediction methods. But one can often estimate it well. 
This idea is beyond\nthe level of this course.\n:::\n\n### Observations\n- $C_p$ > training error\n- What happens as $n$ increases?\n- What happens as $\\text{df}$ ($p$ in our OLS model) increases?\n- What happens as the irreducible noise increases?\n\n\n## AIC and BIC\n\nThese have a very similar flavor to $C_p$, but their genesis is different.\n\nWithout going into too much detail, they look like\n\n$\\textrm{AIC}/n = -2\\textrm{log-likelihood}/n + 2\\textrm{df}/n$\n\n$\\textrm{BIC}/n = -2\\textrm{log-likelihood}/n + \\log(n)\\textrm{df}/n$\n\n. . .\n\nIn the case of a linear model with Gaussian errors and $p$ predictors\n\n\\begin{aligned}\n\\textrm{AIC}/n &= \\log(2\\pi) + \\log(RSS/n) + 2(p+1)/n \\\\\n&\\propto \\log(RSS) + 2(p+1)/n\n\\end{aligned}\n\n( $p+1$ because of the unknown variance, intercept included in $p$ or not)\n\n. . .\n\n::: callout-important\nUnfortunately, different books/software/notes define these differently. Even different R packages. This is __super annoying__. \n\nForms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here.\n:::\n\n\n\n## Over-fitting vs. Under-fitting\n\n\n> Over-fitting means estimating a really complicated function when you don't have enough data.\n\n\nThis is likely a **low-bias / high-variance** situation.\n\n\n> Under-fitting means estimating a really simple function when you have lots of data. \n\n\nThis is likely a **high-bias / low-variance** situation.\n\nBoth of these outcomes are bad (they have high risk $=$ big $R_n$ ).\n\nThe best way to avoid them is to use a reasonable estimate of _prediction risk_ to choose how complicated your model should be.\n\n\n## Commentary\n\n- When comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV. 
\n - In some special cases, AIC = Cp = SURE $\\approx$ LOO-CV\n- CV is generic, easy, and doesn't depend on unknowns.\n - But requires refitting, and nontrivial for discrete predictors, time series, etc.\n- GCV tends to choose \"dense\" models.\n- Theory says AIC chooses \"best predicting model\" asymptotically.\n- Theory says BIC chooses \"true model\" asymptotically, tends to select fewer predictors.\n- Technical: CV (or validation set) is estimating error on \n[new data]{.secondary}, unseen $(X_0, Y_0)$; AIC / Cp are estimating error on [new Y]{.secondary} at the observed $x_1,\\ldots,x_n$. This is subtle.\n\n::: aside\nFor more information: see [ESL] Chapter 7.\nThis material is more challenging than the level of this course, and is easily and often misunderstood.\n:::\n\n\n\n# My recommendation: \n\n**Use CV.**\n\n\n## A few more caveats\n\nTempting to \"just compare\" risk estimates from vastly different models. \n\nFor example, \n\n* different transformations of the predictors, \n\n* different transformations of the response, \n\n* Poisson likelihood vs. Gaussian likelihood in `glm()`\n\n\n[This is not always justified.]{.secondary}\n\n1. The \"high-level intuition\" is for \"nested\" models.\n\n1. Different likelihoods aren't comparable.\n\n1. 
Residuals / response variables on different scales aren't directly comparable.\n\n# Next time ...\n\nGreedy selection\n", "supporting": [ "06-information-criteria_files" ], diff --git a/schedule/slides/06-information-criteria.qmd b/schedule/slides/06-information-criteria.qmd index f0e78d6..9057f58 100644 --- a/schedule/slides/06-information-criteria.qmd +++ b/schedule/slides/06-information-criteria.qmd @@ -27,7 +27,7 @@ where we collected terms into matrices and vectors (${\mathbf h_i}$ can be any f - $\hat{\mathbf y} = \begin{bmatrix} \hat Y_1 & \cdots & \hat Y_n \end{bmatrix}^\top \in \mathbb R^{n}$ - ${\mathbf y} = \begin{bmatrix} Y_1 & \cdots & Y_n \end{bmatrix}^\top \in \mathbb R^{n}$ -- $\mathbf H = \begin{bmatrix} \mathbf h_1(X_1) & \cdots & \mathbf h_n(X_n) \end{bmatrix}^\top \in \mathbb R^{n \times n}$ +- $\mathbf H = \begin{bmatrix} \mathbf h_1(X_{1:n}) & \cdots & \mathbf h_n(X_{1:n}) \end{bmatrix}^\top \in \mathbb R^{n \times n}$ . . .