diff --git a/_freeze/schedule/slides/03-regression-function/execute-results/html.json b/_freeze/schedule/slides/03-regression-function/execute-results/html.json index 9efcc01..e173a24 100644 --- a/_freeze/schedule/slides/03-regression-function/execute-results/html.json +++ b/_freeze/schedule/slides/03-regression-function/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "b8e89c3f4ea98d5a08e37ab46a3fed40", + "hash": "49396cb836d82b52f0f82e92a5e5e6a1", "result": { "engine": "knitr", - "markdown": "---\nlecture: \"03 The regression function\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 14 September 2024\n\n\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Mean squared error (MSE)\n\nLast time... [Ordinary Least Squares]{.secondary}\n\n$$\\widehat\\beta = \\argmin_\\beta \\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2.$$\n\n\"Find the $\\beta$ which minimizes the sum of squared errors.\"\n\n$$\\widehat\\beta = \\arg\\min_\\beta \\frac{1}{n}\\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2.$$\n\n\"Find the beta which minimizes the mean squared error.\"\n\n\n## Forget all that...\n\nThat's \"stuff that seems like a good idea\"\n\nAnd it is for many reasons\n\nThis class is about those reasons, and the \"statistics\" behind it\n\n. . . \n\n


\n\n#### Methods for \"Statistical\" Learning\n\n\nStarts with \"what is a model?\"\n\n## What is a model?\n\nIn statistics, \"model\" has a mathematical meaning.\n\nDistinct from \"algorithm\" or \"procedure\".\n\nDefining a model often leads to a procedure/algorithm with good properties.\n\nSometimes procedure/algorithm $\\Rightarrow$ a specific model.\n\n> Statistics (the field) tells me how to understand when different procedures\n> are desirable and the mathematical guarantees that they satisfy.\n\nWhen are certain models appropriate?\n\n> One definition of \"Statistical Learning\" is the \"statistics behind the procedure\".\n\n## Statistical models 101\n\nWe observe data $Z_1,\\ Z_2,\\ \\ldots,\\ Z_n$ generated by some probability\ndistribution $P$. We want to use the data to learn about $P$. \n\n> A [statistical model]{.secondary} is a set of distributions $\\mathcal{P}$.\n\n\nSome examples:\n\n 1. $\\P = \\{P : P(Z=1)=p,\\ P(Z=0)=1-p, 0 \\leq p \\leq 1\\}$.\n 2. $\\P = \\{P : Y | X \\sim N(X^\\top\\beta,\\sigma^2),\\ \\beta \\in \\R^p,\\ \\sigma>0\\}$ (here $Z = (Y,X)$)\n 2. $\\P = \\{P \\mbox{ given by any CDF }F\\}$.\n 3. $\\P = \\{P : E[Y | X] = f(X) \\mbox{ for some smooth } f: \\R^p \\rightarrow \\R\\}$ (here $Z = (Y,X)$)\n \n## Statistical models \n\nWe want to use the data to [select]{.secondary} a distribution $P$ that probably \ngenerated the data.\n\n. . . \n\n#### My model:\n\n$$\n\\P = \\{P: P(z=1)=p,\\ P(z=0)=1-p,\\ 0 < p < 1 \\}\n$$\n \n* To completely characterize $P$, I just need to estimate $p$.\n\n* Need to assume that $P \\in \\P$. \n\n* This assumption is mostly empty: _need independent, can't see $z=12$._\n\n## Statistical models \n\nWe observe data $(Y, X)$ generated by some probability\ndistribution $P$ on $\\R \\times \\R^p$. We want to use the data to learn about $P$. \n\n. . . \n\n#### My model\n\n$$\n\\P = \\{P : Y | X \\sim N(X^\\top\\beta,\\ \\sigma^2), \\beta \\in \\R^p, \\sigma>0\\}.\n$$\n\n \n* To completely characterize *the $Y|X$-conditional of* $P$, I just need to estimate $\\beta$ and $\\sigma$.\n - I'm not interested in learning *the $X$-marginal of $P$*\n\n* Need to assume that $P\\in\\P$.\n\n* This time, I have to assume a lot more: \n_(conditional) linearity, independence, conditional Gaussian noise,_\n_no ignored variables, no collinearity, etc._\n\n\n## Statistical models, unfamiliar example\n\nWe observe data $Z \\in \\R$ generated by some probability\ndistribution $P$. We want to use the data to learn about $P$. \n\n#### My model\n\n$$\n\\P = \\{P : Z \\textrm{ has a density function } f \\}.\n$$\n\n \n* To completely characterize $P$, I need to estimate $f$.\n\n* In fact, we can't hope to do this.\n\n\n[Revised Model 1]{.secondary} - $\\P=\\{ Z \\textrm{ has a density function } f : \\int (f'')^2 dx < M \\}$\n\n[Revised Model 2]{.secondary} - $\\P=\\{ Z \\textrm{ has a density function } f : \\int (f'')^2 dx < K < M \\}$\n\n[Revised Model 3]{.secondary} - $\\P=\\{ Z \\textrm{ has a density function } f : \\int |f'| dx < M \\}$\n\n* Each of these suggests different ways of estimating $f$\n\n\n## Assumption Lean Regression\n\nImagine $Z = (Y, \\mathbf{X}) \\sim P$ with $Y \\in \\R$ and $\\mathbf{X} = (1, X_1, \\ldots, X_p)^\\top$.\n\nWe are interested in the _conditional_ distribution $P_{Y|\\mathbf{X}}$\n\nSuppose we think that there is _some_ function of interest which relates $Y$ and $X$.\n\nLet's call this function $\\mu(\\mathbf{X})$ for the moment. How do we estimate $\\mu$? What is $\\mu$?\n\n::: aside\nSee [Berk et al. 
_Assumption Lean Regression_](https://doi.org/10.1080/00031305.2019.1592781).\n:::\n\n\n. . . \n\nTo make this precise, we \n\n* Have a model $\\P$.\n* Need to define a \"good\" functional $\\mu$.\n* Let's loosely define \"good\" as\n\n> Given a new (random) $Z$, $\\mu(\\mathbf{X})$ is \"close\" to $Y$.\n\n## Evaluating \"close\"\n\nWe need more functions.\n \nChoose some _loss function_ $\\ell$ that measures how close $\\mu$ and $Y$ are.\n\n\n::: flex\n\n::: w-50\n\n* _Squared-error:_ \n$\\ell(y,\\ \\mu) = (y-\\mu)^2$\n\n* _Absolute-error:_ \n$\\ell(y,\\ \\mu) = |y-\\mu|$\n\n* _Zero-One:_ \n$\\ell(y,\\ \\mu) = I(y\\neq\\mu)=\\begin{cases} 0 & y=\\mu\\\\1 & \\mbox{else}\\end{cases}$\n\n* _Cauchy:_ \n$\\ell(y,\\ \\mu) = \\log(1 + (y - \\mu)^2)$\n\n:::\n\n::: w-50\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nggplot() +\n xlim(-2, 2) +\n geom_function(fun = ~log(1+.x^2), colour = 'purple', linewidth = 2) +\n geom_function(fun = ~.x^2, colour = tertiary, linewidth = 2) +\n geom_function(fun = ~abs(.x), colour = primary, linewidth = 2) +\n geom_line(\n data = tibble(x = seq(-2, 2, length.out = 100), y = as.numeric(x != 0)), \n aes(x, y), colour = orange, linewidth = 2) +\n geom_point(data = tibble(x = 0, y = 0), aes(x, y), \n colour = orange, pch = 16, size = 3) +\n ylab(bquote(\"\\u2113\" * (y - mu))) + xlab(bquote(y - mu))\n```\n\n::: {.cell-output-display}\n![](03-regression-function_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n:::\n:::\n\n\n## Start with (Expected) Squared Error\n\nLet's try to minimize the _expected_ squared error (MSE).\n\nClaim: $\\mu(X) = \\Expect{Y\\ \\vert\\ X}$ minimizes MSE.\n\nThat is, for any $r(X)$, $\\Expect{(Y - \\mu(X))^2} \\leq \\Expect{(Y-r(X))^2}$.\n\n\n. . .\n\nProof of Claim:\n\n\n\\begin{aligned}\n\\Expect{(Y-r(X))^2} \n&= \\Expect{(Y- \\mu(X) + \\mu(X) - r(X))^2}\\\\\n&= \\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} \\\\\n&\\quad +2\\Expect{(Y- \\mu(X))(\\mu(X) - r(X))}\\\\\n&=\\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} \\\\\n&\\quad +2(\\mu(X) - r(X))\\Expect{(Y- \\mu(X))}\\\\\n&=\\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} + 0\\\\\n&\\geq \\Expect{(Y- \\mu(X))^2}\n\\end{aligned}\n\n\n\n\n## The regression function\n\nSometimes people call this solution:\n\n\n$$\\mu(X) = \\Expect{Y \\ \\vert\\ X}$$\n\n\nthe regression function. (But don't forget that it depended on $\\ell$.)\n\nIf we [assume]{.secondary} that $\\mu(x) = \\Expect{Y \\ \\vert\\ X=x} = x^\\top \\beta$, then we get back exactly OLS.\n\n. . .\n\nBut why should we assume $\\mu(x) = x^\\top \\beta$?\n\n\n## Brief aside {background-color=\"#97D4E9\"}\n\nSome notation / terminology\n\n* \"Hats\" on things mean \"estimates\", so $\\widehat{\\mu}$ is an estimate of $\\mu$\n\n* Parameters are \"properties of the model\", so $f_X(x)$ or $\\mu$ or $\\Var{Y}$\n\n* Random variables like $X$, $Y$, $Z$ may eventually become data, $x$, $y$, $z$, once observed.\n\n* \"Estimating\" means \"using observations to estimate _parameters_\"\n\n* \"Predicting\" means \"using observations to predict _future data_\"\n\n* Often, there is a parameter whose estimate will provide a prediction.\n\n. . .\n\n\n> This last point can lead to confusion.\n\n\n\n## The regression function\n\n\nIn mathematics: $\\mu(x) = \\Expect{Y \\ \\vert\\ X=x}$.\n\nIn words: \n\n[Regression with squared-error loss is really about estimating the (conditional) mean.]{.secondary}\n\n1. 
If $Y\\sim \\textrm{N}(\\mu,\\ 1)$, our best guess for a [new]{.secondary} $Y$ is $\\mu$. \n\n2. For regression, we let the mean $(\\mu)$ [depend]{.secondary} on $X$. \n3. Think of $Y\\sim \\textrm{N}(\\mu(X),\\ 1)$, then conditional on $X=x$, our best guess for a [new]{.secondary} $Y$ is $\\mu(x)$ \n\n[whatever this function $\\mu$ is]\n\n\n## Anything strange?\n\nFor any two variables $Y$ and $X$, we can [always]{.secondary} write\n\n$$Y = E[Y\\given X] + (Y - E[Y\\given X]) = \\mu(X) + \\eta(X)$$\n\nsuch that $\\Expect{\\eta(X)}=0$.\n\n. . .\n\n* Suppose, $\\mu(X)=\\mu_0$ (constant in $X$), are $Y$ and $X$ independent?\n\n. . .\n\n* Suppose $Y$ and $X$ are independent, is $\\mu(X)=\\mu_0$?\n\n. . .\n\n* For more practice on this see the \n[Fun Worksheet on Theory](../handouts/worksheet.pdf) and\n[solutions](../handouts/worksheet-solution.pdf)\n\n* In this course, I do not expect you to be able to create this math, but understanding and explaining it [is]{.secondary} important.\n\n\n# Making predictions\n\n\n\n## What do we mean by good predictions?\n\n\nWe make observations and then attempt to \"predict\" new, unobserved data.\n\nSometimes this is the same as estimating the (conditional) mean. \n \nMostly, we observe $(y_1,x_1),\\ \\ldots,\\ (y_n,x_n)$, and we want some way to predict $Y$ from $X$.\n\n\n\n## Expected test MSE \n\n\nFor _regression_ applications, we will use squared-error loss:\n\n$R_n(\\widehat{\\mu}) = \\Expect{(Y-\\widehat{\\mu}(X))^2}$\n\n. . .\n\nI'm giving this a name, $R_n$ for ease. \n\nDifferent than text.\n\nThis is _expected test MSE_.\n\n\n\n## Example: Estimating/Predicting the (conditional) mean\n\n\nSuppose we know that we want to predict a quantity $Y$, \n\nwhere $\\Expect{Y}= \\mu \\in \\mathbb{R}$ and $\\Var{Y} = 1$. \n\n\nOur data is $\\{y_1,\\ldots,y_n\\}$\n\nClaim: We want to estimate $\\mu$. \n\n. . . \n\nWhy?\n\n\n## Estimating the mean\n\n* Let $\\widehat{Y}=\\overline{Y}_n$ be the sample mean. \n* We can ask about the _estimation risk_ (since we're estimating $\\mu$):\n\n::: flex\n\n::: w-50\n \n\\begin{aligned}\n E[(\\overline{Y}_n-\\mu)^2]\n &= E[\\overline{Y}_n^2]\n -2\\mu E[\\overline{Y}_n] + \\mu^2 \\\\ \n &= \\mu^2 + \\frac{1}{n} - 2\\mu^2 +\n \\mu^2\\\\ &= \\frac{1}{n}\n\\end{aligned}\n\n:::\n\n::: w-50\n\n[Useful trick]{.primary}\n\nFor any $Z$,\n\n$\\Var{Z} = \\Expect{Z^2} - \\Expect{Z}^2$.\n\nTherefore:\n\n$\\Expect{Z^2} = \\Var{Z} + \\Expect{Z}^2$.\n\n:::\n:::\n \n\n\n## Predicting new Y's\n\n* Let $\\widehat{Y}=\\overline{Y}_n$ be the sample mean. \n* What is the _prediction risk_ of $\\overline{Y}$? 
\n\n\n\n::: flex\n::: w-50\n\\begin{aligned}\n R_n(\\overline{Y}_n) \n &= \\E[(\\overline{Y}_n-Y)^2]\\\\ \n &= \\E[\\overline{Y}_{n}^{2}] -2\\E[\\overline{Y}_n Y] + \\E[Y^2] \\\\ \n &= \\mu^2 + \\frac{1}{n} - 2\\mu^2 + \\mu^2 + 1 \\\\ \n &= 1 + \\frac{1}{n} \n\\end{aligned}\n\n:::\n::: w-50\n\n[Tricks:]{.primary}\n\nUsed the variance thing again.\n\nIf $X$ and $Z$ are independent, then $\\Expect{XZ} = \\Expect{X}\\Expect{Z}$\n\n:::\n:::\n\n## Predicting new Y's\n\n \n* What is the prediction risk of guessing $Y=0$?\n\n* You can probably guess that this is a stupid idea.\n\n* Let's show why it's stupid.\n\n\\begin{aligned}\n R_n(0) &= \\E[(0-Y)^2] = 1 + \\mu^2\n\\end{aligned}\n\n\n\n## Predicting new Y's\n\n\n* What is the prediction risk of guessing $Y=\\mu$?\n\n\n* This is a great idea, but we don't know $\\mu$.\n\n* Let's see what happens anyway.\n\n\\begin{aligned}\n R_n(\\mu) &= \\E[(Y-\\mu)^2]= 1\n\\end{aligned}\n\n\n\n## Risk relations\n\n \nPrediction risk: $R_n(\\overline{Y}_n) = 1 + \\frac{1}{n}$ \n\nEstimation risk: $E[(\\overline{Y}_n - \\mu)^2] = \\frac{1}{n}$ \n\nThere is actually a nice interpretation here:\n\n1. The common $1/n$ term is $\\Var{\\overline{Y}_n}$ \n2. The extra factor of $1$ in the prediction risk is _irreducible error_\n\n * $Y$ is a random variable, and hence noisy. \n * We can never eliminate it's intrinsic variance. \n * In other words, even if we knew $\\mu$, we could never get closer than $1$, on average.\n\nIntuitively, $\\overline{Y}_n$ is the obvious thing to do.\n \nBut what about unintuitive things...\n\n\n# Next time...\n\nTrading bias and variance\n", + "markdown": "---\nlecture: \"03 The regression function\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 16 September 2024\n\n\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Mean squared error (MSE)\n\nLast time... 
[Ordinary Least Squares]{.secondary}\n\n$$\\widehat\\beta = \\argmin_\\beta \\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2.$$\n\n\"Find the $\\beta$ which minimizes the sum of squared errors.\"\n\n$$\\widehat\\beta = \\arg\\min_\\beta \\frac{1}{n}\\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2.$$\n\n\"Find the beta which minimizes the mean squared error.\"\n\n\n## Forget all that...\n\nThat's \"stuff that seems like a good idea\"\n\nAnd it is for many reasons\n\nThis class is about those reasons, and the \"statistics\" behind it\n\n. . . \n\n


\n\n#### Methods for \"Statistical\" Learning\n\n\nStarts with \"what is a model?\"\n\n## What is a model?\n\nIn statistics, \"model\" has a mathematical meaning.\n\nDistinct from \"algorithm\" or \"procedure\".\n\nDefining a model often leads to a procedure/algorithm with good properties.\n\nSometimes procedure/algorithm $\\Rightarrow$ a specific model.\n\n> Statistics (the field) tells me how to understand when different procedures\n> are desirable and the mathematical guarantees that they satisfy.\n\nWhen are certain models appropriate?\n\n> One definition of \"Statistical Learning\" is the \"statistics behind the procedure\".\n\n## Statistical models 101\n\nWe observe data $Z_1,\\ Z_2,\\ \\ldots,\\ Z_n$ generated by some probability\ndistribution $P$. We want to use the data to learn about $P$. \n\n> A [statistical model]{.secondary} is a set of distributions $\\mathcal{P}$.\n\n\nSome examples:\n\n 1. $\\P = \\{P : P(Z=1)=p,\\ P(Z=0)=1-p, 0 \\leq p \\leq 1\\}$.\n 2. $\\P = \\{P : Y | X \\sim N(X^\\top\\beta,\\sigma^2),\\ \\beta \\in \\R^p,\\ \\sigma>0\\}$ (here $Z = (Y,X)$)\n 2. $\\P = \\{P \\mbox{ given by any CDF }F\\}$.\n 3. $\\P = \\{P : E[Y | X] = f(X) \\mbox{ for some smooth } f: \\R^p \\rightarrow \\R\\}$ (here $Z = (Y,X)$)\n \n## Statistical models \n\nWe want to use the data to [select]{.secondary} a distribution $P$ that probably \ngenerated the data.\n\n. . . \n\n#### My model:\n\n$$\n\\P = \\{P: P(z=1)=p,\\ P(z=0)=1-p,\\ 0 < p < 1 \\}\n$$\n \n* To completely characterize $P$, I just need to estimate $p$.\n\n* Need to assume that $P \\in \\P$. \n\n* This assumption is mostly empty: _need independent, can't see $z=12$._\n\n## Statistical models \n\nWe observe data $(Y, X)$ generated by some probability\ndistribution $P$ on $\\R \\times \\R^p$. We want to use the data to learn about $P$. \n\n. . . \n\n#### My model\n\n$$\n\\P = \\{P : Y | X \\sim N(X^\\top\\beta,\\ \\sigma^2), \\beta \\in \\R^p, \\sigma>0\\}.\n$$\n\n \n* To completely characterize *the $Y|X$-conditional of* $P$, I just need to estimate $\\beta$ and $\\sigma$.\n - I'm not interested in learning *the $X$-marginal of $P$*\n\n* Need to assume that $P\\in\\P$.\n\n* This time, I have to assume a lot more: \n_(conditional) linearity, independence, conditional Gaussian noise,_\n_no ignored variables, no collinearity, etc._\n\n\n## Statistical models, unfamiliar example\n\nWe observe data $Z \\in \\R$ generated by some probability\ndistribution $P$. We want to use the data to learn about $P$. \n\n#### My model\n\n$$\n\\P = \\{P : Z \\textrm{ has a density function } f \\}.\n$$\n\n \n* To completely characterize $P$, I need to estimate $f$.\n\n* In fact, we can't hope to do this.\n\n\n[Revised Model 1]{.secondary} - $\\P=\\{ Z \\textrm{ has a density function } f : \\int (f'')^2 dx < M \\}$\n\n[Revised Model 2]{.secondary} - $\\P=\\{ Z \\textrm{ has a density function } f : \\int (f'')^2 dx < K < M \\}$\n\n[Revised Model 3]{.secondary} - $\\P=\\{ Z \\textrm{ has a density function } f : \\int |f'| dx < M \\}$\n\n* Each of these suggests different ways of estimating $f$\n\n\n## Assumption Lean Regression\n\nImagine $Z = (Y, \\mathbf{X}) \\sim P$ with $Y \\in \\R$ and $\\mathbf{X} = (1, X_1, \\ldots, X_p)^\\top$.\n\nWe are interested in the _conditional_ distribution $P_{Y|\\mathbf{X}}$\n\nSuppose we think that there is _some_ function of interest which relates $Y$ and $X$.\n\nLet's call this function $\\mu(\\mathbf{X})$ for the moment. How do we estimate $\\mu$? What is $\\mu$?\n\n::: aside\nSee [Berk et al. 
_Assumption Lean Regression_](https://doi.org/10.1080/00031305.2019.1592781).\n:::\n\n\n. . . \n\nTo make this precise, we \n\n* Have a model $\\P$.\n* Need to define a \"good\" functional $\\mu$.\n* Let's loosely define \"good\" as\n\n> Given a new (random) $Z$, $\\mu(\\mathbf{X})$ is \"close\" to $Y$.\n\n## Evaluating \"close\"\n\nWe need more functions.\n \nChoose some _loss function_ $\\ell$ that measures how close $\\mu$ and $Y$ are.\n\n\n::: flex\n\n::: w-50\n\n* _Squared-error:_ \n$\\ell(y,\\ \\mu) = (y-\\mu)^2$\n\n* _Absolute-error:_ \n$\\ell(y,\\ \\mu) = |y-\\mu|$\n\n* _Zero-One:_ \n$\\ell(y,\\ \\mu) = I(y\\neq\\mu)=\\begin{cases} 0 & y=\\mu\\\\1 & \\mbox{else}\\end{cases}$\n\n* _Cauchy:_ \n$\\ell(y,\\ \\mu) = \\log(1 + (y - \\mu)^2)$\n\n:::\n\n::: w-50\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nggplot() +\n xlim(-2, 2) +\n geom_function(fun = ~log(1+.x^2), colour = 'purple', linewidth = 2) +\n geom_function(fun = ~.x^2, colour = tertiary, linewidth = 2) +\n geom_function(fun = ~abs(.x), colour = primary, linewidth = 2) +\n geom_line(\n data = tibble(x = seq(-2, 2, length.out = 100), y = as.numeric(x != 0)), \n aes(x, y), colour = orange, linewidth = 2) +\n geom_point(data = tibble(x = 0, y = 0), aes(x, y), \n colour = orange, pch = 16, size = 3) +\n ylab(bquote(\"\\u2113\" * (y - mu))) + xlab(bquote(y - mu))\n```\n\n::: {.cell-output-display}\n![](03-regression-function_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n:::\n:::\n\n\n## Start with (Expected) Squared Error\n\nLet's try to minimize the _expected_ squared error (MSE).\n\nClaim: $\\mu(X) = \\Expect{Y\\ \\vert\\ X}$ minimizes MSE.\n\nThat is, for any $r(X)$, $\\Expect{(Y - \\mu(X))^2} \\leq \\Expect{(Y-r(X))^2}$.\n\n\n. . .\n\nProof of Claim:\n\n\n\\begin{aligned}\n\\Expect{(Y-r(X))^2} \n&= \\Expect{(Y- \\mu(X) + \\mu(X) - r(X))^2}\\\\\n&= \\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} \\\\\n&\\quad +2\\Expect{(Y- \\mu(X))(\\mu(X) - r(X))}\\\\\n&=\\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} \\\\\n&\\quad +2(\\mu(X) - r(X))\\Expect{(Y- \\mu(X))}\\\\\n&=\\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} + 0\\\\\n&\\geq \\Expect{(Y- \\mu(X))^2}\n\\end{aligned}\n\n\n\n\n## The regression function\n\nSometimes people call this solution:\n\n\n$$\\mu(X) = \\Expect{Y \\ \\vert\\ X}$$\n\n\nthe regression function. (But don't forget that it depended on $\\ell$.)\n\nIf we [assume]{.secondary} that $\\mu(x) = \\Expect{Y \\ \\vert\\ X=x} = x^\\top \\beta$, then we get back exactly OLS.\n\n. . .\n\nBut why should we assume $\\mu(x) = x^\\top \\beta$?\n\n\n## Brief aside {background-color=\"#97D4E9\"}\n\nSome notation / terminology\n\n* \"Hats\" on things mean \"estimates\", so $\\widehat{\\mu}$ is an estimate of $\\mu$\n\n* Parameters are \"properties of the model\", so $f_X(x)$ or $\\mu$ or $\\Var{Y}$\n\n* Random variables like $X$, $Y$, $Z$ may eventually become data, $x$, $y$, $z$, once observed.\n\n* \"Estimating\" means \"using observations to estimate _parameters_\"\n\n* \"Predicting\" means \"using observations to predict _future data_\"\n\n* Often, there is a parameter whose estimate will provide a prediction.\n\n. . .\n\n\n> This last point can lead to confusion.\n\n\n\n## The regression function\n\n\nIn mathematics: $\\mu(x) = \\Expect{Y \\ \\vert\\ X=x}$.\n\nIn words: \n\n[Regression with squared-error loss is really about estimating the (conditional) mean.]{.secondary}\n\n1. 
If $Y\\sim \\textrm{N}(\\mu,\\ 1)$, our best guess for a [new]{.secondary} $Y$ is $\\mu$. \n\n2. For regression, we let the mean $(\\mu)$ [depend]{.secondary} on $X$. \n3. Think of $Y\\sim \\textrm{N}(\\mu(X),\\ 1)$, then conditional on $X=x$, our best guess for a [new]{.secondary} $Y$ is $\\mu(x)$ \n\n[whatever this function $\\mu$ is]\n\n\n## Anything strange?\n\nFor any two variables $Y$ and $X$, we can [always]{.secondary} write\n\n$$Y = E[Y\\given X] + (Y - E[Y\\given X]) = \\mu(X) + \\eta(X)$$\n\nsuch that $\\Expect{\\eta(X)}=0$.\n\n. . .\n\n* Suppose, $\\mu(X)=\\mu_0$ (constant in $X$), are $Y$ and $X$ independent?\n\n. . .\n\n* Suppose $Y$ and $X$ are independent, is $\\mu(X)=\\mu_0$?\n\n. . .\n\n* For more practice on this see the \n[Fun Worksheet on Theory](../handouts/worksheet.pdf) and\n[solutions](../handouts/worksheet-solution.pdf)\n\n* In this course, I do not expect you to be able to create this math, but understanding and explaining it [is]{.secondary} important.\n\n\n# Making predictions\n\n\n\n## What do we mean by good predictions?\n\n\nWe make observations and then attempt to \"predict\" new, unobserved data.\n\nSometimes this is the same as estimating the (conditional) mean. \n \nMostly, we observe $(y_1,x_1),\\ \\ldots,\\ (y_n,x_n)$, and we want some way to predict $Y$ from $X$.\n\n\n\n## Expected test MSE \n\n\nFor _regression_ applications, we will use squared-error loss:\n\n$R_n(\\widehat{\\mu}) = \\Expect{(Y-\\widehat{\\mu}(X))^2}$\n\n. . .\n\nI'm giving this a name, $R_n$ for ease. \n\nDifferent than text.\n\nThis is _expected test MSE_.\n\n\n\n## Example: Estimating/Predicting the (conditional) mean\n\n\nSuppose we know that we want to predict a quantity $Y$, \n\nwhere $\\Expect{Y}= \\mu \\in \\mathbb{R}$ and $\\Var{Y} = 1$. \n\n\nOur data is $\\{y_1,\\ldots,y_n\\}$\n\nWe will use the sample mean $\\overline{Y}_n$ to estimate both $\\mu$ and $Y$. 
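\n\nAs a quick sanity check -- a minimal R sketch, assuming i.i.d. normal draws with mean 2 and variance 1 purely for concreteness -- we can simulate both risks and compare them with the values 1/n and 1 + 1/n derived on the next two slides:\n\n```{.r}\nset.seed(406)\nmu <- 2       # true mean (unknown in practice; 2 is an arbitrary choice)\nn <- 30       # sample size\nreps <- 5000  # Monte Carlo repetitions\nybar <- replicate(reps, mean(rnorm(n, mean = mu, sd = 1)))  # sample means\nynew <- rnorm(reps, mean = mu, sd = 1)                      # new, unseen Y's\nmean((ybar - mu)^2)   # estimation risk, close to 1/n (about 0.033)\nmean((ybar - ynew)^2) # prediction risk, close to 1 + 1/n (about 1.033)\n```\n\nAny distribution with variance 1 gives the same two values; normality is only for convenience here.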
\n\n\n## Estimating the mean\n\nWe evaluate the _estimation risk_ (since we're estimating $\\mu$) via:\n\n::: flex\n\n::: w-50\n \n\\begin{aligned}\n E[(\\overline{Y}_n-\\mu)^2]\n &= E[\\overline{Y}_n^2]\n -2\\mu E[\\overline{Y}_n] + \\mu^2 \\\\ \n &= \\mu^2 + \\frac{1}{n} - 2\\mu^2 +\n \\mu^2\\\\ &= \\frac{1}{n}\n\\end{aligned}\n\n:::\n\n::: w-50\n\n[Useful trick]{.primary}\n\nFor any $Z$,\n\n$\\Var{Z} = \\Expect{Z^2} - \\Expect{Z}^2$.\n\nTherefore:\n\n$\\Expect{Z^2} = \\Var{Z} + \\Expect{Z}^2$.\n\n:::\n:::\n \n\n\n## Predicting new Y's\n\nWe evaluate the _prediction risk_ of $\\overline{Y}_n$ (since we're predicting $Y$) via: \n\n::: flex\n::: w-50\n\\begin{aligned}\n R_n(\\overline{Y}_n) \n &= \\E[(\\overline{Y}_n-Y)^2]\\\\ \n &= \\E[(\\overline{Y}_n - \\mu)^2] + \\E[(\\mu-Y)^2]\\\\ \n &= \\frac{1}{n} + 1\n\\end{aligned}\n\n* $1/n$ for _estimation risk_\n* $1$ for remaining noise in $Y$\n\n:::\n::: w-50\n\n[Tricks:]{.primary}\n\nAdd and subtract $\\mu$ inside the square.\n\n$\\overline{Y}_n$ and $Y$ are independent and mean $\\mu$.\n\n:::\n:::\n\n## Predicting new Y's\n\n \n* What is the prediction risk of guessing $Y=0$?\n\n* You can probably guess that this is a stupid idea.\n\n* Let's show why it's stupid.\n\n\\begin{aligned}\n R_n(0) &= \\E[(0-Y)^2] = 1 + \\mu^2\n\\end{aligned}\n\n\n\n## Predicting new Y's\n\n\n* What is the prediction risk of guessing $Y=\\mu$?\n\n\n* This is a great idea, but we don't know $\\mu$.\n\n* Let's see what happens anyway.\n\n\\begin{aligned}\n R_n(\\mu) &= \\E[(Y-\\mu)^2]= 1\n\\end{aligned}\n\n\n\n## Risk relations\n\n \nPrediction risk: $R_n(\\overline{Y}_n) = 1 + \\frac{1}{n}$ \n\nEstimation risk: $E[(\\overline{Y}_n - \\mu)^2] = \\frac{1}{n}$ \n\nThere is actually a nice interpretation here:\n\n1. The common $1/n$ term is $\\Var{\\overline{Y}_n}$ \n2. The extra factor of $1$ in the prediction risk is _irreducible error_\n\n * $Y$ is a random variable, and hence noisy. \n * We can never eliminate it's intrinsic variance. \n * In other words, even if we knew $\\mu$, we could never get closer than $1$, on average.\n\nIntuitively, $\\overline{Y}_n$ is the obvious thing to do.\n \nBut what about unintuitive things...\n\n\n# Next time...\n\nTrading bias and variance\n", "supporting": [ "03-regression-function_files" ], diff --git a/schedule/slides/03-regression-function.qmd b/schedule/slides/03-regression-function.qmd index 9b93aac..9f12589 100644 --- a/schedule/slides/03-regression-function.qmd +++ b/schedule/slides/03-regression-function.qmd @@ -364,17 +364,12 @@ where $\Expect{Y}= \mu \in \mathbb{R}$ and $\Var{Y} = 1$. Our data is $\{y_1,\ldots,y_n\}$ -Claim: We want to estimate $\mu$. - -. . . - -Why? +We will use the sample mean $\overline{Y}_n$ to estimate both $\mu$ and $Y$. ## Estimating the mean -* Let $\widehat{Y}=\overline{Y}_n$ be the sample mean. -* We can ask about the _estimation risk_ (since we're estimating $\mu$): +We evaluate the _estimation risk_ (since we're estimating $\mu$) via: ::: flex @@ -409,29 +404,28 @@ $\Expect{Z^2} = \Var{Z} + \Expect{Z}^2$. ## Predicting new Y's -* Let $\widehat{Y}=\overline{Y}_n$ be the sample mean. -* What is the _prediction risk_ of $\overline{Y}$? 
- - +We evaluate the _prediction risk_ of $\overline{Y}_n$ (since we're predicting $Y$) via: ::: flex ::: w-50 \begin{aligned} R_n(\overline{Y}_n) &= \E[(\overline{Y}_n-Y)^2]\\ - &= \E[\overline{Y}_{n}^{2}] -2\E[\overline{Y}_n Y] + \E[Y^2] \\ - &= \mu^2 + \frac{1}{n} - 2\mu^2 + \mu^2 + 1 \\ - &= 1 + \frac{1}{n} + &= \E[(\overline{Y}_n - \mu)^2] + \E[(\mu-Y)^2]\\ + &= \frac{1}{n} + 1 \end{aligned} +* $1/n$ for _estimation risk_ +* $1$ for remaining noise in $Y$ + ::: ::: w-50 [Tricks:]{.primary} -Used the variance thing again. +Add and subtract $\mu$ inside the square. -If $X$ and $Z$ are independent, then $\Expect{XZ} = \Expect{X}\Expect{Z}$ +$\overline{Y}_n$ and $Y$ are independent and mean $\mu$. ::: :::