---
title: "Kernel Density Estimation"
author: "Raven McKnight & Kaden Bieger"
date: "May 9, 2020"
output:
  html_document:
    theme: readable
    toc: true
    toc_float: true
    code_folding: hide
---
```{r echo = FALSE}
# this code chunk installs and loads all the libraries you'll need for the activity
packages <- c('ggplot2', 'gridExtra')
miss_pkgs <- packages[!packages %in% installed.packages()[,1]]
if(length(miss_pkgs) > 0){
  install.packages(miss_pkgs)
}
invisible(lapply(packages, library, character.only = TRUE))
rm(miss_pkgs, packages)
set.seed(455)
```
# Introduction
Nonparametric statistics is a rapidly developing field. It is also very different from the content we've covered in class so far! Broadly speaking, nonparametric methods allow us to relax assumptions about our data. We often assume our data come from a normal distribution, or at the very least from a distribution with mean $\mu$ and variance $\sigma^2$. Nonparametric methods are not built around such parameters.

*Kernel density estimation* (KDE) is a common technique for estimating probability density functions (PDFs). In practice, we rarely know much at all about the true distribution of our sampled data. KDE lets us estimate a PDF without making unreasonable assumptions about our data.
## Intuition
KDE is a method you've seen before even if you don't know it! The intuition behind KDE is similar to the intuition behind a simple **histogram**. Histograms can be used to visually approximate the distribution of data. There are a few reasons we want a more sophisticated method, however. First, as the plot below illustrates, histograms are very sensitive to the number and width of bins. Additionally, a histogram provides an excessively local estimate -- there is no "smoothing" between neighboring bins.
```{r}
set.seed(455)
simdat <- rnorm(n = 100, mean = 2, sd = 4)
# fn borrowed from Son Phan
hist_bins <- function(data, bins) {
  ggplot(mapping = aes(x = data, y = ..density..)) +
    geom_histogram(bins = bins, fill = "steelblue3") +
    ggtitle(sprintf(fmt = "%i bins", bins)) + theme_minimal()
}
grid.arrange(hist_bins(simdat, 10),
             hist_bins(simdat, 20),
             hist_bins(simdat, 40),
             hist_bins(simdat, 80), nrow = 2)
```
If you've ever made a **density plot**, you have even more experience with KDE. Behind the scenes, the `ggplot2` function `geom_density` employs KDE! It turns out we can set the specific **kernel** and **bandwidth** when we call `geom_density`. We'll cover kernels and bandwidths below.
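
For example, here is a minimal illustration using the `simdat` vector simulated above; the particular kernel and bandwidth are arbitrary choices, shown only to demonstrate that both can be passed to `geom_density`.

```{r}
# geom_density exposes the kernel and bandwidth used behind the scenes
ggplot(mapping = aes(x = simdat)) +
  geom_density(kernel = "gaussian", bw = 0.5, color = "steelblue3") +
  theme_minimal()
```
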
# Kernel Density Estimation
The goal of kernel density estimation is to estimate the PDF of our data without knowing its distribution. We use a **kernel** as a weighting function that smooths our data to produce this estimate.
A **kernel** is a probability density function with several additional conditions:
1. Kernels are non-negative and real-valued
2. Kernels are symmetric about 0
Several familiar PDFs, including the Gaussian and Uniform PDFs, meet these requirements. Once we have a kernel selected, we can implement KDE.
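
For concreteness, two such kernels (stated here for reference) are the Gaussian and uniform (rectangular) kernels:
$$
K_{\text{Gauss}}(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}, \qquad K_{\text{unif}}(u) = \frac{1}{2} \text{ for } |u| \leq 1 \text{ (and 0 otherwise)}
$$
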
Given data $x = (x_1, x_2, \ldots, x_n)$, a kernel function $K$, and a selected bandwidth $h$, the kernel density estimate at point $s$ is defined
\begin{equation}
\hat{f}_n(s) = \frac{1}{nh} \sum_{i = 1}^{n} K\left(\frac{x_i - s}{h}\right)
\end{equation}
This is the kernel density estimator at a *single* point. To estimate an entire PDF, we apply the kernel to each point in our sample. The procedure is as follows:
1. Apply the kernel function to each data point in our sample
```{r, echo=FALSE}
set.seed(10)
x1 <- rnorm(30, mean = 400, sd = 400)
y1 <- rnorm(30, mean = 0, sd = 0)
simdat1 <- data.frame(x1, y1)
# one Gaussian kernel (sd = 50) centered at each of the 30 data points
kernel_layers <- lapply(x1, function(m) stat_function(fun = dnorm, args = list(mean = m, sd = 50)))
ggplot(simdat1) +
  geom_point(aes(x = x1, y = y1), color = "blue", size = 2) +
  kernel_layers
```
2. Sum all $n$ kernel functions
```{r, echo=FALSE}
# sum the 30 kernel functions centered at each data point
eq <- function(x){
  Reduce(`+`, lapply(x1, function(m) dnorm(x, mean = m, sd = 50)))
}
ggplot(simdat1) +
  geom_point(aes(x = x1, y = y1), color = "blue", size = 2) +
  stat_function(fun = eq)
```
3. Divide by $n$. Because each kernel function integrates to one, the resulting KDE will still integrate to one.
```{r, echo=FALSE}
kde <- function(x){
  eq(x) / 30
}
ggplot(simdat1) +
  geom_point(aes(x = x1, y = y1), color = "blue", size = 2) +
  stat_function(fun = kde)
```
In notation, we can write this procedure as $\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K(x - x_i)$, where $x - x_i$ centers the kernel function on each $x_i$. Note the similarity to the formal definition of a kernel density estimator above -- the only additional ingredient there is the bandwidth!
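
To make the procedure concrete, here is a minimal sketch of a Gaussian kernel density estimator written directly from the definition above. The function name `kde_gauss` and its arguments are our own choices for illustration, not something used elsewhere in this activity.

```{r}
# f_hat(s) = (1/(n*h)) * sum_i K((x_i - s)/h), with K the standard normal PDF
kde_gauss <- function(s, data, h){
  sapply(s, function(pt) mean(dnorm((data - pt) / h)) / h)
}
# evaluate the estimate for the simulated points above on a grid
eval_pts <- seq(min(x1) - 100, max(x1) + 100, length.out = 200)
head(kde_gauss(eval_pts, data = x1, h = 50))
```
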
```{r echo = FALSE}
#simulate data, make it impossible to see this code
set.seed(455)
asimdat <- rchisq(n = 100, df = 5)
asimdat <- data.frame(x = asimdat)
```
#### Activity 1
Consider 100 independent and identically distributed samples from an unknown distribution (plotted below). Given a bandwidth $h = 0.5$, write the Gaussian kernel density estimator for this sample at $s = 5$. We can use a standard normal with $\sigma^2 = 1$ and $\mu = 0$.
```{r}
ggplot(asimdat, aes(x = x)) +
  geom_histogram(bins = 15, fill = "steelblue3") +
  theme_minimal() + ggtitle("iid sample from unknown distribution")
```
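
As a check on your work, plugging $n = 100$, $h = 0.5$, and the standard normal kernel into the definition above gives an estimator of the form
$$
\hat{f}_{100}(5) = \frac{1}{100 \cdot 0.5} \sum_{i = 1}^{100} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x_i - 5}{0.5}\right)^2\right)
$$
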
Try plotting your calculated kernel density estimate below. Compare it to the output of `geom_density`.
```{r}
my_est <- function(x, x_i = 5, h = 0.5){
  # put your estimate here!
  1 / ((100 * h) * sqrt(2 * pi)) * exp(-1/2 * ((x - x_i) / h)^2)
}
# to get the whole kde estimate, sum my_est over each s
kde <- function(x, h = 0.5){
  kern <- matrix(ncol = 100, nrow = 100)
  for(i in 1:100){
    kern[i, ] <- my_est(x = asimdat$x, x_i = asimdat$x[i], h = h)
  }
  colSums(kern)
}
# uncomment lines in this ggplot code as you finish the functions above
ggplot(asimdat, aes(x = x)) +
  theme_minimal() +
  geom_density() + # generic geom_density
  geom_density(bw = 0.5, kernel = "gaussian", color = "blue") + # closer to your estimator
  stat_function(fun = my_est, color = "red") + # your estimate at s = 5
  geom_line(aes(x = x, y = kde(x)), color = "pink") + # and your overall estimate!
  NULL
```
# Choosing a Kernel Function
Kernel functions are non-negative, symmetric, and decreasing for all $x > 0$ (because they're symmetric, this also means they are increasing for all $x < 0$). The Gaussian and rectangular (uniform) kernels are two choices, though there are many others. The plot below shows kernel density estimates for the same simulated data using four common kernels. In this example, we use a fairly low bandwidth (0.3) to make the differences between the kernels clear.
```{r}
gaus <- ggplot(mapping = aes(x = simdat, y = ..density..)) +
  geom_histogram(bins = 10, fill = "steelblue3") + theme_minimal() +
  geom_density(kernel = "gaussian", bw = 0.3, color = "pink", size = 1) +
  ggtitle("Gaussian Kernel")
e <- ggplot(mapping = aes(x = simdat, y = ..density..)) +
  geom_histogram(bins = 10, fill = "steelblue3") + theme_minimal() +
  geom_density(kernel = "epanechnikov", bw = 0.3, color = "pink", size = 1) +
  ggtitle("Epanechnikov Kernel")
tri <- ggplot(mapping = aes(x = simdat, y = ..density..)) +
  geom_histogram(bins = 10, fill = "steelblue3") + theme_minimal() +
  geom_density(kernel = "triangular", bw = 0.3, color = "pink", size = 1) +
  ggtitle("Triangular Kernel")
rect <- ggplot(mapping = aes(x = simdat, y = ..density..)) +
  geom_histogram(bins = 10, fill = "steelblue3") + theme_minimal() +
  geom_density(kernel = "rectangular", bw = 0.3, color = "pink", size = 1) +
  ggtitle("Rectangular Kernel")
grid.arrange(gaus, e, tri, rect, nrow = 2)
```
The choice of kernel doesn't affect the end result all that much, although some kernels do have particular advantages. The Epanechnikov kernel, for example, is optimal in the mean-squared-error sense.
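
For reference, the Epanechnikov kernel mentioned above has the simple closed form
$$
K(u) = \frac{3}{4}\left(1 - u^2\right) \text{ for } |u| \leq 1 \text{ (and 0 otherwise)}
$$
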
**Bonus Activity:** If you want to try out other kernel options inside `geom_density`, go to "Help" in RStudio and search for "density." Scroll down to "Arguments" to see all possible kernels!
# Choosing Bandwidth
The bandwidth is the width of our smoothing window -- following the histogram example, this is like the width of bins. Bandwidth is generally represented as $h$. Bandwidth is how we control the smoothness of our estimate. The figure below shows a Gaussian KDE with various bandwidths.
```{r}
gaus25 <- ggplot(mapping = aes(x = simdat, y = ..density..)) +
  geom_histogram(bins = 10, fill = "steelblue3") + theme_minimal() +
  geom_density(kernel = "gaussian", bw = 0.25, color = "pink", size = 1) +
  ggtitle("Gaussian Kernel, bandwidth = 0.25")
gaus50 <- ggplot(mapping = aes(x = simdat, y = ..density..)) +
  geom_histogram(bins = 10, fill = "steelblue3") + theme_minimal() +
  geom_density(kernel = "gaussian", bw = 0.50, color = "pink", size = 1) +
  ggtitle("Gaussian Kernel, bandwidth = 0.50")
gaus75 <- ggplot(mapping = aes(x = simdat, y = ..density..)) +
  geom_histogram(bins = 10, fill = "steelblue3") + theme_minimal() +
  geom_density(kernel = "gaussian", bw = 0.75, color = "pink", size = 1) +
  ggtitle("Gaussian Kernel, bandwidth = 0.75")
gaus1 <- ggplot(mapping = aes(x = simdat, y = ..density..)) +
  geom_histogram(bins = 10, fill = "steelblue3") + theme_minimal() +
  geom_density(kernel = "gaussian", bw = 1, color = "pink", size = 1) +
  ggtitle("Gaussian Kernel, bandwidth = 1")
```
## Bandwidth tradeoffs
A small $h$ places more weight on the data, while a large $h$ performs more smoothing. We can use smaller bandwidths when $n$ is larger, particularly when the data have a small range (i.e., they are tightly packed). A larger $h$ is better when we have less data or more spread-out observations.
There's an analogy here with Bayesian priors -- choosing $h$ lets us choose how much weight we want to give the data.
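
In practice, we usually don't pick $h$ by eye. Base R's `stats` package ships several data-driven bandwidth selectors; here is a quick sketch comparing them on `simdat` (which selector is preferable depends on the data):

```{r}
# rule-of-thumb and plug-in bandwidth selectors from base R
c(silverman = bw.nrd0(simdat),  # Silverman's rule of thumb (the default in density())
  scott = bw.nrd(simdat),       # Scott's rule of thumb
  sj = bw.SJ(simdat))           # Sheather-Jones plug-in selector
```
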
#### Activity 2
Using the kernel you defined above, plot KDE estimates using various values of $h$. You can do so by passing different bandwidths to the `kde` function you wrote for Activity 1. For comparison, you can also set the `bw` parameter within `geom_density`. How might you choose an optimal bandwidth?
```{r}
ggplot(asimdat, aes(x = x, y = kde(h = 2))) +
  geom_line()
```
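
One way to extend the comparison (a sketch; the specific bandwidths here are arbitrary) is to overlay your estimator for several values of $h$ at once:

```{r}
# overlay the hand-rolled KDE for a few bandwidths on the same axes
ggplot(asimdat, aes(x = x)) +
  geom_line(aes(y = kde(h = 0.25)), color = "pink") +
  geom_line(aes(y = kde(h = 0.5)), color = "red") +
  geom_line(aes(y = kde(h = 2)), color = "darkred") +
  theme_minimal()
```
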
# Bias, Variance, MSE
## Bias
Naturally, we would like to consider the bias and mean squared error of estimators produced by kernel density estimation. In this section, we outline proofs deriving the expected value, bias, and mean squared error for kernel density estimates. The proofs here are somewhat simplified but largely rely on **u-substitution** and **Taylor expansions**.
Given observations $x_1, x_2, \ldots, x_n \overset{iid}{\sim} f$, we define the expected value of our estimator as follows:
$$
\mathbf{E}[\hat{f}_n(s)] = \mathbf{E}\left[\frac{1}{nh} \displaystyle \sum_{i = 1}^{n} K \left( \frac{x_i - s}{h}\right) \right] = \frac{1}{h}\mathbf{E}\left[K \left( \frac{x - s}{h}\right) \right] = \frac{1}{h} \int K\left( \frac{x - s}{h}\right) f(x) dx
$$
The first line follows directly from the definitions of the kernel density estimator and expected value; because the $x_i$ are iid, the $n$ identical expectations cancel the $\frac{1}{n}$, leaving a single representative $x$.
Next, we let $u = \frac{x - s}{h}$, so that $x = hu + s$ and $dx = h\,du$. The factor of $h$ from $dx$ cancels the $\frac{1}{h}$ out front:
$$
\mathbf{E}[\hat{f}_n(s)] = \frac{1}{h} \int K\left( u\right) f(hu + s)\, h\, du = \int K(u) f(hu + s)\, du
$$
Then, we apply the 2nd order Taylor expansion for $f(hu + s)$ about $h = 0$. Omitting several steps of algebra, our Taylor expansion can be written
$$
\begin{aligned}
f(hu + s) &= f(s) + \frac{f'(s)}{1!}(u)(h-0) + \frac{f''(s)}{2!}(u^2)(h-0)^2 + o(h^2) \\
&= f(s) + huf'(s) + \frac{h^2u^2}{2}f''(s) + o(h^2)
\end{aligned}
$$
where $o(h^2)$ denotes a remainder term that vanishes faster than $h^2$. Technically, this $o()$ notation indicates that the remainder becomes negligible compared to $h^2$ as $h \rightarrow 0$. **For our purposes, we simply treat $o(h^2) \rightarrow 0$ as $h \rightarrow 0$**.
We plug the Taylor expansion into our expected value above and simplify, using the facts that $\int K(u)\, du = 1$ (a kernel is a PDF) and $\int u K(u)\, du = 0$ (a kernel is symmetric about 0).
$$
\begin{aligned}
\mathbf{E}[\hat{f}_n(s)] & = \int K(u) \left[f(s) + huf'(s) + \frac{h^2u^2}{2}f''(s) + o(h^2)\right] du \\ \\
& = f(s)\int K(u) du + hf'(s)\int uK(u) du + \frac{h^2}{2}f''(s)\int u^2 K(u)du + o(h^2) \\
& = f(s) + \frac{h^2}{2}f''(s)\int u^2 K(u)du + o(h^2)
\end{aligned}
$$
We can plug this into the definition of bias such that
$$
\begin{aligned}
\textbf{Bias}(\hat{f}_n(s)) &= E[\hat{f}_n(s)] - f(s) = \frac{h^2}{2}f''(s)\int u^2 K(u)du + o(h^2) \\
&= \frac{t \cdot h^2}{2}f''(s) + o(h^2)
\end{aligned}
$$
where $t = \int u^2 K(u)du$.
#### Activity 3
What happens to bias as bandwidth $h$ increases/decreases?
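
If you want to check your intuition numerically, here is a small simulation sketch. It assumes the underlying density is the $\chi^2_5$ used to generate `asimdat` (which we wouldn't know in practice), and the number of replications is arbitrary:

```{r}
# approximate the bias of the Gaussian KDE at s = 5 for several bandwidths
s <- 5
true_f <- dchisq(s, df = 5)
sapply(c(0.25, 0.5, 1, 2), function(h) {
  est <- replicate(500, {
    samp <- rchisq(100, df = 5)
    mean(dnorm((samp - s) / h)) / h  # (1/(nh)) * sum_i K((x_i - s)/h)
  })
  mean(est) - true_f                 # estimated bias at s = 5
})
```
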
## Variance
The proof for variance is very similar to the proof for bias. Symmetry of $K$ about 0 again plays a role: it forces $\int u K^2(u)\, du = 0$, so the corresponding first-order term drops out. Feel free to work through this proof in more detail on your own!
$$
\begin{aligned}
\mathbf{Var}(\hat{f}_n(s))
&=
\mathbf{Var} \left(\frac{1}{nh} \displaystyle \sum_{i = 1}^{n} K \left(\frac{x_i - s}{h}\right)\right) \\
&= \frac{1}{nh^2} \left(\mathbf{E}\left[ K^2 \left( \frac{x - s}{h}\right) \right] - \mathbf{E}\left[ K \left( \frac{x - s}{h}\right)\right]^2\right)\\
&\leq
\frac{1}{nh^2} \mathbf{E}\left[ K^2 \left( \frac{x - s}{h}\right) \right] \\
&=
\frac{1}{nh^2} \int K^2 \left( \frac{x - s}{h}\right)f(x) dx
\end{aligned}
$$
As above, we substitute $u = \frac{x-s}{h}$ and plug in a 1st order Taylor expansion of $f(hu + s)$.
$$
\begin{aligned}
\mathbf{Var}(\hat{f}_n(s))
&\leq
\frac{1}{nh^2} \int K^2(u)f(hu + s)hdu \\
&=
\frac{1}{nh} \int K^2(u)f(hu + s)du \\
&=
\frac{1}{nh} \int K^2(u)[f(s) + huf'(s) + o(h)]du \\
&=
\frac{1}{nh} \bigg(f(s)\int K^2(u) du + hf'(s)\int uK^2(u) du + o(h)\bigg) \\
\mathbf{Var}(\hat{f}_n(s)) &\leq \frac{f(s)}{nh}\int K^2(u) du + o\bigg(\frac{1}{nh}\bigg) \\
&=
\frac{z}{nh}f(s) + o\bigg(\frac{1}{nh}\bigg)
\end{aligned}
$$
where $z = \int K^2(u) du$.
#### Activity 4
What happens to variance as $h$ changes? As $n$ changes?
#### Activity 5
Given the Bias and Variance above, find the Mean Squared Error of our estimator. Recall that the formula for MSE is $Var() + Bias^2()$.
$$
\begin{aligned}
MSE(\hat{f}_n(s)) &=
Bias^2(\hat{f}_n(s)) + Var(\hat{f}_n(s)) \\
&=
\left(\frac{th^2}{2}f''(s) + o(h^2) \right)^2 + \frac{z}{nh}f(s) + o\left(\frac{1}{nh}\right) \\
&=
\frac{t^2h^4}{4}\left[f''(s)\right]^2 + \frac{z}{nh}f(s) + o(h^4) + o\left(\frac{1}{nh}\right)
\end{aligned}
$$
## Asymptotic MSE
Given the MSE we derived above, we can see that the Asymptotic MSE (AMSE) is $\frac{t^2h^4}{4}\left[f''(s)\right]^2 + \frac{z}{nh}f(s)$. Often, we use the AMSE to optimize $h$ at a given point.
#### Activity 6
Try optimizing the AMSE in terms of $h$. If you're up for a challenge, try finding the optimal $h$ for the sample in Activity 1 (with $h = 0.5$ and $s = 5$)!
$$
\begin{align*}
\frac{\partial}{\partial h} \textbf{AMSE}(\hat{f}_n(s))
&=
\frac{\partial}{\partial h}\left(\frac{t^2h^4}{4}\left[f''(s)\right]^2 + \frac{z}{nh}f(s)\right) \\
&=
t^2 h^3\left[f''(s)\right]^2 - \frac{z}{nh^2}f(s) \\
0 &= t^2 h^5 \left[f''(s)\right]^2 - \frac{z}{n}f(s) \\
h_{opt} &= \left(\frac{zf(s)}{nt^2\left[f''(s)\right]^2}\right)^{\frac{1}{5}}
\end{align*}
$$
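
As a sanity check on the challenge version of this activity, here is a rough numerical sketch. It assumes a Gaussian kernel, for which $t = \int u^2K(u)\,du = 1$ and $z = \int K^2(u)\,du = \frac{1}{2\sqrt{\pi}}$, and it plugs in the true $\chi^2_5$ density behind `asimdat`, which we wouldn't know in practice:

```{r}
# pointwise AMSE-optimal bandwidth at s = 5 for the Activity 1 sample
n <- 100
s <- 5
t_const <- 1                   # second moment of the standard normal kernel
z_const <- 1 / (2 * sqrt(pi))  # integral of the squared standard normal kernel
f_s <- dchisq(s, df = 5)
# approximate f''(s) with a central finite difference
eps <- 1e-4
f2_s <- (dchisq(s + eps, df = 5) - 2 * f_s + dchisq(s - eps, df = 5)) / eps^2
(h_opt <- (z_const * f_s / (n * t_const^2 * f2_s^2))^(1/5))
```
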
## An Open Question: AMISE
The optimization in Activity 6 optimizes $h$ at a single point $s$. A common way to optimize $h$ across an entire distribution is the asymptotic mean integrated squared error (AMISE). As the name suggests, we define the AMISE by integrating the AMSE over $s$:
$$
\textbf{AMISE}(\hat{f}_n) = \int \left(\frac{t^2h^4}{4}\left[f''(s)\right]^2 + \frac{z}{nh}f(s) \right) ds
$$
It's possible to optimize AMISE in terms of $h$ to get the optimal bandwidth across an entire sample. We won't ask you to do this -- it's definitely a challenge problem! In the end, you'd find that the optimal $h$ is dependent upon $\int \left[f''(x)\right]^2 dx$, or the curvature of the underlying PDF. Many recent advancements in KDE studies have been in the realm of estimating AMISE or the underlying curvature in order to optimize $h$.
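
Though we won't carry out the integration here, optimizing the AMISE in $h$ (by the same differentiation as in Activity 6) yields a global bandwidth of the form
$$
h_{AMISE} = \left(\frac{z}{n \, t^2 \int \left[f''(x)\right]^2 dx}\right)^{\frac{1}{5}},
$$
which is the pointwise $h_{opt}$ from Activity 6 with $f(s)$ replaced by $\int f(s)\,ds = 1$ and $\left[f''(s)\right]^2$ replaced by its integral.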