---
title: "02_a2-task5"
author: "Lukas Schmid"
date: "16 1 2020"
output: html_document
---
Repeat the steps we took in order to get from the likelihood function representing a linear regression to the optimization problem:
$$w = \underset{w}{\text{arg min }} \sum_{n=1}^{N} (y_n - w^Tx_n)^2$$
The likelihood function representing a linear regression is derived from the distribution $p(y|x)$ in a linear model. Since $y = w^T x + \epsilon$ (lecture 3, p. 18), where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ (meaning $\epsilon$ is normally distributed around 0 with variance $\sigma^2$), $y$ is itself normally distributed with mean $w^T x$ and variance $\sigma^2$.
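As a small sanity check (not part of the task itself), we can simulate data from exactly this model in R. The concrete values for $N$, the true $w$ and $\sigma$ below are illustrative choices of mine:

```{r}
# Simulate y = w * x + eps with Gaussian noise (one-dimensional x for simplicity)
set.seed(1)                            # reproducibility
N      <- 100                          # number of observations (illustrative)
w_true <- 2                            # assumed "true" weight (illustrative)
sigma  <- 1                            # assumed noise standard deviation (illustrative)
x   <- runif(N, min = -3, max = 3)     # inputs
eps <- rnorm(N, mean = 0, sd = sigma)  # Gaussian noise
y   <- w_true * x + eps                # y is normal with mean w_true * x and variance sigma^2
```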
The normal distribution $\mathcal{N}(\mu, \sigma^2)$ has the probability density function:
$$
\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x- \mu}{\sigma})^2}
= \frac{1}{\sigma\sqrt{2\pi}}\text{exp}\Big(-\frac{1}{2}\big(\frac{x- \mu}{\sigma}\big)^2\Big)
= \frac{1}{\sigma\sqrt{2\pi}}\text{exp}\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)
$$
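Just to convince myself that this density formula is right, a quick comparison with R's built-in `dnorm()` (the values of $\mu$, $\sigma$ and the evaluation point are arbitrary):

```{r}
# The density formula above, evaluated by hand and compared with dnorm()
mu <- 0        # illustrative mean
s  <- 1        # illustrative standard deviation
z  <- 0.5      # illustrative evaluation point
manual <- 1 / (s * sqrt(2 * pi)) * exp(-(z - mu)^2 / (2 * s^2))
all.equal(manual, dnorm(z, mean = mu, sd = s))
```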
Inserting the mean $\mu = w^T x$ and evaluating the density at $y$, we get:
$$p(y|x) = \frac{1}{\sigma\sqrt{2\pi}}\text{exp}(-\frac{(y-w^T x)^2}{2\sigma^2})$$
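In R this is simply `dnorm()` with mean $w^T x$, here for the first simulated observation and the illustrative `w_true` and `sigma` from above:

```{r}
# p(y_1 | x_1): a normal density with mean w_true * x_1 and standard deviation sigma
dnorm(y[1], mean = w_true * x[1], sd = sigma)
```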
Assuming the observations $y_n$ are independent, we can calculate $P(Y | w, X)$ as the product $\prod_{n=1}^{N}p(y_n | x_n)$. This probability is the likelihood $\mathcal{L}$ of generating the data $D$ given a certain $w$. Put another way, we calculate how probable it is that the observed data was generated using a given $w$.
$$\mathcal{L}(w|D) = \prod_{n=1}^{N}p(y_n | x_n) \\
= \prod_{n=1}^{N}\Bigg[\frac{1}{\sigma\sqrt{2\pi}}\text{exp}\bigg(-\frac{(y_n-w^T x_n)^2}{2\sigma^2}\bigg)\Bigg]$$
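Written as a function of $w$ (on the simulated data; the function and its name are my own illustration), this is:

```{r}
# Likelihood of the whole data set for a candidate weight w: product of the individual densities
likelihood <- function(w, x, y, sigma) {
  prod(dnorm(y, mean = w * x, sd = sigma))
}
likelihood(w_true, x, y, sigma)  # a tiny number -- one more reason to work with the log below
```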
This is still what we covered in class. Now, the task is to choose $w$ so as to maximise $\mathcal{L}(w|D)$:
$$\underset{w}{\text{arg max }}\mathcal{L}(w|D) = \underset{w}{\text{arg max }}\big(\text{log }\mathcal{L}(w|D)\big)$$
since the logarithm is a monotone function. Just looking at the part we try to maximise:
$$\text{log }\mathcal{L}(w|D) = \text{log }\big( \prod_{n=1}^{N}\Bigg[\frac{1}{\sigma\sqrt{2\pi}}\text{exp}\bigg(-\frac{(y_n-w^T x_n)^2}{2\sigma^2}\bigg)\Bigg]\big) \\
=
\text{log }\big(
\prod_{n=1}^{N}\Bigg[
\frac{1}{\sigma\sqrt{2\pi}}
\Bigg]
\prod_{n=1}^{N}\Bigg[
\text{exp}\bigg(
-\frac{(y_n-w^T x_n)^2}{2\sigma^2}
\bigg)
\Bigg]
\big) \\
=
\text{log }\big(
\prod_{n=1}^{N}\Bigg[
\frac{1}{\sigma\sqrt{2\pi}}
\Bigg]
\big) +
\text{log }\big(
\prod_{n=1}^{N}\Bigg[
\text{exp}\bigg(
-\frac{(y_n-w^T x_n)^2}{2\sigma^2}
\bigg)
\Bigg]
\big) \\
=
\text{log }\big(
\Bigg[
\frac{1}{\sigma\sqrt{2\pi}}
\Bigg]^N
\big) +
\text{log }\big(
\prod_{n=1}^{N}\Bigg[
\text{exp}\bigg(
-\frac{(y_n-w^T x_n)^2}{2\sigma^2}
\bigg)
\Bigg]
\big) \\
=
\text{log }1 -
\text{log }\big(
(\sigma\sqrt{2\pi})^N
\big) +
\text{log }\big(
\prod_{n=1}^{N}\Bigg[
\text{exp}\bigg(
-\frac{(y_n-w^T x_n)^2}{2\sigma^2}
\bigg)
\Bigg]
\big) \\
=
- \text{log}\big(
(\sigma\sqrt{2\pi})^N
\big) +
\text{log}\big(
\prod_{n=1}^{N}\Bigg[
\text{exp}\bigg(
-\frac{(y_n-w^T x_n)^2}{2\sigma^2}
\bigg)
\Bigg]
\big) \\
=
- \text{log}\big(
(\sigma\sqrt{2\pi})^N
\big) +
\sum_{n=1}^{N}\Bigg[
\text{log}\big(
\text{exp}\bigg(
-\frac{(y_n-w^T x_n)^2}{2\sigma^2}
\bigg)
\big)
\Bigg] \\
=
- \text{log}\big(
(\sigma\sqrt{2\pi})^N
\big) +
\sum_{n=1}^{N}\Bigg[
\text{log}(e)\cdot
\big(
-\frac{(y_n-w^T x_n)^2}{2\sigma^2}
\big)
\Bigg] \\
=
- \text{log}\big(
(\sigma\sqrt{2\pi})^N
\big) +
\sum_{n=1}^{N}\Bigg[
-\frac{(y_n-w^T x_n)^2}{2\sigma^2}
\Bigg] \\
=
- \text{log}\big(
(\sigma\sqrt{2\pi})^N
\big) -
\frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n-w^T x_n)^2 $$
The last line is almost the term on page 20 of lecture 3; the only difference is the factor $-\frac{1}{2\sigma^2}$, which in the slides is $-\frac{1}{\sigma^2}$.
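A quick numerical check of this closed form against the sum of the individual log-densities (still on the simulated data and the illustrative parameters from above):

```{r}
# The derived closed form of the log-likelihood ...
log_likelihood <- function(w, x, y, sigma) {
  n <- length(y)
  -n * log(sigma * sqrt(2 * pi)) - sum((y - w * x)^2) / (2 * sigma^2)
}
# ... agrees with summing the individual log-densities
all.equal(log_likelihood(w_true, x, y, sigma),
          sum(dnorm(y, mean = w_true * x, sd = sigma, log = TRUE)))
```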
Back to the maximisation: the first term $- \text{log}\big((\sigma\sqrt{2\pi})^N\big)$ does not depend on $w$ at all, so it has no influence on which $w$ maximises the expression and can be dropped. Maximising the remaining term, which is negative, is the same as minimising its positive counterpart, and the constant positive factor $\frac{1}{2\sigma^2}$ does not change where that minimum lies, so it can be dropped as well:
$$\underset{w}{\text{arg max }}\text{log }\mathcal{L}(w|D) \\
= \underset{w}{\text{arg max }}\Big[
- \text{log}\big(
(\sigma\sqrt{2\pi})^N
\big) -
\frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n-w^T x_n)^2
\Big] \\
= \underset{w}{\text{arg max }}\Big[
-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n-w^T x_n)^2
\Big] \\
= \underset{w}{\text{arg min }}\sum_{n=1}^{N}(y_n-w^T x_n)^2$$
Thus, the best hypothesis is the $w$ that minimises the sum of squared errors:
$$w = \underset{w}{\text{arg min }}\sum_{n=1}^{N}(y_n-w^T x_n)^2$$
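On the simulated data, minimising this sum of squared errors recovers (approximately) the weight used to generate it. `optimize()` and `lm()` below are just two ways of illustrating this; the search interval and variable names are my own choices:

```{r}
# Sum of squared errors as a function of w, minimised numerically
sse <- function(w, x, y) sum((y - w * x)^2)
w_hat_optim <- optimize(sse, interval = c(-10, 10), x = x, y = y)$minimum
# The same least-squares fit via lm(), without an intercept
w_hat_lm <- unname(coef(lm(y ~ x - 1)))
c(optimize = w_hat_optim, lm = w_hat_lm, true = w_true)
```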