# Derivatives
## Smooth functions
A smooth function is one that is continuously differentiable over its
domain. Unless explicitly stated, all functions under consideration
are differentiable as many times as necessary.
## Derivatives
Consider a smooth, univariate function $f:\mathbb{R}
\rightarrow \mathbb{R}.$ Its derivative function, $f':\mathbb{R} \rightarrow
\mathbb{R}$, maps a scalar $x \in \mathbb{R}$ to the derivative of
$f(x)$ with respect to $x$, and is defined as a limit
$$
f'(x)
= \frac{\textrm{d}}{\textrm{d}x}f(x)
= \lim_{\epsilon \rightarrow 0} \frac{f(x + \epsilon) - f(x)}{\epsilon},
$$
where the numerator is the change in $y = f(x)$ and the denominator is
the change in $x$.
Rewriting the expression for the derivative,
$$
\lim_{\epsilon \rightarrow 0}
\frac{f(x + \epsilon) - (f(x) + \epsilon \cdot f'(x))}
{\epsilon}
= 0.
$$
Thus, for small $\epsilon$,
$$
f(x + \epsilon) \approx f(x) + \epsilon \cdot f'(x).
$$
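This linear approximation can be checked numerically. The following sketch (the test function, evaluation point, and step size are illustrative choices, not part of the exposition) compares a finite-difference quotient with a known analytic derivative.

```{r}
# Finite-difference approximation of f'(x) for f(x) = x^3,
# whose exact derivative is 3 * x^2 (illustrative example).
f <- function(x) x^3
x <- 2
eps <- 1e-6

fd <- (f(x + eps) - f(x)) / eps  # difference quotient
exact <- 3 * x^2                 # analytic derivative
c(finite_difference = fd, exact = exact)
```

For small `eps` the two values agree to several decimal places; shrinking `eps` much further eventually degrades the approximation because of floating-point cancellation.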
## Partial derivatives
For a function $f:\mathbb{R}^N \rightarrow \mathbb{R}$, the partial
derivative with respect to $x_n$ at $x \in \mathbb{R}^N$ is
$$
\frac{\partial}{\partial x_n} f(x) = g'(x_n),
$$
where $g : \mathbb{R} \rightarrow \mathbb{R}$ is the univariate
function defined by
$$
g(u) = f(x_1, \ldots, x_{n-1}, u, x_{n+1}, \ldots, x_N).
$$
That is, we differentiate $f(x)$ with respect to $x_n$, holding all
other values $x_1, \ldots, x_{n - 1}, x_{n + 1}, \ldots, x_N$
constant.
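As a numerical illustration (the function and values below are illustrative), a partial derivative can be approximated by perturbing only the $n$-th coordinate while holding the others fixed.

```{r}
# Partial derivative of f(x) = x[1]^2 * x[2] with respect to x[2],
# holding x[1] fixed; the exact value is x[1]^2.
f <- function(x) x[1]^2 * x[2]
x <- c(3, 5)
eps <- 1e-6

e2 <- c(0, 1)                              # perturb only the second coordinate
fd <- (f(x + eps * e2) - f(x)) / eps
c(finite_difference = fd, exact = x[1]^2)
```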
## Chain rule
Automatic differentiation is driven by the chain rule, which states
that
$$
\frac{\partial}{\partial x} f(u)
= \frac{\partial}{\partial u} f(u)
\cdot \frac{\partial}{\partial x} u
= f'(u) \cdot \frac{\partial}{\partial x} u.
$$
Including the differentiated function in the numerator makes the
relationship clearer,
$$
\frac{\partial f(u)}{\partial x}
=
\frac{\partial f(u)}{\partial u}
\cdot
\frac{\partial u}{\partial x}.
$$
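The chain rule can likewise be verified numerically for a simple composition (the functions, point, and step size below are illustrative).

```{r}
# Chain rule check for f(u) = sin(u) with u(x) = x^2, so that
# d/dx f(u(x)) = cos(x^2) * 2 * x.
u <- function(x) x^2
f <- function(u) sin(u)
x <- 1.3
eps <- 1e-6

fd    <- (f(u(x + eps)) - f(u(x))) / eps   # derivative of the composition
chain <- cos(u(x)) * (2 * x)               # f'(u) * du/dx
c(finite_difference = fd, chain_rule = chain)
```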
## Gradients
Consider a smooth multivariate function $f:\mathbb{R}^N \rightarrow
\mathbb{R}.$ Its gradient function, $\nabla f:\mathbb{R}^N \rightarrow
\mathbb{R}^N,$ maps a vector $x \in \mathbb{R}^N$ to the vector of
partial derivatives of $f(x)$,
$$
\nabla f(x) =
\begin{bmatrix}
\frac{\partial}{\partial x_1} f(x)
& \cdots &
\frac{\partial}{\partial x_N} f(x)
\end{bmatrix}.
$$
The notation indicates that the result is a $1 \times N$ row vector.
The notation $\nabla^{\top} f$ indicates the function which transposes
the results of $\nabla f$ to produce column vectors, i.e.,
$$
\nabla^{\top} f(x)
= \left( \nabla f(x) \right)^{\top}
= \begin{bmatrix}
\frac{\partial}{\partial x_1} f(x)
\\
\vdots
\\
\frac{\partial}{\partial x_N} f(x)
\end{bmatrix}.
$$
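A gradient can be approximated one coordinate at a time by finite differences, as in the following sketch (the function, point, and step size are illustrative).

```{r}
# Gradient of f(x) = x[1]^2 + 3 * x[2] by one-sided finite differences;
# the exact gradient is (2 * x[1], 3).
f <- function(x) x[1]^2 + 3 * x[2]
x <- c(1, 2)
eps <- 1e-6

grad_fd <- vapply(seq_along(x), function(n) {
  e <- replace(numeric(length(x)), n, 1)   # unit vector in coordinate n
  (f(x + eps * e) - f(x)) / eps
}, numeric(1))
rbind(finite_difference = grad_fd, exact = c(2 * x[1], 3))
```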
## Gradient-vector products
Given a smooth function $f : \mathbb{R}^N \rightarrow \mathbb{R}$, the
derivative of $f$ along a vector $v \in \mathbb{R}^N$ at a point $x \in \mathbb{R}^N$
is defined by
$$
\nabla_v f(x)
=
\lim_{\epsilon \rightarrow 0}
\frac{f(x + \epsilon \cdot v) - f(x)}{\epsilon}.
$$
This definition is equivalent to defining the derivative along a
vector as a standard gradient times the vector,
$$
\nabla_v f(x)
= \nabla f(x) \cdot v
= \sum_{n = 1}^N \nabla f(x)[n] \cdot v_n
= \sum_{n = 1}^N \frac{\partial f(x)}{\partial x_n} \cdot v_n.
$$
The vector multiplication is conformable because $\nabla f(x)$ is a row
vector by definition, whereas $v$ is a standard column vector.
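Because the derivative along $v$ only requires evaluating $f$ at points perturbed in the direction $v$, it can be approximated with a single extra function evaluation, as in this sketch (the function, point, and vector are illustrative).

```{r}
# Derivative of f along a vector v, computed two ways:
# (1) from the limit definition with a small eps,
# (2) as the analytic gradient times v.
f <- function(x) x[1]^2 + x[1] * x[2]
x <- c(1, 2)
v <- c(3, -1)
eps <- 1e-6

along_v <- (f(x + eps * v) - f(x)) / eps   # single perturbed evaluation
grad    <- c(2 * x[1] + x[2], x[1])        # analytic gradient of f
c(finite_difference = along_v, gradient_times_v = sum(grad * v))
```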
## Directional derivatives
A unit vector $u \in \mathbb{R}^N$, i.e., one where $u^{\top}
\cdot u = \sum_{n = 1}^N u_n^2 = 1$, picks out a point on the unit
sphere and hence a direction.
The derivative of $f$ along a unit vector $u$ at the point $x$ is called
a directional derivative, as it provides the change in $f$ in the
direction picked out by $u.$
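Continuing the sketch above (same illustrative function and point), a directional derivative is just the derivative along $v$ after normalizing $v$ to unit length.

```{r}
# Directional derivative: normalize v to a unit vector u, then take the
# derivative of f along u.
f <- function(x) x[1]^2 + x[1] * x[2]
x <- c(1, 2)
v <- c(3, -1)
eps <- 1e-6

u <- v / sqrt(sum(v^2))                    # unit vector in the direction of v
(f(x + eps * u) - f(x)) / eps              # change in f in the direction u
```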
## Jacobians
Consider a smooth multivariate function $f:\mathbb{R}^N \rightarrow
\mathbb{R}^M$. Its Jacobian function, $\textrm{J}_f:\mathbb{R}^N \rightarrow
\mathbb{R}^{M \times N},$ maps a vector $x \in
\mathbb{R}^N$ to the $M \times N$ matrix of partial derivatives of
each element of $f(x)$ with respect to each $x_n$,
$$
\textrm{J}_f(x) = \frac{\partial}{\partial x}f(x).
$$
Row $m$ of the Jacobian matrix is the gradient of the $m$-th output,
and its entries make up all of the partial derivatives of $f$,
$$
\textrm{J}_f(x) =
\begin{bmatrix}
\nabla f_1(x)
\\
\vdots
\\
\nabla f_M(x)
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial}{\partial x_1} f_1(x)
& \cdots &
\frac{\partial}{\partial x_N} f_1(x)
\\
\vdots & \ddots & \vdots
\\
\frac{\partial}{\partial x_1} f_M(x)
& \cdots &
\frac{\partial}{\partial x_N} f_M(x)
\end{bmatrix},
$$
where $f_m(x)$ is defined by selecting the $m$-th element of
$f(x)$,^[Notations $v[i]$ and $v_i$ will be used interchangeably for
the $i$-th element of vector $v$, and similarly for $m[i, j]$ and
$m_{i,j}$ for the elements of matrices.]
$$
f_m(x) = f(x)[m].
$$
Elementwise, the entries of the Jacobian are
$$
\textrm{J}_f(x)[m, n] = \frac{\partial}{\partial x_n}f_m(x).
$$
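A Jacobian can be approximated column by column with finite differences, as in the following sketch (the function and point are illustrative); column $n$ holds the partial derivatives of all outputs with respect to $x_n$.

```{r}
# Jacobian of f(x) = (x[1] * x[2], sin(x[1])) by finite differences;
# the result is the M x N = 2 x 2 matrix of partials.
f <- function(x) c(x[1] * x[2], sin(x[1]))
x <- c(1, 2)
eps <- 1e-6

J_fd <- sapply(seq_along(x), function(n) {
  e <- replace(numeric(length(x)), n, 1)
  (f(x + eps * e) - f(x)) / eps            # column n: partials w.r.t. x[n]
})
J_exact <- rbind(c(x[2], x[1]), c(cos(x[1]), 0))
list(finite_difference = J_fd, exact = J_exact)
```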
## Hessians
Consider a smooth multivariate function $f : \mathbb{R}^N \rightarrow
\mathbb{R}$ with a single output. The Hessian function $\textrm{H}_f$ maps an
element $x \in \mathbb{R}^N$ to its matrix of second derivatives, and
is defined by applying the gradient operator twice (with a
transposition in between),
$$
\textrm{H}_f(x)
=
\nabla \nabla^{\top} f(x)
=
\nabla
\begin{bmatrix}
\frac{\partial}{\partial x_1} f(x)
\\
\vdots
\\
\frac{\partial}{\partial x_N} f(x)
\end{bmatrix}
=
\begin{bmatrix}
\nabla \frac{\partial}{\partial x_1} f(x)
\\
\vdots
\\
\nabla \frac{\partial}{\partial x_N} f(x)
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial^2}{\partial x_1 \partial x_1} f(x)
& \cdots &
\frac{\partial^2}{\partial x_1 \partial x_N} f(x)
\\
\vdots & \ddots & \vdots
\\
\frac{\partial^2}{\partial x_N \partial x_1} f(x)
& \cdots &
\frac{\partial^2}{\partial x_N \partial x_N} f(x)
\end{bmatrix}.
$$
Elementwise, the entries in the Hessian evaluated at $x$ are
$$
\textrm{H}_f(x)[m, n]
= \frac{\partial^2}{\partial x_m \partial x_n} f(x)
= \frac{\partial}{\partial x_m} \frac{\partial}{\partial x_n} f(x).
$$
The Hessian matrix is symmetric, with
$$
\textrm{H}_f(x)[m, n] = \textrm{H}_f(x)[n, m].
$$
The diagonals are the second partial derivatives,
$$
\textrm{H}_f(x)[n, n]
= \frac{\partial}{\partial x_n} \frac{\partial}{\partial x_n} f(x)
= \frac{\partial^2}{\partial x_n^2} f(x).
$$
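A Hessian can be approximated with second-order finite differences of $f$ itself, as in this sketch (the function, point, and step size are illustrative); the symmetry of the result is visible directly.

```{r}
# Hessian of f(x) = x[1]^2 * x[2] by second differences of f;
# the exact Hessian is rbind(c(2 * x[2], 2 * x[1]), c(2 * x[1], 0)).
f <- function(x) x[1]^2 * x[2]
x <- c(1, 2)
eps <- 1e-4                                # second differences are noisier

H_fd <- outer(seq_along(x), seq_along(x), Vectorize(function(m, n) {
  em <- replace(numeric(length(x)), m, 1)
  en <- replace(numeric(length(x)), n, 1)
  (f(x + eps * (em + en)) - f(x + eps * em) -
     f(x + eps * en) + f(x)) / eps^2       # second difference in x[m], x[n]
}))
H_exact <- rbind(c(2 * x[2], 2 * x[1]), c(2 * x[1], 0))
list(finite_difference = H_fd, exact = H_exact)
```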
## Hessian-vector products
Suppose $f : \mathbb{R}^N \rightarrow \mathbb{R}$ is a smooth
function. The product of the Hessian of $f$ at a point
$x \in \mathbb{R}^N$ and an arbitrary vector $v \in \mathbb{R}^N$
can be calculated as the gradient of a vector-gradient product,
$$
\textrm{H}_f(x) \cdot v
= \left( \nabla \nabla^{\top} f(x) \right) \cdot v
= \nabla \left( \nabla f(x) \cdot v \right)^{\top}.
$$
This can be verified elementwise,
$$
\begin{array}{rcl}
\left(\textrm{H}_f(x) \cdot v\right)\![m]
& = &
\textrm{H}_f(x)[m] \cdot v
\\[4pt]
& = &
\sum_{n = 1}^N \textrm{H}_f(x)[m, n] \cdot v[n]
\\[4pt]
& = &
\sum_{n = 1}^N \left( \nabla \nabla^{\top} f(x) \right)[m, n]
\cdot v_n
\\[4pt]
& = &
\sum_{n = 1}^N \frac{\partial^{2}}{\partial x_m \partial x_n} f(x)
\cdot v_n
\\[4pt]
& = &
\sum_{n = 1}^N \frac{\partial}{\partial x_m} \frac{\partial}{\partial
x_n} \left( f(x) \cdot v_n \right)
\\[4pt]
& = &
\frac{\partial}{\partial x_m} \sum_{n = 1}^N \frac{\partial}{\partial
x_n} \left( f(x) \cdot v_n \right)
\\[4pt]
& = &
\nabla \left( \nabla f(x) \cdot v \right)^{\top}[m].
\end{array}
$$
The constants $v_n$ can be moved inside the derivatives because they do
not depend on $x$, and the outer derivative $\partial / \partial x_m$
can be pulled out of the sum.
Because Hessians are symmetric, $v$ can be multiplied on the left or
the right,
$$
v^{\top} \cdot \textrm{H}_f(x)
= \left( \textrm{H}_f(x) \cdot v \right)^{\top}.
$$
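The identity above suggests a way to approximate a Hessian-vector product without forming the full Hessian: differentiate the scalar $\nabla f(x) \cdot v$ with respect to $x$. The sketch below (the function, point, and vector are illustrative) uses the analytic gradient for the inner product and finite differences for the outer gradient, and compares against the explicit Hessian times $v$.

```{r}
# Hessian-vector product for f(x) = x[1]^2 * x[2], computed as the
# gradient of the scalar g(x) = grad_f(x) . v, using the analytic
# gradient (2 * x[1] * x[2], x[1]^2) and finite differences.
grad_f <- function(x) c(2 * x[1] * x[2], x[1]^2)
x <- c(1, 2)
v <- c(3, -1)
eps <- 1e-6

g <- function(x) sum(grad_f(x) * v)        # gradient-vector product as a scalar
hvp_fd <- vapply(seq_along(x), function(m) {
  e <- replace(numeric(length(x)), m, 1)
  (g(x + eps * e) - g(x)) / eps            # m-th element of the outer gradient
}, numeric(1))

H <- rbind(c(2 * x[2], 2 * x[1]), c(2 * x[1], 0))  # analytic Hessian of f
rbind(gradient_of_gvp = hvp_fd, hessian_times_v = as.numeric(H %*% v))
```

The appeal of this formulation is that the full $N \times N$ Hessian never needs to be stored; only gradients of a scalar function are required.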
## References {-}
An excellent reference for both matrix algebra and multivariate
differential calculus is [@magnus:2019].