Advanced Deep Learning with Python=Ivan Vasilev;Note=Erxin
# Introduction
- reference
https://learning.oreilly.com/library/view/advanced-deep-learning/9781789956177/6960ab63-0339-42da-9431-41d15760b2ed.xhtml
- terms
long short-term memory (LSTM) and gated recurrent unit (GRU)
- VGG, ResNet, MobileNets, GoogLeNet, Inception, Xception, and DenseNets. We'll also implement ResNet and Xception/MobileNets
- Natural Language ToolKit's (NLTK) text processing techniques
- Neural Turing Machines (NTM), differentiable neural computers, and memory-augmented neural networks (MANN).
- neural networks (NNs)—the cornerstone of deep learning (DL).
# Linear algebra
- Magnitude (or length) is a generalization of the Pythagorean theorem for an n-dimensional space: |x| = sqrt(x_1^2 + x_2^2 + ... + x_n^2)
- Rank: Indicates the number of array dimensions.
- Shape: The size of each dimension.
- Vector and matrix operations (a NumPy sketch follows this list)
Dot product
Matrix multiplication
Broadcasting
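A minimal NumPy sketch of the operations above (plus magnitude, rank, and shape); the array values are arbitrary examples:
import numpy as np
x = np.array([1.0, 2.0, 2.0])
y = np.array([3.0, 0.0, 4.0])
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])      # rank 2, shape (2, 3)
print(np.linalg.norm(x))             # magnitude: sqrt(1 + 4 + 4) = 3.0
print(np.dot(x, y))                  # dot product: 1*3 + 2*0 + 2*4 = 11.0
print(A @ x)                         # matrix-vector multiplication -> shape (2,)
print(A + 1.0)                       # broadcasting: the scalar is stretched over A
print(A.ndim, A.shape)               # rank and shape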
- probability
Theoretical: The number of outcomes making up the event we're interested in compared to the total number of possible outcomes, all assumed equally likely
Empirical: This is the number of times an event we're interested in occurs compared to the total number of trials
- random variables and probability distributions
Discrete, which can take distinct separate values. For example, the number of goals in a football match is a discrete variable.
Continuous, which can take any value within a given interval. For example, a height measurement is a continuous variable.
Probability mass function (PMF) for discrete variables.
Probability density function (PDF) for continuous variables
- mean (or expected value) is the expected outcome of an experiment over many observations.
- standard deviation measures the degree to which the values of the random variable differ from the expected value
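A small sketch (assuming a fair six-sided die as the example experiment) of the empirical mean and standard deviation over many observations:
import numpy as np
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)   # simulate 100,000 rolls of a fair die
print(rolls.mean())                        # close to the theoretical mean of 3.5
print(rolls.std())                         # close to the theoretical sqrt(35/12) ~ 1.71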
- binomial distribution; its special case with a single trial (n = 1) is the Bernoulli distribution
P(X = x) = n! / (x!(n-x)!) * p^x * (1-p)^(n-x)
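A direct Python translation of the binomial PMF above (binomial_pmf is my own helper name):
from math import comb
def binomial_pmf(x, n, p):
    # P(X = x) = n! / (x!(n-x)!) * p^x * (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p)**(n - x)
print(binomial_pmf(3, 10, 0.5))   # probability of exactly 3 heads in 10 fair coin tosses: 0.1171875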
- information theory
we can define the amount of information (or self-information) of an event x as follows:
I(x) = -log P(x)
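A tiny sketch of self-information, using a base-2 logarithm so the result is measured in bits:
from math import log2
def self_information(p):
    # I(x) = -log P(x)
    return -log2(p)
print(self_information(0.5))    # 1.0 bit: a fair coin flip
print(self_information(1 / 8))  # 3.0 bits: rarer events carry more information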
- differential calculus
chain rule
sum rule
derivatives of common functions
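A numeric sanity check of the chain and sum rules on an arbitrary example function, compared against a finite-difference estimate:
from math import sin, cos
def f(x):
    return sin(x**2) + x**3
def f_prime(x):
    # chain rule: d/dx sin(x^2) = cos(x^2) * 2x; sum rule adds d/dx x^3 = 3x^2
    return cos(x**2) * 2 * x + 3 * x**2
x, h = 1.3, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)   # central finite difference
print(f_prime(x), numeric)                  # the two values agree closely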
- NNs: a NN is a function (let's denote it with f) that tries to approximate another target function g
g(x) ≈ f_theta(x), where theta are the NN's parameters (weights)
- neurons
y_j = f(sum_(i=1..m) x_i * w_(i,j) + b_j)
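A minimal sketch of the single-neuron formula above; the tanh activation and the input values are arbitrary choices:
import numpy as np
def neuron(x, w, b, f=np.tanh):
    # y = f(sum_i x_i * w_i + b)
    return f(np.dot(x, w) + b)
x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.2])   # weights
print(neuron(x, w, 0.05))        # activation of one unit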
- main types of NN
Feed-forward, which are represented by acyclic graphs.
Recurrent (RNN), which are represented by cyclic graphs.
- activation functions (a code sketch of several follows this list)
Sigmoid: Its output is bounded between 0 and 1 and can be interpreted stochastically as the probability of the neuron activating.
Hyperbolic tangent (tanh): The name speaks for itself. The principal difference with the sigmoid is that the tanh is in the (-1, 1) range.
Leaky ReLU: When the input is larger than 0, leaky ReLU repeats its input in the same way as the regular ReLU does; when the input is smaller than 0, it outputs a small fraction of it, alpha * x (for example, alpha = 0.01), instead of 0.
Parametric ReLU (PReLU, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, https://arxiv.org/abs/1502.01852): the same as leaky ReLU, but the slope alpha is a parameter learned during training.
Exponential linear units (ELU, Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), https://arxiv.org/abs/1511.07289):
f(x) = { x if x >= 0
{ alpha * (e^x - 1) if x < 0
Scaled exponential linear units (SELU, Self-Normalizing Neural Networks, https://arxiv.org/abs/1706.02515): This activation is similar to ELU but scaled by a constant lambda; the constants lambda ≈ 1.0507 and alpha ≈ 1.6733 are chosen so that the activations self-normalize (keep zero mean and unit variance across layers).
f(x) = lambda * { x if x >= 0
{ alpha * (e^x - 1) if x < 0
Softmax: f(z_i) = e^(z_i) / sum_(j) e^(z_j). The softmax output has some important properties:
Every value f(z_i) is in the [0, 1] range.
The total sum of the outputs is equal to 1: sum_(i) f(z_i) = 1.
An added bonus (in fact, obligatory) is that the function is differentiable.
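A NumPy sketch of several of the activations above (leaky ReLU, ELU, SELU, and softmax); the SELU constants are the approximate values from the paper:
import numpy as np
def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)
def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))
def selu(x, lam=1.0507, alpha=1.6733):
    return lam * elu(x, alpha)    # scaled ELU with the self-normalizing constants
def softmax(z):
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()
s = softmax(np.array([1.0, 2.0, 3.0]))
print(s, s.sum())                 # every value is in [0, 1] and they sum to 1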
- cost functions (a short code sketch follows this list)
MAE (mean absolute error)
Huber loss
cross-entropy loss
KL divergence loss: like cross-entropy loss, we already did the grunt work in the Information theory section, where we derived the relationship between KL divergence and cross-entropy loss.
backpropagation
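Illustrative NumPy sketches of two of the cost functions above (the function names are my own):
import numpy as np
def mae(y_true, y_pred):
    # mean absolute error
    return np.mean(np.abs(y_true - y_pred))
def cross_entropy(p_true, q_pred, eps=1e-12):
    # cross-entropy between the target distribution p and the predicted distribution q
    return -np.sum(p_true * np.log(q_pred + eps))
print(mae(np.array([1.0, 2.0]), np.array([1.5, 1.0])))                      # 0.75
print(cross_entropy(np.array([0.0, 1.0, 0.0]), np.array([0.1, 0.8, 0.1])))  # -log(0.8) ~ 0.223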
- Adam adaptive learning rate algorithm (Adam: A Method for Stochastic Optimization, https://arxiv.org/abs/1412.6980).
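A sketch of a single Adam update step, following the moment-estimate formulas from the referenced paper; the toy quadratic loss is just for illustration:
import numpy as np
def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)                # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 4):
    grad = 2 * theta                          # gradient of the toy loss sum(theta^2)
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                                  # the parameters move toward the minimum at 0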
# Computer vision
- Convolutional Neural Networks (CNNs) and their applications in Computer Vision (CV). CNNs started the modern deep learning revolution.
- Generative Adversarial Networks (GANs), object detection, image segmentation, neural style transfer,
- understand cnn
s(t) = (f*g)(t) = integral_(-inf, +inf) f(τ) * g(t-τ) dτ
The convolution operation is denoted with *.
f and g are two functions with a common parameter, t.
The result of the convolution is a third function, s(t) (not just a single value)
The convolution of f and g at value t is the integral of the product of f(τ) and the reversed (mirrored) and shifted g(t-τ), where t-τ represents the shift. That is, for a single value of t, we shift g over the range (-inf, +inf) and we compute the product f(τ)g(t-τ) continuously because of the integral. The integral (and hence the convolution) is equivalent to the area under the curve of the product of the two functions.
In the convolution operation, g is shifted and reversed in order to preserve the operation's commutative property. In the context of CNNs, we can ignore this property and we can implement it without reversing g. In this case, the operation is called cross-correlation. These two terms are used interchangeably.
generalize it to the discrete convolution of functions with two shared input parameters, i and j:
s(i, j) = (f*g)(i, j) = sum_(a) sum_(b) f(a, b) * g(i - a, j - b)
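A naive Python sketch of the discrete 2D convolution above, implemented as cross-correlation (without flipping the kernel, as noted earlier); conv2d is my own helper name:
import numpy as np
def conv2d(inp, kernel):
    kh, kw = kernel.shape
    oh, ow = inp.shape[0] - kh + 1, inp.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # each output cell is the weighted sum over its receptive field
            out[i, j] = np.sum(inp[i:i + kh, j:j + kw] * kernel)
    return out
image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.array([[1.0, -1.0]])                   # simple horizontal edge filter
print(conv2d(image, kernel))                       # output shape (4, 3)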
+ cnn
In CNNs, the function f is the input of the convolution operation (the operation itself is also referred to as the convolutional layer). Depending on the number of input dimensions, we have 1D, 2D, and 3D convolutions: a time series input is a 1D vector, an image input is a 2D matrix, and a 3D point cloud is a 3D tensor.
The input cells that contribute to a single output cell are called its receptive field. The value of that output cell is the sum over the receptive field of each input multiplied by the corresponding filter weight.
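A short PyTorch illustration (my choice of framework, not necessarily the book's example) of how the convolution's dimensionality follows the input's, with batch and channel axes added in front:
import torch
import torch.nn as nn
conv1d = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3)   # time series
conv2d = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)   # RGB image
conv3d = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=3)   # volumetric grid
print(conv1d(torch.randn(1, 1, 32)).shape)          # (1, 4, 30)
print(conv2d(torch.randn(1, 3, 32, 32)).shape)      # (1, 4, 30, 30)
print(conv3d(torch.randn(1, 1, 16, 16, 16)).shape)  # (1, 4, 14, 14, 14)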