---
title: "Introduction to Machine Learning"
author: "Ismaël Lajaaiti"
date: "06/03/2022"
output: html_document
editor_options:
  markdown:
    wrap: 72
  chunk_output_type: console
---
Inspired by [this tutorial](https://torch.mlverse.org/start/guess_the_correlation/).

## Set up
```{r, echo = F}
set.seed(113) # set seed for reproducibility
```
Install the machine learning framework (torch) and its companion package (torchvision), if they are not already installed.
```{r}
installed.pkg = installed.packages()[,1] # installed packages on your system
# If not installed, download the required packages.
required_pkg = c("torch", "torchvision") # list of required packages
for (pkg in required_pkg) {
  if (!is.element(pkg, installed.pkg)) {install.packages(pkg)}
}
```
In addition, install the package that provides the dataset we will be using.
```{r}
remotes::install_github("mlverse/torchdatasets")
```
Load libraries.
```{r}
library("torch")
library("torchvision")
library("torchdatasets")
```
Quick test to check if torch is installed correctly:
```{r, collapse = T}
torch_tensor(1)
```
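If you want a couple of extra checks, the optional sketch below (not part of the original tutorial) creates a small random tensor and converts it back to a regular R array.
```{r, collapse = T}
# Optional sanity checks: create a random tensor and bring it back to R.
x = torch_randn(2, 3) # 2 x 3 tensor of standard normal draws
dim(x)                # shape of the tensor
as.array(x)           # convert back to a regular R array
```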
## Prepare the dataset
*Guess the correlation* is a dataset that asks one to guess the
correlation between two variables from their scatter plot.
For example, can you guess the correlation between the variables
displayed in the scatter plots below?
```{r, echo = F, fig.width = 9, fig.height = 5}
par(mfrow=c(1,2)) # 1 row, 2 columns
lim = c(-0.2,1.2) # axis limits
n = 500 # number of observations
# Plot two uncorrelated variables.
X1 = runif(n)
Y1 = runif(n)
plot(X1, Y1, xlim = lim, ylim = lim, pch = 20)
# Plot two correlated variables.
X2 = runif(n)
Y2 = X2 + rnorm(n, 0, 0.1)
plot(X2, Y2, xlim = lim, ylim = lim, pch = 20)
```
The answer is...
```{r, collapse = T}
cat("cor(X1,Y1) = ", cor(X1,Y1))
cat("cor(X2,Y2) = ", cor(X2,Y2))
```
This dataset contains 150,000 observations; each observation consists of
a scatter plot (input) and the corresponding correlation (output).
As the dataset is large, we will restrict ourselves to a subset to reduce
the training time of the neural network.
```{r}
train_ind = 1:10000 # 10,000 observations for training
valid_ind = 10001:15000 # 5,000 observations for validation
test_ind = 15001:20000 # 5,000 observations for testing
```
Let's download the data and pick the indices for the train set.
```{r, collapse = T}
root = file.path(tempdir(), "correlation")
train_ds = guess_the_correlation_dataset(root = root, indexes = train_ind,
                                         download = T)
```
First, let's have a look at the first observation to see what it looks like.
Each observation has two parts:
- `$x` = the input data, here the scatter plot
- `$y` = the expected output (or the 'label'), here the correlation
```{r}
first_element = train_ds[1]
input = first_element$x # torch tensor (150,150) = scatter plot
output = first_element$y # torch tensor (1,) = correlation coef.
```
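As a quick check (a small sketch using the objects created just above), we can look at the shape of the input and at the target value itself.
```{r, collapse = T}
dim(input) # shape of the scatter plot image, in pixels
output     # the correlation coefficient the network should recover
```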
The scatter plot is encoded as a tensor, but we can plot it as follows:
```{r, fig.width = 4, fig.height = 4}
input %>% as.array() %>% as.raster() %>% plot()
```
The scatter plot includes the axes, which are unnecessary for inferring the
correlation, so let's remove them by cropping the picture.
```{r, fig.width = 4, fig.height = 4}
cropped_input = input[0:130, 21:150] # input without axis
cropped_input %>% as.array() %>% as.raster() %>% plot() # plot to check
```
We have already created the training set above (`train_ds`). Now we
create the validation and test sets. As the data was already
downloaded, we do not need to download it again; we just pick
different indices.
```{r}
# Validation set.
valid_ds = guess_the_correlation_dataset(root = root, indexes = valid_ind,
                                         download = F)
# Test set.
test_ds = guess_the_correlation_dataset(root = root, indexes = test_ind,
                                        download = F)
```
Quick check of the sizes of our train, validation and test sets.
```{r collapse=TRUE}
length(train_ds)
length(valid_ds)
length(test_ds)
```
To make the neural network training faster, we will work with *batches*.
A batch is a subset containing several observations (here 64).
During training, instead of going over each observation
individually, the neural network processes all the observations
belonging to the same batch simultaneously. In the torch framework, splitting
each set into batches is handled by objects called `dataloader`s.
```{r}
train_dl = dataloader(train_ds, batch_size = 64, shuffle = T)
valid_dl = dataloader(valid_ds, batch_size = 64, shuffle = F)
test_dl = dataloader(test_ds , batch_size = 64, shuffle = F)
```
The length of the dataloader is equal to its number of batches. The
latter can be recovered by dividing the number of observations in the
dataset by the batch size (here 64) and rounding up.
```{r collapse = T}
length(train_dl)
ceiling(length(train_ds) / 64)
```
Let's have a look at the first batch.
```{r collapse = T}
first_batch <- dataloader_make_iter(train_dl) %>% dataloader_next()
dim(first_batch$x) # 64 first scatter plots (64, 150, 150)
dim(first_batch$y) # 64 first correlations (64,)
```
Our data is now prepared and ready to be fed to a neural network,
so let's build one.
## Build the neural network
In the torch framework, a neural network is defined by two functions:
1. `initialize`, which describes the structure of the network, i.e. the
number of layers and the number of neurons per layer
2. `forward`, which describes how the data flows through the layers
defined previously
```{r}
torch_manual_seed(113) # set seed for reproducibility
# Create neural network.
dnn <- nn_module(
  "corr-dnn", # network name
  # Define layers.
  initialize = function(n_in, n_hidden, n_out) {
    self$fc1 <- nn_linear(in_features = n_in    , out_features = n_hidden)
    self$fc2 <- nn_linear(in_features = n_hidden, out_features = n_hidden)
    self$fc3 <- nn_linear(in_features = n_hidden, out_features = n_hidden)
    self$fc4 <- nn_linear(in_features = n_hidden, out_features = n_out)
  },
  # How input goes through the network.
  forward = function(x) {
    x %>%              # input
      self$fc1() %>%   # 1st layer
      nnf_relu() %>%   # activation function
      self$fc2() %>%   # 2nd layer
      nnf_relu() %>%   # activation function
      self$fc3() %>%   # etc.
      nnf_relu() %>%   # ...
      self$fc4()       # output
  }
)
```
The `initialize` function takes three arguments:
1. `n_in`: the size of the input, here the number of pixels in the
scatter plot; we flatten the 2D picture of size (height, width) into
a 1D vector of size (height\*width), as shown in the sketch below
2. `n_hidden`: the number of neurons in the hidden layers
3. `n_out`: the size of the output, here 1 as the output is the correlation
Arguments can be added, removed or replaced depending on your needs; this is
just a simple example that fits most situations.
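A minimal sketch of what this flattening means, on a tiny made-up 2 x 3 tensor (the values are arbitrary, only the reshaping matters):
```{r, collapse = T}
tiny = torch_randn(2, 3) # a small "picture" of height 2 and width 3
tiny$reshape(c(2 * 3))   # the 2d picture becomes a 1d vector of length 6
```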
```{r}
height = dim(cropped_input)[1] # height of the picture
width = dim(cropped_input)[2] # width
n_in = height * width # total number of pixels
n_out = 1 # expect one value in output: the correlation
n_hidden = 100 # play with this value to optimize the network
```
We can now call the function to create our neural network and check its
structure.
```{r collapse = T}
network = dnn(n_in, n_hidden, n_out)
opt = optim_adam(params = network$parameters) # for param. opt. during training
network
# ----------------------------------------
# WEIGHTS + BIAS = PARAMETERS
# ----------------------------------------
# n_in*n_out + n_out, for each layer:
n_in * n_hidden + n_hidden
n_hidden * n_hidden + n_hidden
n_hidden * n_hidden + n_hidden
n_hidden * n_out + n_out
```
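To double-check the hand computation above, we can also sum the number of elements of every parameter tensor directly (a small sketch using the `network` object created in this document):
```{r, collapse = T}
sum(sapply(network$parameters, function(p) p$numel())) # total number of parameters
```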
The network is ready to be trained.
## Training loop
One iteration of the training loop is called an *epoch* and
can be summarized as follows:
1. Training over the batches of the training set
    a. Make a prediction on a batch
    b. Compute the prediction error
    c. Change the weights to reduce the prediction error
    d. Repeat for the next batch, until the end of the train set...
2. Validation over the batches of the validation set
    a. Make a prediction on a batch
    b. Compute the prediction error
    c. Repeat for the next batch, until the end of the validation set...
3. Early stopping check
    a. If the error has decreased, continue the training
    b. Else, stop the training
To build that loop, we first specify how the network should process
batches during the training step...
```{r}
trainStep = function(network, batch){
  opt$zero_grad()
  batch_size = dim(batch$x)[1]
  input = batch$x
  input = input[1:batch_size, 0:130, 21:150]    # crop axis
  input = input$reshape(c(batch_size, 130*130)) # flatten input
  output.pred = network(input)$squeeze(2) # correlation predicted by the DNN
  output.true = batch$y                   # expected correlation
  loss = nnf_mse_loss(output.pred, output.true) # error between pred and true
  loss$backward() # backward propagation
  opt$step()
  loss$item() # get value
}
```
... then during the validation step...
```{r}
validStep = function(network, batch){
  batch_size = dim(batch$x)[1]
  input = batch$x
  input = input[1:batch_size, 0:130, 21:150]    # crop axis
  input = input$reshape(c(batch_size, 130*130)) # flatten input
  output.pred = network(input)$squeeze(2) # correlation predicted by the DNN
  output.true = batch$y                   # expected correlation
  loss = nnf_mse_loss(output.pred, output.true) # error between pred and true
  loss$item() # get value
}
```
... and now we can assemble the pieces.
```{r}
trainingLoop = function(network, train_dl, valid_dl,
                        epoch_max = 5, patience = 2){
  # Initialize variables.
  epoch = 1
  count = 0
  loss.previous = 100
  # Training loop.
  while (epoch <= epoch_max & count < patience){
    # Train.
    network$train()
    Loss.train = c()
    coro::loop(for (batch in train_dl){
      loss.train = trainStep(network, batch) # go over the batch
      Loss.train = c(Loss.train, loss.train) # save loss value
    })
    cat(sprintf("Epoch %0.3d/%0.3d - Train - Loss = % 3.5f \n",
                epoch, epoch_max, mean(Loss.train))) # progress
    # Validation.
    network$eval()
    Loss.valid = c()
    coro::loop(for (batch in valid_dl){
      loss.valid = validStep(network, batch) # go over the batch
      Loss.valid = c(Loss.valid, loss.valid) # save loss value
    })
    cat(sprintf("Epoch %0.3d/%0.3d - Valid - Loss = % 3.5f \n",
                epoch, epoch_max, mean(Loss.valid))) # progress
    # Check for early stopping.
    loss.now = mean(Loss.valid)    # mean loss at the current epoch
    if (loss.now > loss.previous){ # if loss has increased, increment counter
      count = count + 1
    } else {                       # else, reset counter and loss baseline value
      count = 0
      loss.previous = loss.now
    }
    epoch = epoch + 1 # next epoch
  }
}
```
Training the network (it can take some time...)
```{r collapse = T}
trainingLoop(network, train_dl, valid_dl, epoch_max = 5)
```
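If you want to reuse the trained network later without retraining it, you could save it to disk; the sketch below uses torch's serialization functions (the file name is just an example).
```{r, eval = F}
torch_save(network, "corr-dnn.pt")  # save the trained network to disk
network = torch_load("corr-dnn.pt") # reload it in a later session
```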
## Evaluate the network
Our network is now trained and ready to predict correlations from scatter plots
it has *never* seen. That's what the test set is for.
Let's specify how the network is tested on one batch...
```{r}
testStep = function(network, batch){
  batch_size = dim(batch$x)[1]
  input = batch$x
  input = input[1:batch_size, 0:130, 21:150]    # crop axis
  input = input$reshape(c(batch_size, 130*130)) # flatten input
  output.pred = network(input)$squeeze(2) # correlation predicted by the DNN
  output.true = batch$y                   # expected correlation
  output = list(pred = output.pred, true = output.true)
  return(output)
}
```
... and then loop over all the batches of the test set.
```{r}
testLoop = function(network, test_dl){
  Pred = c() # store predicted correlations
  True = c() # store true correlations
  coro::loop(for (batch in test_dl){
    output = testStep(network, batch)          # go over the batch
    Pred = c(Pred, output$pred %>% as.array()) # save pred. values
    True = c(True, output$true %>% as.array()) # save true values
  })
  out = list(true = True, pred = Pred)
  return(out)
}
```
Testing the network.
```{r}
out = testLoop(network, test_dl)
```
The most direct way to get a quick idea of how well our network performs is to
plot the predicted correlations against the true correlations.
```{r, fig.width = 7, fig.height = 7}
# Plot Pred vs. True.
plot(out$true, out$pred, xlab = "True correlation",
     ylab = "Predicted correlation", cex = 0.5, col = rgb(.1, .1, .1, 0.6))
# Plot 1:1 line.
abline(0, 1, lwd = 4)
# Add legend.
legend("topleft", legend = c("Data", "1:1"), lwd = c(NA, 4), pch = c(1, NA))
```
The performance of the network can be quantified more precisely
(e.g. by decomposing the total error into bias and variance),
but that is beyond the scope of this introduction.
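If you nonetheless want a quick numerical summary, here is a minimal sketch (using the `out` object returned by `testLoop`) that splits the mean squared error into a bias term and a variance term:
```{r, collapse = T}
errors = out$pred - out$true   # prediction errors on the test set
mse  = mean(errors^2)          # mean squared error
bias = mean(errors)            # systematic over- or under-estimation
vari = mean((errors - bias)^2) # spread of the errors around the bias
c(mse = mse, bias = bias, variance = vari) # note: mse = bias^2 + variance
```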