#%%
# in the name of God the most compassionate the most merciful
# in this part, we are going to learn about autoencoders and
# how we can implement them in PyTorch.
# Autoencoders are a kind of network that maps its input
# to a new representation. this is usually referred to as
# compressing the input into a latent space representation.
# this means they accept the data and then downsample it
# until they reach a feature vector of a specified/suitable size,
# and then upsample that feature vector gradually until they
# reach the original size, and then try to reconstruct the
# input (so our input image acts as a label as well!).
# during this process of reconstructing the input data
# from the compressed representation, the new representation is
# developed and can be used for various applications.
# the first part of the network that downsamples the input data
# into a feature vector is called an "Encoder", and the part that
# reconstructs the input from the mentioned feature vector is called
# a "Decoder".
# when we have successfully trained our autoencoder, we can use its
# new representation instead of our original data. so we can use it for
# dimensionality reduction just like PCA (if linear) and in a much more
# powerful way when using a deep nonlinear autoencoder!
# we can use the new representation for lots of applications, including
# sending/storing the reduced representation instead of the full input
# and reconstructing the input from that representation; this will result
# in a considerable reduction in network traffic or in the space required to store
# the actual data.
# The usage is not limited to such use cases; we can get fancy and creative,
# for example make a black and white image colored again! or denoise our input,
# reconstruct missing parts, create new data/images, visualizations, etc!
# there are lots and lots of use cases for autoencoders.
# However, note that the notion of compression spoken of here is different from
# what you find in media formats such as jpeg, mp3, etc.
# Autoencoders are data-specific and thus usually have difficulty
# generalizing well to unseen data. more on this later.
# There are different kinds of Autoencoders: they can be linear or
# nonlinear, shallow or deep, convolutional or not, etc.
# we will cover some of the most famous variants here.
# lets start
# before we start, lets get familiar with a couple of concepts
# note :
# https://www.statisticshowto.datasciencecentral.com/posterior-distribution-probability/
# Posterior probability is the probability an event will happen after all evidence or
# background information has been taken into account. It is closely related to prior probability,
# which is the probability an event will happen before you take any new evidence into account.
# You can think of posterior probability as an adjustment on prior probability:
# Posterior probability = prior probability + new evidence (called likelihood).
# For example, historical data suggests that around 60% of students who start college will
# graduate within 6 years. This is the prior probability. However, you think that figure is
# actually much lower, so set out to collect new data. The evidence you collect suggests that
# the true figure is actually closer to 50%; This is the posterior probability.
# What is a Posterior Distribution?
# The posterior distribution is a way to summarize what we know about uncertain quantities in
# Bayesian analysis. It is a combination of the prior distribution and the likelihood function,
# which tells you what information is contained in your observed data (the “new evidence”).
# In other words, the posterior distribution summarizes what you know after the data has been
# observed. The summary of the evidence from the new observations is the likelihood function.
# Posterior Distribution = Prior Distribution + Likelihood Function (“new evidence”)
# Posterior distributions are vitally important in Bayesian Analysis. They are in many ways
# the goal of the analysis and can give you:
# Interval estimates for parameters,
# Point estimates for parameters,
# Prediction inference for future data,
# Probabilistic evaluations for your hypothesis.
# ------------------------------------------------------------------------------
# https://www.statisticshowto.datasciencecentral.com/likelihood-function/
# What is a prior probablity :
# https://www.statisticshowto.datasciencecentral.com/prior-probability-uniformative-conjugate/
# Prior Probability: Uninformative, Conjugate
# What is Prior Probability?
# Prior probability is a probability distribution that expresses established beliefs about an
# event before (i.e. prior to) new evidence is taken into account. When the new evidence is used
# to create a new distribution, that new distribution is called posterior probability.
# For example, you’re on a quiz show with three doors. A car is behind one door,
# while the other two doors have goats. You have a 1/3 chance of winning the car. This is the
# prior probability. Your host opens door C to reveal a goat. Since doors A and B are the only
# candidates for the car, the probability has increased to 1/2. The prior probability of 1/3 has
# now been adjusted to 1/2, which is a posterior probability.
# In order to carry out Bayesian inference, you must have a prior probability distribution.
# How you choose a prior is dependent on what type of information you’re working with.
# For example, if you want to predict the temperature tomorrow, a good prior distribution
# might be a normal distribution with this month’s mean temperature and variance.
# Uninformative Priors
# An uninformative prior gives you vague information about probabilities. It’s usually used when
# you don’t have a suitable prior distribution available. However, you could choose to use an
# uninformative prior if you don’t want it to affect your results too much.
# The uninformative prior isn’t really “uninformative,” because any probability distribution
# will have some information. However, it will have little impact on the posterior distribution
# because it makes minimal assumptions about the model. For the temperature example,
# you could use a uniform distribution for your prior, with the minimum values at the record low
# for tomorrow and the record high for the maximum.
# Conjugate Prior
# A conjugate prior has the same distribution as your posterior. For example, if you’re
# studying people’s weights, which are normally distributed, you can use a normal distribution
# of weights as your conjugate prior.
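# (added illustration, not part of the original notes) a tiny numeric sketch of the
# prior -> posterior update described above, using the conjugate Beta prior for
# Bernoulli (graduate / not graduate) observations: the posterior is again a Beta
# distribution whose parameters are the prior parameters plus the observed counts.
# numpy is also imported further down; it is imported here so this snippet is self-contained.
import numpy as np

def beta_bernoulli_posterior(a_prior, b_prior, observations):
    # observations is a list of 0/1 outcomes (1 = graduated)
    successes = int(np.sum(observations))
    failures = len(observations) - successes
    return a_prior + successes, b_prior + failures

# prior belief centered at 60% (Beta(6, 4)); then 10 new students are observed, 5 graduate:
a_post, b_post = beta_bernoulli_posterior(6, 4, [1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
print('posterior mean:', a_post / (a_post + b_post))  # 0.55, pulled from 0.6 towards 0.5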
# lets start!
import datetime
import numpy as np
import torch
import torchvision
from torchvision import datasets, transforms
from torchvision.utils import save_image, make_grid
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt
%matplotlib inline
# We will use MNIST dataset for our experiments here.
# Lets get back to our discussion!
# We mentioned a couple of examples/use cases for autoencoders, but why do we have to
# shrink the size in the encoder part? why do we gradually reduce the input size until
# we reach a feature vector of some size?
# shrinking the size gradually acts as imposing a constraint on the input:
# by doing so, we are forcing the network to choose the important features in
# our input data, the features that carry the essence of our input data and can later
# be used to reconstruct the input. This is why the new resulting representation works
# very well and can be used instead of the input for some applications. The new
# representation simply has the most important features of the input.
# if such a constraint were not present, the network would not be able to learn anything
# meaningful about the distribution of our input data and thus the resulting vector
# would be of no use to us. So we should be shrinking the input until we reach
# a certain size of our liking (based on our usage).
# Note that these new features may not be individually interpretable, especially in the
# case of nonlinear deep autoencoders. there are ways to see what and how a specific
# feature in the resulting feature vector responds to different attributes present in
# an input, but never make the mistake of assuming that e.g. the 10 features in the
# bottleneck layer (our feature vector) represent exact attributes in your input. these features
# may represent complex interactions between several features that define a
# characteristic in your data. anyway, we'll get to this later on.
# Ok, enough talking, lets get busy and build our first autoencoder.
# before we continue, we should pick a dataset. I chose MNIST as it's simple enough
# to be used in different types of autoencoders with quick training time.
# after we create our dataset, we will implement different types of AutoEncoders
dataset_train = datasets.MNIST(root='MNIST',
                               train=True,
                               transform=transforms.ToTensor(),
                               download=True)
dataset_test = datasets.MNIST(root='MNIST',
                              train=False,
                              transform=transforms.ToTensor(),
                              download=True)
batch_size = 128
num_workers = 0
dataloader_train = torch.utils.data.DataLoader(dataset_train,
                                               batch_size=batch_size,
                                               shuffle=True,
                                               num_workers=num_workers,
                                               pin_memory=True)
dataloader_test = torch.utils.data.DataLoader(dataset_test,
                                              batch_size=batch_size,
                                              num_workers=num_workers,
                                              pin_memory=True)
# lets view a sample of our images
def view_images(imgs, labels, rows=4, cols=11):
    # images in pytorch have the shape (channel, h, w) and since we have a
    # batch here, it becomes (batch, channel, h, w). matplotlib expects
    # images to have the shape (h, w, c), so we transpose the axes here!
    imgs = imgs.detach().cpu().numpy().transpose(0, 2, 3, 1)
    fig = plt.figure(figsize=(8, 4))
    for i in range(imgs.shape[0]):
        ax = fig.add_subplot(rows, cols, i+1, xticks=[], yticks=[])
        # since mnist images are 1-channeled (i.e. grayscale), matplotlib
        # only accepts these kinds of images without any channel axis, i.e.
        # instead of the shape 28x28x1, it wants 28x28
        ax.imshow(imgs[i].squeeze(), cmap='Greys_r')
        ax.set_title(labels[i].item())
    plt.tight_layout(pad=1, rect=(0, 0, 40, 40))
# now lets view some
imgs, labels = next(iter(dataloader_train))
view_images(imgs, labels,13,10)
# good! we are ready for the actual implementation
#%%
# The first autoencoder we are going to implement is the simplest one,
# a linear autoencoder.
# creating an autoencoder is just like any other module we have seen so far: simply
# inherit from nn.Module, define the needed layers and call them in the forward()
# method the way you should. lets do this:
class LinearAutoEncoder(nn.Module):
    def __init__(self, embeddingsize=32):
        super().__init__()
        # lets define our autoencoder. we have two parts, an encoder
        # and a decoder.
        # the encoder shrinks the input gradually until it becomes
        # a certain size, and the decoder accepts that as input and
        # gradually upsamples it to reach the actual input size.
        # The encoder part:
        # our encoder is simply a linear (fully connected) layer that
        # accepts the input. since this is a linear layer,
        # we have to flatten the input, and our 28x28 image
        # will simply have 28x28=784 input features.
        # The simplest form can be a one-layer encoder
        # and a one-layer decoder! of course we can add more
        # layers between them, but lets see how this performs
        self.fc1 = nn.Linear(28*28, embeddingsize)
        # our decoder part
        self.fc2 = nn.Linear(embeddingsize, 28*28)

    def forward(self, inputs):
        # our forward pass is nothing special,
        # simply feed these layers in order!
        # but before that, we must flatten our input!
        inputs = inputs.view(inputs.size(0), -1)
        # encoder part
        output = self.fc1(inputs)
        # decoder part
        output = self.fc2(output)
        # since we want an image in the output, not a flattened
        # vector, we reshape our output again!
        output = output.view(-1, 1, 28, 28)
        return output
model_linear_ae = LinearAutoEncoder()
print(model_linear_ae)
#%%
# now lets train our model.
# since we compare the output of our network with our input
# we use MSELoss for this.
criterion = nn.MSELoss()
def train(model, dataloader, optimizer, scheduler, epochs, device):
    for e in range(epochs):
        # we dont need the labels, so we use _ for them, as is the convention
        for i, (imgs, _) in enumerate(dataloader):
            imgs = imgs.to(device)
            reconstructed_images = model(imgs)
            loss = criterion(reconstructed_images, imgs)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if i % 2000 == 0:
                print(f'epoch: ({e}/{epochs}) loss: {loss.item():.6f} lr:{scheduler.get_lr()}')
        scheduler.step()
    print('done')

# Now lets see the output of our autoencoder
def test(model, device):
    imgs, labels = next(iter(dataloader_test))
    imgs = imgs.to(device)
    outputs = model(imgs)
    view_images(outputs, labels)
#%%
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_linear_ae = model_linear_ae.to(device)
optimizer = optim.Adam(model_linear_ae.parameters(), lr = 0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 5)
train(model_linear_ae, dataloader_train, optimizer, scheduler, 20, device)
test(model_linear_ae, device)
# so this is the linear autoencoder! in order to make a vanilla autoencoder,
# which may refer to a version with nonlinear activation functions, you
# only need to apply a nonlinear activation function
# in the forward pass, and in order to get a good result, you need to add a few more
# layers (we do this in the next architecture).
# we can get better results with more epochs and a decaying learning rate,
# but it won't make a drastic change, especially on more complex data, as it's
# just a linear model.
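# (added sketch, not part of the original code) once the linear autoencoder is trained,
# the 32-d output of fc1 can be used on its own as a compressed, PCA-like representation
# of each image, e.g. for visualization or as features for another model.
# `imgs` below is the sample batch we visualized earlier.
def encode_batch(model, imgs):
    # flatten the images and run only the encoder layer (fc1) of LinearAutoEncoder
    with torch.no_grad():
        flat = imgs.view(imgs.size(0), -1).to(next(model.parameters()).device)
        return model.fc1(flat)

embeddings = encode_batch(model_linear_ae, imgs)
print(embeddings.shape)  # (batch, 32)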
#%%
# in order to be able to capture more complex structure in the input data,
# one way is to add more hidden layers!
# Now lets create a multi-layer autoencoder!
class MLPAutoEncoder(nn.Module):
    def __init__(self, embeddingsize=32):
        super().__init__()
        self.fc1 = nn.Linear(28*28, 64)
        self.fc2 = nn.Linear(64, embeddingsize)
        # our decoder part
        self.fc3 = nn.Linear(embeddingsize, 64)
        self.fc4 = nn.Linear(64, 28*28)

    def forward(self, inputs):
        inputs = inputs.view(inputs.size(0), -1)
        # encoder part
        output = F.relu(self.fc1(inputs))
        output = F.relu(self.fc2(output))
        # decoder part
        output = F.relu(self.fc3(output))
        # since the output is an image, values should
        # be in the range [0, 1]!
        output = torch.sigmoid(self.fc4(output))
        output = output.view(-1, 1, 28, 28)
        return output
model_mlp_ae = MLPAutoEncoder().to(device)
print(model_mlp_ae)
# criterion = nn.MSELoss()
optimizer = optim.Adam(model_mlp_ae.parameters(), lr = 0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 5)
train(model_mlp_ae, dataloader_train, optimizer, scheduler, 20, device)
test(model_mlp_ae,device)
#%%
# While our mlp model is more powerful than the previous model, it is not well suited for data such as images.
# for image-like data, we use conv layers! and hence our new autoencoder is a Convolutional AutoEncoder.
# lets see how to implement this:
# something that needs to be said is that, when the number of layers is increased, i.e.
# your network gets deeper, you may see that your model sometimes trains and sometimes
# does not, and the loss may not decrease. when you see this, you should know it is
# happening because of the depth of your network. use batchnorm and all will be good.
# thats why I created two helper functions for this very purpose. try creating your network with
# and without batch normalization enabled and see the difference (try running the version
# without batch normalization several times to see that it sometimes works and sometimes
# fails, but with batchnorm it trains reliably every time!)
def conv_bn(in_, out_, k_size=3, s=2, pad=0, bias=False, batchnorm=True):
    layers = []
    layers.append(nn.Conv2d(in_, out_, kernel_size=k_size, stride=s, padding=pad, bias=bias))
    if batchnorm:
        layers.append(nn.BatchNorm2d(out_))
    return nn.Sequential(*layers)

def deconv_bn(in_, out_, k_size=4, s=2, pad=0, bias=False, batchnorm=True):
    layers = []
    layers.append(nn.ConvTranspose2d(in_, out_, kernel_size=k_size, stride=s, padding=pad, bias=bias))
    if batchnorm:
        layers.append(nn.BatchNorm2d(out_))
    return nn.Sequential(*layers)
class ConvAutoEncoder(nn.Module):
    def __init__(self, embeddingsize=32):
        super().__init__()
        # for conv layers, since we are dealing with 3d feature maps,
        # we shrink the number of feature maps as well as the spatial
        # dimensions. we do so until we reach a size that satisfies us.
        # (with 28x28 inputs and no padding, the spatial sizes become
        # 26 -> 12 -> 5 -> 2; each stride-2 layer roughly halves the dimensions)
        self.conv1 = conv_bn(1, 256, 3, 1)             # 26x26
        self.conv2 = conv_bn(256, 128, 3, 2)           # 12x12
        self.conv3 = conv_bn(128, 64, 3, 2)            # 5x5
        self.conv4 = conv_bn(64, embeddingsize, 3, 2)  # output is embeddingsize x 2 x 2
        # decoder
        # now for the decoder we have two options: we can simply use a conv layer
        # followed by an upsample layer, or we can use a deconv (transposed
        # convolution) layer. the difference between them is that
        # the transposed-conv approach can result in a checkerboard effect,
        # while that is not the case for the upsample method!
        # we use k=2, s=2, as it upsamples the image 2x.
        # from there we can use different kernel sizes and strides
        # to reach the desired dimensions.
        self.deconv5 = deconv_bn(embeddingsize, 64, 2, 2)
        self.deconv6 = deconv_bn(64, 128, 4, 2)
        self.deconv7 = deconv_bn(128, 256, 5, 2)
        # and since our image has 1 channel, this last layer will produce a single-channel image!
        self.conv8 = deconv_bn(256, 1, 6, 1, 0, True, False)

    def forward(self, x):
        output = F.relu(self.conv1(x))
        output = F.relu(self.conv2(output))
        output = F.relu(self.conv3(output))
        output = F.relu(self.conv4(output))
        output = F.relu(self.deconv5(output))
        output = F.relu(self.deconv6(output))
        output = F.relu(self.deconv7(output))
        # since we want an image, we use sigmoid
        output = torch.sigmoid(self.conv8(output))
        return output
#%%
# now lets train it
model_c= ConvAutoEncoder()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
optimizer = optim.Adam(model_c.parameters(), lr =0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 5)
model_c = model_c.to(device)
train(model_c, dataloader_train, optimizer, scheduler, 20, device)
test(model_c, device)
# As an exercise, try to replace all ConvTranspose2d layers with Conv2d+Upsample
# and see how the outputs turn out!
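# one possible helper for that exercise (an added sketch, not used by the models above):
# Upsample followed by a stride-1 Conv2d avoids the checkerboard artifacts that
# ConvTranspose2d can produce.
def upsample_conv_bn(in_, out_, k_size=3, scale=2, pad=1, bias=False, batchnorm=True):
    layers = [nn.Upsample(scale_factor=scale, mode='nearest'),
              nn.Conv2d(in_, out_, kernel_size=k_size, stride=1, padding=pad, bias=bias)]
    if batchnorm:
        layers.append(nn.BatchNorm2d(out_))
    return nn.Sequential(*layers)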
#%%
# Now lets create more powerful Convolutional AutoEncoders. the vanilla convolutional autoencoder
# is not that powerful. therefore we can use several variants such as:
# denoising autoencoder, sparse autoencoder, variational autoencoder.
# for a denoising autoencoder, our architecture needs to be deep enough because it is
# a more complex task. however, our previous ConvAutoEncoder is deep enough, so we can
# use that here as well.
# basically, in a denoising autoencoder, we feed a noisy image and get a noise-free image.
# so what we will actually do in the training process is to add random noise to our image
# prior to feeding it to our model and then compare the reconstructed image with the actual
# original image, which is noise free. in doing this, the network will learn to remove noise from
# images. we will use the same criterion. nearly 99% of what we saw until now is the same,
# we just add some simple noise. lets see that
noise_threshold = 0.5
epochs = 20
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# the quality and performance of our model in denoising will increase as we
# increase the embedding size.
model = ConvAutoEncoder(1).to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr = 0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 5)
# before we go on lets view a sample of noisy images :
imgs,labels = next(iter(dataloader_test))
imgs = imgs + noise_threshold * torch.rand_like(imgs)
imgs.clamp_(0,1)
view_images(imgs,labels)
print(model)
for e in range(epochs):
    loss_epoch = 0.0
    for imgs, _ in dataloader_train:
        imgs = imgs.to(device)
        # apply noise to our image
        imgs_noisy = imgs + noise_threshold * torch.rand_like(imgs)
        # clip all values outside of [0, 1] because our image values
        # should be in this range!
        imgs_noisy = imgs_noisy.clamp(0, 1)
        imgs_recons = model(imgs_noisy)
        loss = criterion(imgs_recons, imgs)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        loss_epoch += loss.item()
    print(f'epoch: {e}/{epochs} loss: {loss.item()} lr: {scheduler.get_lr()}')
    scheduler.step()
# lets see how the network does on noisy image!
imgs,labels = next(iter(dataloader_test))
imgs = imgs.to(device)
imgs = imgs + noise_threshold * torch.rand_like(imgs)
imgs.clamp_(0,1)
view_images(imgs,labels)
new_noise_free_imgs = model(imgs)
view_images(new_noise_free_imgs,labels)
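# (added check, illustration only) a quick quantitative look at the denoiser:
# compare the clean images against the noisy ones and against the reconstructions.
with torch.no_grad():
    _clean, _ = next(iter(dataloader_test))
    _clean = _clean.to(device)
    _noisy = (_clean + noise_threshold * torch.rand_like(_clean)).clamp(0, 1)
    _denoised = model(_noisy)
    print(f'noisy    vs clean MSE: {F.mse_loss(_noisy, _clean).item():.4f}')
    print(f'denoised vs clean MSE: {F.mse_loss(_denoised, _clean).item():.4f}')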
#%%
# you may ask: so far we have been starting with a large number of channels
# and gradually decreased it while shrinking the spatial extent. what if we do
# the opposite? we begin with few channels and a large spatial extent,
# and then gradually increase the channels and shrink the spatial dimensions until
# we reach a large vector representation with little or no spatial extent. and in the
# decoder, we do the opposite, obviously! lets see how that performs! (tldr: it performs worse!)
class ConvolutionalAutoEncoder_v2(nn.Module):
    def __init__(self, embeddingsize=32):
        super().__init__()
        self.encoder = nn.Sequential(conv_bn(1, 32, 3, 1),
                                     conv_bn(32, 64, 3, 2),
                                     conv_bn(64, 128, 3, 2),
                                     conv_bn(128, embeddingsize, 3, 2))
        self.decoder = nn.Sequential(deconv_bn(embeddingsize, 128, 2, 2),
                                     deconv_bn(128, 64, 4, 2),
                                     deconv_bn(64, 32, 5, 2),
                                     deconv_bn(32, 1, 6, 1, batchnorm=False))

    def forward(self, inputs):
        output = self.encoder(inputs)
        return self.decoder(output)
noise_threshold = 0.5
epochs = 20
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# the quality and performance of our model in denoising will increase as we
# increase the embedding size.
model = ConvolutionalAutoEncoder_v2(32).to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr = 0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 5)
# before we go on lets view a sample of noisy images :
imgs,labels = next(iter(dataloader_test))
imgs = imgs + noise_threshold * torch.rand_like(imgs)
imgs.clamp_(0,1)
view_images(imgs,labels)
print(model)
for e in range(epochs):
    loss_epoch = 0.0
    for imgs, _ in dataloader_train:
        imgs = imgs.to(device)
        # apply noise to our image
        imgs_noisy = imgs + noise_threshold * torch.rand_like(imgs)
        # clip all values outside of [0, 1] because our image values
        # should be in this range!
        imgs_noisy = imgs_noisy.clamp(0, 1)
        imgs_recons = model(imgs_noisy)
        loss = criterion(imgs_recons, imgs)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        loss_epoch += loss.item()
    print(f'epoch: {e}/{epochs} loss: {loss.item()} lr: {scheduler.get_lr()}')
    scheduler.step()
# lets see how the network does on noisy image!
imgs,labels = next(iter(dataloader_test))
imgs = imgs.to(device)
imgs = imgs + noise_threshold * torch.rand_like(imgs)
imgs.clamp_(0,1)
view_images(imgs,labels)
new_noise_free_imgs = model(imgs)
view_images(new_noise_free_imgs,labels)
#%%
# sparse autoencoder: these kinds of autoencoders simply use a regularizer term so that
# the features are more sparse! usually an l1 loss is used!
# In the previous examples, the representations were only constrained by the size of the
# hidden layers. In such a situation, what typically happens is that the hidden layer is
# learning an approximation of PCA (principal component analysis).
# But another way to constrain the representations to be compact is to add a sparsity
# constraint on the activity of the hidden representations, so fewer units would "fire"
# at a given time.
# in order to have sparsity, we need to have overcomplete representations. so lets
# implement a sparse autoencoder in this section and see how it performs.
# as I said earlier, aside from the normal reconstruction loss, we need a new regularizer.
# lets create this regularizer now. We are going to create a Function object that applies
# an l1 penalty; we inherit from the autograd.Function class for this.
# good explanation: https://www.youtube.com/watch?v=7mRfwaGGAPg
import copy  # used for a deep copy of our weights
from torch.autograd import Function  # used for implementing the l1 penalty
class L1Penalty(Function):
    # we override the forward method with our own arguments (input, l1_weight).
    # input is the input obviously, and l1_weight is the percentage of zero weights,
    # that is, 0.1 means we want 10% of our weights to be zero (or near zero),
    # or the sparsity ratio if you will!
    # In the forward pass, we simply save our input and l1_weight for use in the backward pass
    @staticmethod
    def forward(ctx, input, l1_weight):
        ctx.save_for_backward(input)
        ctx.l1_weight = l1_weight
        return input

    # backward must accept a context `ctx` as the first argument, followed by
    # as many outputs as `forward` returned, and it should return as many
    # tensors as there were inputs to `forward`.
    # Each argument is the gradient w.r.t the given output, and each returned
    # value should be the gradient w.r.t. the corresponding input.
    # The context can be used to retrieve tensors saved during the forward
    # pass. It also has an attribute `ctx.needs_input_grad` as a tuple
    # of booleans representing whether each input needs gradient. E.g.,
    # `backward` will have ``ctx.needs_input_grad[0] = True`` if the
    # first input to `forward` needs gradient computed w.r.t. the
    # output.
    @staticmethod
    def backward(ctx, grad_outputs):
        input, = ctx.saved_tensors
        # since we only need gradients with respect to the input,
        # we need to explicitly say we dont need gradients to be
        # calculated for the second argument of the forward method,
        # i.e. l1_weight. so we return None for the arguments that
        # we dont want any gradient for.
        # this is a term that we apply in the backward pass,
        # that is, we are enforcing the constraint by adding
        # a new term to the gradient
        grad_input = input.clone().sign().mul(ctx.l1_weight)
        grad_input += grad_outputs
        # since we have two inputs in our forward pass, we need to
        # provide two gradients in the backward pass. but because
        # we only care about the input and not l1_weight (we dont
        # need any gradients for it because we are not tuning it!),
        # we return None
        return grad_input, None
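# (added sanity check, illustration only) L1Penalty acts as an identity in the forward
# pass; in the backward pass it adds l1_weight * sign(x) to the incoming gradient:
_x = torch.randn(4, requires_grad=True)
L1Penalty.apply(_x, 0.1).sum().backward()
print(_x.grad)  # elementwise equal to 1 + 0.1 * sign(_x)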
# now lets create our architecture
class SparseAutoEncoder(nn.Module):
    def __init__(self, embeddingsize=400, tied_weights=False):
        super().__init__()
        self.tied_weights = tied_weights
        self.encoder = nn.Sequential(nn.Linear(28*28, embeddingsize),
                                     nn.Sigmoid())  # or relu
        self.decoder = nn.Sequential(nn.Linear(embeddingsize, 28*28),
                                     nn.Sigmoid())
        # you may see some people use shared (tied) weights between the encoder
        # and the decoder, i.e. the decoder uses the transposed weight matrix of the
        # encoder. there are a couple of ways to do this.
        # one way is to use the functional form and simply
        # use one weight and its transpose, like this:
        # weight = nn.Parameter(torch.rand(input_dim, output_dim))
        # self.encoder = F.linear(input, weight, bias=False)
        # self.decoder = F.linear(input, weight.t(), bias=False)
        # we can also simply define a new weight and assign it to both modules
        if self.tied_weights:
            weights = nn.Parameter(torch.randn_like(self.encoder[0].weight))
            self.encoder[0].weight.data = weights.clone()
            self.decoder[0].weight.data = self.encoder[0].weight.data.t()

    def forward(self, input):
        input = input.view(input.size(0), -1)
        output_enc = self.encoder(input)
        rec_imgs = self.decoder(output_enc)
        rec_imgs = rec_imgs.view(input.size(0), 1, 28, 28)
        return output_enc, rec_imgs
def sparse_loss_function(outputs_enc, reconstructed_imgs, imgs, penalty_type=0, l1_weight=0.01, Beta=1):
    """
    penalty_type :
        0: sparsity on activations
        1: sparsity using an l1 penalty through gradient enforcement
        2: sparsity using kl divergence
    """
    criterion = nn.MSELoss()
    loss = criterion(reconstructed_imgs, imgs)
    if penalty_type == 0:
        sparsity_loss = torch.mean(abs(outputs_enc))
        return loss + sparsity_loss
    elif penalty_type == 1:
        # apply the l1 penalty on the encoder activations through the extra
        # term added in the backward pass of L1Penalty.
        # note: as written, `output` is not used in the returned loss, so no gradient
        # flows through the L1Penalty node; to make this branch effective, apply
        # L1Penalty to the encoder output before it is fed to the decoder
        # (i.e. inside the model's forward).
        output = L1Penalty.apply(outputs_enc, l1_weight)
        return loss
    else:
        # use kl divergence: calculate ro_hat, which is the
        # mean of the activations in the hidden layer in which
        # we want sparsity.
        # the idea here is that each neuron's activation should be sparse,
        # that means its values need to be zero or close to zero. now
        # how do we do that? we set a threshold, call it ro, and set it
        # to a value e.g. 0.05, and then check the mean of each neuron's
        # activations and call it ro_hat. we compare our ro_hat against
        # our threshold, which is ro! then we penalize all neurons whose
        # ro_hat is larger than the threshold. but how do we compare
        # them? we use kl divergence. why? we can model two Bernoulli
        # distributions p and q with probability of success ro and ro_hat respectively.
        # the idea is to ensure the predicted distribution is as close as possible to the
        # target one, and we can measure this with kl divergence.
        ro_hat = torch.mean(outputs_enc).to(imgs.device)
        ro = torch.ones_like(ro_hat).to(imgs.device) * l1_weight
        # ro and ro_hat must be probabilities; since our encoder uses a
        # sigmoid activation, outputs_enc is already in (0, 1), so its mean
        # can be treated as a probability.
        kl = torch.sum(ro * torch.log(ro / ro_hat) +
                       (1 - ro) * torch.log((1 - ro) / (1 - ro_hat)))
        return loss + (Beta * kl)
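# (added illustration) the kl term above is the KL divergence between two Bernoulli
# distributions with success probabilities ro and ro_hat:
#   KL = ro*log(ro/ro_hat) + (1-ro)*log((1-ro)/(1-ro_hat))
# it is 0 when ro_hat == ro and grows as the mean activation drifts away from the
# target sparsity, which is what pushes the hidden activations towards being sparse:
_ro, _ro_hat = torch.tensor(0.05), torch.tensor(0.5)
print(_ro * torch.log(_ro / _ro_hat) +
      (1 - _ro) * torch.log((1 - _ro) / (1 - _ro_hat)))  # ~0.49 nats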
epochs = 50
penalty_type = 0
# ro 0.01 ~ 0.05 or l1_weight
sparsity_ratio = 0.1
loss_type = 2
# at the end read the Cyclical Annealing Schedule section to get a very good idea about
# how you can achieve better result and why!
Beta = 3
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
sae_model = SparseAutoEncoder(embeddingsize=400,
                              tied_weights=True).to(device)
optimizer = torch.optim.Adam(sae_model.parameters(), lr = 0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 10)
print(sae_model)
# lets save the weights of our encoder and decoders before we train them
# and then compare them with the new weights after training and see how
# they changed!
init_weights_encoder = copy.deepcopy(sae_model.encoder[0].weight.data)
init_weights_decoder = copy.deepcopy(sae_model.decoder[0].weight.data)
imgs_list =[]
# now lets start training !
for e in range(epochs):
    for imgs, _ in dataloader_train:
        imgs = imgs.to(device)
        output_enc, rec_imgs = sae_model(imgs)
        loss = sparse_loss_function(output_enc, rec_imgs, imgs, loss_type, sparsity_ratio, Beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f'epoch: {e}/{epochs} loss: {loss.item():.6f} lr = {scheduler.get_lr()[-1]:.6f}')
    scheduler.step()
    # at each epoch, we sample one image and its reconstruction
    # for viewing later on to see how the training affects the
    # result we get
    imgs_list.append((imgs[0], rec_imgs[0]))
#%%
# now lets first visualize the image/reconstruction pairs and how they look :
def visualize(imgs_list, rows=5, cols=10):
    fig = plt.figure(figsize=(15, 2))
    plt.subplots_adjust(wspace=0, hspace=0)
    print(f'number of samples: {len(imgs_list)}')
    for i in range(len(imgs_list)):
        img, recons = imgs_list[i]
        # print(img.shape, recons.shape)
        img = img.cpu()
        recons = recons.cpu().detach()
        ax = fig.add_subplot(rows, cols, i+1, xticks=[], yticks=[])
        x = torchvision.utils.make_grid([img, recons])
        ax.imshow(x.numpy().transpose(1, 2, 0))

visualize(imgs_list)
# Now lets visualize the weights and see how they look.
# we had the initial weights saved, so lets subtract them
# from the trained ones and look at the diffs; this will show us
# where the changes happened
def visualize_grid(imgs, rows=20, cols=20):
    fig = plt.figure(figsize=(20, 20))
    imgs = imgs.cpu().numpy().transpose(0, 2, 3, 1).squeeze()
    plt.subplots_adjust(wspace=0, hspace=0)
    for i in range(imgs.shape[0]):
        ax = fig.add_subplot(rows, cols, i+1, xticks=[], yticks=[])
        ax.imshow(imgs[i], cmap='Greys_r')

def visualize_grid2(imgs, label, normalize=True):
    fig = plt.figure(figsize=(10, 10))
    imgs = imgs.cpu()
    plt.subplots_adjust(wspace=0, hspace=0)
    x = torchvision.utils.make_grid(
        imgs, nrow=20, normalize=normalize).numpy().transpose(1, 2, 0)
    ax = fig.add_subplot(1, 1, 1, xticks=[], yticks=[])
    ax.imshow(x)
    ax.set_title(label)
trained_W_encoder = sae_model.encoder[0].weight.data.cpu().clone().view(
    sae_model.encoder[0].out_features, 1, 28, 28)
trained_W_decoder = sae_model.decoder[0].weight.data.cpu().clone().view(
    sae_model.decoder[0].in_features, 1, 28, 28)
init_weights_encoder = init_weights_encoder.view(
    sae_model.encoder[0].out_features, 1, 28, 28).cpu()
init_weights_decoder = init_weights_decoder.view(
    sae_model.decoder[0].in_features, 1, 28, 28).cpu()
w_diff_encoder = init_weights_encoder - trained_W_encoder
w_diff_decoder = init_weights_decoder - trained_W_decoder
w_decoders_transposed = sae_model.decoder[0].weight.data.cpu().clone().t()
# in order to see that the decoder's weight is in fact the same as the
# encoder's, lets transpose it again and reshape it.
# here I show both the encoder's weight and our decoder's weight
# transposed!
print(trained_W_encoder.shape)
print(w_decoders_transposed.shape)
w_decoders_transposed = w_decoders_transposed.view(sae_model.encoder[0].out_features, 1, 28, 28)
# note that the decoder weights (in terms of the original data) will be smoothed encoder weights
# (also in terms of the original data)
# info from: https://medium.com/@SeoJaeDuk/arhcieved-post-personal-notes-about-contractive-auto-encoders-part-1-ef83bce72932
# end of the page, in the ppt slide image
print(init_weights_encoder.shape)
visualize_grid2(init_weights_encoder, 'Initial weights')
visualize_grid2(trained_W_encoder, 'Trained weights(Encoder)')
visualize_grid2(w_diff_encoder, 'weights diff (Encoder)')
visualize_grid2(trained_W_decoder,'Trained Weights (Decoder)')
visualize_grid2(w_decoders_transposed,'Trained Weights (Decoder-transposed)')
# black shows negative values, white shows positive values,
# and gray shows values close to zero.
# we start from high positive and high negative values in our initial
# weights. then, after training and imposing sparsity, we can see
# mostly gray colors, which indicates the values are zero!
# and that is what we were after!
# if you look at the w_diff, you can see that there are lots of high and
# low (negative) values as well. this is because, in order to reach more
# reasonable values, the weights had to be decreased/increased.
#%%
# the cool thing about autoencoders is that we can use them to pretrain
# our weights on our data and then use that for classification, etc.
# this was actually done a lot back in the day, until around 2014/2015.
# in that era, the use of the xavier initialization algorithm accompanied by
# batch normalization killed the need for pretraining in this way. but lets
# see how we can do this if the need be.
# its simple: just like finetuning, we may add/remove the layers we want.
# here we will remove the decoder part and instead add a classifier.
# lets remove the decoder
layers_before_decoder = list(sae_model.children())[:-1]
sae_model2 = nn.Sequential(*layers_before_decoder)
# since we created a sequential model here, we should add a new module
# using add_module. because if we simply do something like:
# sae_model2.classifier = nn.Linear(sae_model2[0].out_features, 10)
# classifier will just be an attribute, and for the forward pass we
# would need to do something like
# output = sae_model2.forward(input)
# output = sae_model2.classifier(output)
# so this is not ideal at all. therefore we do:
sae_model2.add_module('classifier', nn.Linear(sae_model2[0][0].out_features, 10))
print(sae_model2)
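# (added note, sketch only) if we instead wanted to keep the pretrained encoder frozen and
# only train the new classifier head (a common finetuning variant), we could do:
# for p in sae_model2[0].parameters():
#     p.requires_grad = False
# and then build the optimizer over sae_model2.classifier.parameters() only.
# below we finetune the whole model, as in the original code.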
#%% now that we have our model built, lets run training and pay attention to
# what accuracy we get right from the first epochs
criterion = nn.CrossEntropyLoss()
epochs = 20
optimizer = torch.optim.SGD(sae_model2.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 5)
acc = 0.0
sae_model2 = sae_model2.to(device)
for e in range(epochs):
    for i, (imgs, labels) in enumerate(dataloader_train):
        imgs = imgs.to(device)
        labels = labels.to(device)
        imgs = imgs.view(imgs.size(0), -1)
        output = sae_model2(imgs)
        loss = criterion(output, labels)
        _, class_idx = torch.max(output, dim=1)
        acc += torch.mean((class_idx.view(*labels.shape) == labels).float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    acc = acc / len(dataloader_train)
    print(f'epoch: ({e}/{epochs}) acc: {acc*100:.4f} loss: {loss.item():.6f} lr: {scheduler.get_lr()[0]:.6f}')
    scheduler.step()
    # reset the running accuracy for the next epoch
    acc = 0.0
# now you can try it without running the autoencoder training and
# see how it performs.
# Important note :
# There is a difference between sparsity on parameters and sparsity on the representation.
# The Sparse Autoencoder proposed by Andrew Ng learns a sparse representation,
# while it is well known that l1 regularization encourages sparsity on parameters.
# They are different; the small illustration below makes this concrete.
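# (added illustration) using the model trained above: l1 regularization on *parameters*
# penalizes the weight values themselves, while the sparse autoencoder penalizes the
# *activations* of the hidden layer.
with torch.no_grad():
    l1_on_parameters = torch.cat([p.abs().flatten() for p in sae_model.parameters()]).mean()
    _sample_imgs, _ = next(iter(dataloader_test))
    _enc_act, _ = sae_model(_sample_imgs.to(device))
    l1_on_representation = _enc_act.abs().mean()
print(f'mean |weight|: {l1_on_parameters.item():.4f}  mean |activation|: {l1_on_representation.item():.4f}')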
# Notes:
# For imposing the sparsity constraint, instead of the l1 norm we can
# also use KL divergence. the principle is the same: we take
# the average of the activations at each layer whose activations we want
# to be sparse, and this time we calculate
# the kl-loss, which looks like this:
# def kl_divergence(p, p_hat):
#     funcs = nn.Sigmoid()
#     p_hat = torch.mean(funcs(p_hat), 1)
#     p_tensor = torch.Tensor([p] * len(p_hat)).to(device)
#     return torch.sum(p_tensor * torch.log(p_tensor) - p_tensor * torch.log(p_hat) +
#                      (1 - p_tensor) * torch.log(1 - p_tensor) - (1 - p_tensor) * torch.log(1 - p_hat))
# finally, this was a simple autoencoder; we can have several layers,
# and you can also use batch normalization, etc. for your deep autoencoders
#%%
# -VAE (Variational Autoencoders)
# -Creating MNIST Like digits
# -The Reparametrization Trick
# Variational Autoencoders (VAEs) have one fundamentally unique property that
# separates them from vanilla autoencoders, and it is this property that makes
# them so useful for generative modeling: their latent spaces are, by design,
# continuous, allowing easy random sampling and interpolation.
# It achieves this by doing something that seems rather surprising at first:
# making its encoder not output an encoding vector of size n, rather, outputting
# two vectors of size n: a vector of means, μ, and another vector of standard
# deviations, σ
# They form the parameters of a vector of random variables of length n, with
# the i-th element of μ and σ being the mean and standard deviation of the i-th
# random variable, X_i, from which we sample, to obtain the sampled encoding
# which we pass onward to the decoder:
# This stochastic generation means that even for the same input, while the mean
# and standard deviations remain the same, the actual encoding will somewhat vary
# on every single pass simply due to sampling.
# read more : https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
# There are other resources for this as well. it is highly recommended to read them:
# https://www.jeremyjordan.me/variational-autoencoders/
# https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
# https://www.youtube.com/watch?v=uaaqyVS9-rM
# http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/
# https://www.reddit.com/r/MLQuestions/comments/dl7mya/a_few_more_questions_about_vaes/
# we'll also have an example concerning words (in the NLP domain) and see how we can
# leverage VAEs in that domain as well. for now, lets see how we can implement this
# for the vision domain, i.e. on the mnist dataset.
# note:
# For variational autoencoders, the encoder model is sometimes referred to as
# the 'recognition model' whereas the decoder model is sometimes referred to as
# the 'generative model'.
# if you haven't read the links I gave you, go read them all. every single one of them
# will help you grasp one aspect very well!
#
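# a minimal sketch of the reparametrization trick mentioned in the section title (added
# illustration, separate from the VAE class defined below): instead of sampling
# z ~ N(mu, sigma^2) directly, which is not differentiable w.r.t. mu and sigma,
# we sample eps ~ N(0, 1) and compute z = mu + sigma * eps, so gradients can flow
# through mu and logvar while the randomness lives entirely in eps.
def reparameterize_sketch(mu, logvar):
    std = torch.exp(0.5 * logvar)   # logvar = log(sigma^2)  ->  sigma
    eps = torch.randn_like(std)     # noise from a standard normal
    return mu + eps * std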
# now lets define our VAE model .
class VAE(nn.Module):
    def conv(self, in_dim, out_dim, k_size=3, stride=2, padding=1, batch_norm=True, bias=False):
        return nn.Sequential(nn.Conv2d(in_dim, out_dim, k_size, stride, padding, bias=bias),
                             nn.BatchNorm2d(out_dim) if batch_norm else nn.Identity(),
                             nn.ReLU())

    def deconv(self, in_dim, out_dim, k_size=3, stride=2, padding=1, batch_norm=True, bias=False):
        return nn.Sequential(nn.ConvTranspose2d(in_dim, out_dim, k_size, stride, padding, bias=bias),
                             nn.BatchNorm2d(out_dim) if batch_norm else nn.Identity(),
                             nn.ReLU())

    def __init__(self, embedding_size=100):
        super().__init__()
        self.embedding_size = embedding_size
        # our encoder will give two vectors, one for μ and another for σ.
        # using these two parameters, we sample our z representation vector,
        # which is used by the decoder to reconstruct the input.
        # So we can say that the encoder 'encodes' the data, which is 784-dimensional,
        # into a latent (hidden) representation space z, which has much fewer than 784
        # dimensions. This is typically referred to as a 'bottleneck' because the
        # encoder must learn an efficient compression of the data into this
        # lower-dimensional space. Let’s denote the encoder qθ(z∣x).
        # We note that the lower-dimensional space is stochastic:
        # >> the encoder outputs parameters to qθ(z∣x), which is a Gaussian probability
        # density.
        # We can sample from this distribution to get noisy values of the
        # representations z.
        self.fc1 = nn.Linear(28*28, 512)
        self.encoder = nn.Sequential(self.conv(3, 768),
                                     self.conv(768, 512),
                                     self.conv(512, 256),
                                     nn.MaxPool2d(2, 2),  # 16
                                     self.conv(256, 128),
                                     self.conv(128, 64),
                                     nn.MaxPool2d(2, 2),  # 8
                                     self.conv(64, 32),
                                     nn.MaxPool2d(2, 2),  # 4
                                     self.conv(32, 16),
                                     nn.MaxPool2d(2, 2),  # 2x2
                                     self.conv(16, 8),
                                     nn.MaxPool2d(2, 2),  # 1x1
                                     )
        self.fc1_mu = nn.Linear(8, self.embedding_size)   # mean
        # we use the log since we want to prevent getting a negative variance
        self.fc1_std = nn.Linear(8, self.embedding_size)  # log-variance
        # our decoder will accept a vector randomly sampled using
        # our mu and std.
        # The decoder is another neural net. Its input is the representation z,
        # it outputs the parameters to the probability distribution of the data,
        # and it has weights and biases ϕ. The decoder is denoted by pϕ(x∣z).
        # Running with the handwritten digit example, let’s say the photos are
        # black and white and represent each pixel as 0 or 1.
        # The probability distribution of a single pixel can then be represented
        # using a Bernoulli distribution. The decoder gets as input the latent
        # representation of a digit z and outputs 784 Bernoulli parameters,
        # one for each of the 784 pixels in the image.
        # The decoder 'decodes' the real-valued numbers in z into 784 real-valued
        # numbers between 0 and 1. Information from the original 784-dimensional
        # vector cannot be perfectly transmitted, because the decoder only has
        # access to a summary of the information
        # (in the form of a less-than-784-dimensional vector z).
        # How much information is lost? We measure this using the reconstruction
        # log-likelihood log pϕ(x∣z), whose units are nats. This measure tells us how
        # effectively the decoder has learned to reconstruct an input image x given
        # its latent representation z.
        self.decoder = nn.Sequential(nn.Linear(self.embedding_size, 8*1*1),
                                     self.deconv(8, 768, k_size=4, stride=2),
                                     self.deconv(768, 512, k_size=4, stride=2),
                                     self.deconv(512, 256, k_size=4, stride=2),
                                     self.deconv(256, 128, k_size=4, stride=2),
                                     self.deconv(128, 3, k_size=4, stride=2),
                                     # self.deconv(64, 32, k_size=4, stride=2),
                                     # self.deconv(32, 3, k_size=4, stride=2),
                                     nn.Sigmoid())
        # self.decoder = nn.Sequential(nn.Linear(self.embedding_size, 512),
        #                              nn.ReLU(),
        #                              nn.Linear(512, 28*28),
        #                              # in normal situations we wouldnt use sigmoid,
        #                              # but since we want our values to be in [0,1]
        #                              # we use sigmoid. for the loss we will then have
        #                              # to use plain BCE (and specifically not BCEWithLogits)
        #                              nn.Sigmoid())
        # Rather than directly outputting values for the latent state as we would
        # in a standard autoencoder, the encoder model of a VAE will output
        # "parameters (mean μ, variance σ) describing a distribution for each dimension in
        # the latent space".
        # Since we're assuming that our prior follows a normal distribution, we'll output
        # two vectors describing the mean and variance of the latent state distributions.
        # If we were to build a true multivariate Gaussian model, we'd need to define a
        # covariance matrix describing how each of the dimensions are correlated.
        # However, we'll make a simplifying assumption that our covariance matrix only
        # has nonzero values on the diagonal, allowing us to describe this information
        # in a simple vector.
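        # (added note) for the diagonal-Gaussian assumption described above, the KL term of
        # the VAE loss has a standard closed form (per sample, summed over the latent dims):
        #   KL( N(mu, diag(sigma^2)) || N(0, I) ) = -0.5 * sum(1 + logvar - mu^2 - exp(logvar))
        # which in PyTorch is typically written as:
        #   kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        # the full VAE loss is this KL term plus the reconstruction term
        # (e.g. binary cross-entropy for Bernoulli pixels).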