I thought it'd be fun to do this, so here we are.
Figuring out the convolution was quite simple and just involved adapting the valid cross-correlation, which slides the kernel on top of the input, multiplies the overlapping values, and sums them up. To get the convolution, you do the same thing with the kernel rotated 180 degrees. Thus, finding the convolution looks like

$$X * K = X \star \operatorname{rot180}(K)$$

where $\star$ denotes cross-correlation and $*$ denotes convolution.
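To make the mechanics concrete, here's a minimal NumPy sketch of both operations. The function names `cross_correlate2d_valid` and `convolve2d_full` are made up for illustration, and in practice you'd reach for a library routine like `scipy.signal.correlate2d` instead of writing the loops yourself:

```python
import numpy as np

def cross_correlate2d_valid(x, k):
    # Slide the kernel over the input, multiply overlapping values, and sum.
    h = x.shape[0] - k.shape[0] + 1
    w = x.shape[1] - k.shape[1] + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def convolve2d_full(x, k):
    # Full convolution = pad the input, then cross-correlate with the
    # kernel rotated 180 degrees.
    k_rot = np.rot90(k, 2)
    pad_h, pad_w = k.shape[0] - 1, k.shape[1] - 1
    x_pad = np.pad(x, ((pad_h, pad_h), (pad_w, pad_w)))
    return cross_correlate2d_valid(x_pad, k_rot)
```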
Now the crux of the CNN—the convolutional layer. This layer takes in a 3-dimensional block of data as input (depth × height × width). The kernels are also 3-dimensional blocks, each spanning the full depth of the input. Something neat is that we can have multiple kernels, and each one produces its own 2D slice of the output. Each kernel has an associated bias matrix with the same shape as that output slice. The layer then produces a 3-dimensional block of data as the output, whose depth equals the number of kernels. Computing one output slice involves taking the valid cross-correlation of the input with the kernel (summed over the input depth) and adding the bias, and that process is repeated with each kernel. We use the following formula for calculating the outputs:

$$Y_i = B_i + \sum_{j=1}^{n} X_j \star K_{ij}, \qquad i = 1, \dots, d$$

where $X_j$ is the $j$-th channel of the input, $K_{ij}$ is the $j$-th slice of the $i$-th kernel, $B_i$ is the bias, $n$ is the input depth, and $d$ is the number of kernels.
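To make that concrete, here's a rough sketch of the constructor and forward pass. It follows the same `Layer` interface as the other layers in this post, and I'm leaning on `scipy.signal.correlate2d` for the sliding-window work—treat it as a sketch rather than my exact code:

```python
import numpy as np
from scipy import signal

class Convolutional(Layer):
    def __init__(self, input_shape, kernel_size, depth):
        # depth = number of kernels, i.e. the depth of the output block
        input_depth, input_height, input_width = input_shape
        self.depth = depth
        self.input_shape = input_shape
        self.input_depth = input_depth
        self.output_shape = (depth, input_height - kernel_size + 1, input_width - kernel_size + 1)
        self.kernels_shape = (depth, input_depth, kernel_size, kernel_size)
        self.kernels = np.random.randn(*self.kernels_shape)
        self.biases = np.random.randn(*self.output_shape)

    def forward(self, input):
        self.input = input
        # Start from the biases, then add the valid cross-correlation of every
        # input channel with the matching kernel slice.
        self.output = np.copy(self.biases)
        for i in range(self.depth):
            for j in range(self.input_depth):
                self.output[i] += signal.correlate2d(self.input[j], self.kernels[i, j], "valid")
        return self.output
```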
To update the kernels and biases, we need to compute their gradients. We're given the derivative of E with respect to the layer's output, ∂E/∂Y, and from it we have to work out ∂E/∂K (to update the kernels), ∂E/∂B (to update the biases), and ∂E/∂X (to pass back to the previous layer).
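For reference, the gradients work out to the following (this is the standard result for this layer; the full derivation is covered in the video I followed):

$$\frac{\partial E}{\partial K_{ij}} = X_j \star \frac{\partial E}{\partial Y_i}, \qquad \frac{\partial E}{\partial B_i} = \frac{\partial E}{\partial Y_i}, \qquad \frac{\partial E}{\partial X_j} = \sum_{i=1}^{d} \frac{\partial E}{\partial Y_i} * K_{ij}$$

where $\star$ is the valid cross-correlation and $*$ is the full convolution.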
Note: to be honest, this is where I got really lost, and I'll definitely be revisiting it later to better understand what's going on. It's also the core element of the computer vision algorithms using deep learning today, so it's pretty important to understand this part.
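Translating those formulas into code, the backward method of the Convolutional layer sketched above would look roughly like this (again a sketch, with scipy.signal doing the heavy lifting):

```python
    def backward(self, output_gradient, learning_rate):
        kernels_gradient = np.zeros(self.kernels_shape)
        input_gradient = np.zeros(self.input_shape)

        for i in range(self.depth):
            for j in range(self.input_depth):
                # dE/dK_ij = X_j (valid cross-correlation) dE/dY_i
                kernels_gradient[i, j] = signal.correlate2d(self.input[j], output_gradient[i], "valid")
                # dE/dX_j = sum over i of dE/dY_i (full convolution) K_ij
                input_gradient[j] += signal.convolve2d(output_gradient[i], self.kernels[i, j], "full")

        # Gradient descent step on the parameters; dE/dB is just dE/dY.
        self.kernels -= learning_rate * kernels_gradient
        self.biases -= learning_rate * output_gradient
        return input_gradient
```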
Next up is the reshape layer, which simply inherits from the base Layer class. The class looks something like this:
```python
class Reshape(Layer):
    def __init__(self, input_shape, output_shape):
        self.input_shape = input_shape
        self.output_shape = output_shape

    def forward(self, input):
        return np.reshape(input, self.output_shape)

    def backward(self, output_gradient, learning_rate):
        return np.reshape(output_gradient, self.input_shape)
```
The constructor takes in the shape of the input and the output. The forward method reshapes the input to the output shape, and the backward method reshapes the output gradient back to the input shape. Not too much going on here.
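For example, flattening a block of feature maps into a column vector (these shapes match the network later in this post; the variable names are just for illustration):

```python
reshape = Reshape((5, 26, 26), (5 * 26 * 26, 1))
block = np.random.randn(5, 26, 26)       # e.g. 5 feature maps of 26x26
flat = reshape.forward(block)            # shape (3380, 1), ready for a Dense layer
restored = reshape.backward(flat, 0.1)   # back to shape (5, 26, 26); the learning rate is unused here
```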
For the loss, we use binary cross-entropy. We're given a vector $Y^*$ of true values and a vector $Y$ of predictions, and the error is

$$E = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i^* \log(y_i) + (1 - y_i^*) \log(1 - y_i) \right]$$

The goal is to compute the derivative of E with respect to the output. Upon differentiating, we get

$$\frac{\partial E}{\partial y_i} = \frac{1}{n} \left( \frac{1 - y_i^*}{1 - y_i} - \frac{y_i^*}{y_i} \right)$$

Also, I added a small epsilon value that prevents log(0) and division by 0. After converting this to code, it looks something like this:
```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def binary_cross_entropy_prime(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return ((1 - y_true) / (1 - y_pred) - y_true / y_pred) / np.size(y_true)
```
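As a quick sanity check (my own made-up example): predictions close to the true labels give a small loss, while confidently wrong ones blow it up.

```python
y_true = np.array([[1.0], [0.0]])

print(binary_cross_entropy(y_true, np.array([[0.9], [0.1]])))  # ~0.105, good predictions
print(binary_cross_entropy(y_true, np.array([[0.1], [0.9]])))  # ~2.303, bad predictions
```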
The sigmoid activation takes any real number and squashes it to a value between 0 and 1. This is particularly useful for binary classification problems, where the output is interpreted as a probability. The sigmoid activation is defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

and its derivative is

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
And the implementation looks like this:
```python
import numpy as np

from activation import Activation

class Sigmoid(Activation):
    def __init__(self):
        def sigmoid(x):
            return 1 / (1 + np.exp(-x))

        def sigmoid_prime(x):
            s = sigmoid(x)
            return s * (1 - s)

        super().__init__(sigmoid, sigmoid_prime)
```
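The `Activation` base class isn't shown above. In case it helps, here's roughly what it needs to provide for `Sigmoid` to work: apply the function element-wise on the forward pass, and multiply the incoming gradient by the derivative on the backward pass (a sketch, not necessarily identical to my actual file—the `from layer import Layer` module name is an assumption):

```python
import numpy as np

from layer import Layer  # assuming the base Layer class lives in layer.py

class Activation(Layer):
    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    def forward(self, input):
        self.input = input
        return self.activation(self.input)

    def backward(self, output_gradient, learning_rate):
        # Element-wise chain rule: dE/dX = dE/dY * f'(X)
        return np.multiply(output_gradient, self.activation_prime(self.input))
```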
MNIST is a dataset of handwritten digits (0-9). The goal of this CNN is to classify each of these images as the digit it shows. We load the MNIST dataset from the keras library like so:
```python
import numpy as np
from keras.datasets import mnist
from keras.utils import np_utils

def preprocess_data(x, y, limit):
    # Keep (at most `limit`) examples each of the digits 0 and 1
    zero_index = np.where(y == 0)[0][:limit]
    one_index = np.where(y == 1)[0][:limit]
    all_indices = np.hstack((zero_index, one_index))
    all_indices = np.random.permutation(all_indices)
    x, y = x[all_indices], y[all_indices]
    # Reshape to (depth, height, width) blocks and normalize to [0, 1]
    x = x.reshape(len(x), 1, 28, 28)
    x = x.astype("float32") / 255
    # One-hot encode the labels and shape them as column vectors
    y = np_utils.to_categorical(y)
    y = y.reshape(len(y), 2, 1)
    return x, y

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, y_train = preprocess_data(x_train, y_train, 100)
x_test, y_test = preprocess_data(x_test, y_test, 100)
```
First, we get the indices of images representing a zero or a one, keeping at most `limit` of each. We then stack these index arrays together, shuffle them, and use them to pull out just those images and labels. Each image is reshaped from 28x28 pixels to a 3D block of 1x28x28 pixels, because our convolutional layer takes in a 3D block of data with the depth as the first dimension. The pixel values range from 0 to 255, so we normalize the input by dividing by 255. For the labels, we use another util from keras called to_categorical, which creates a one-hot encoded vector from a number. Finally, we reshape each label vector to a 2x1 column because the dense layer takes in this type of input.
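For example, the one-hot encoding step turns raw labels into rows with a single 1 in the position of the class:

```python
np_utils.to_categorical([0, 1, 1])
# array([[1., 0.],
#        [0., 1.],
#        [0., 1.]], dtype=float32)
```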
FINALLY our network looks something like this:
```python
network = [
    Convolutional((1, 28, 28), 3, 5),        # 5 kernels of size 3x3 -> output (5, 26, 26)
    Sigmoid(),
    Reshape((5, 26, 26), (5 * 26 * 26, 1)),  # flatten the feature maps into a column vector
    Dense(5 * 26 * 26, 100),
    Sigmoid(),
    Dense(100, 2),                           # 2 outputs: one per class (0 or 1)
    Sigmoid()
]
```
We then define our epochs and learning rate. I used values of 20 and 0.1, respectively.
Now for training: it looks quite similar to training a regular neural network, except we use the binary cross-entropy loss here.
error += binary_cross_entropy(y, output)
grad = binary_cross_entropy_prime(y, output)
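For completeness, here's a sketch of what that training loop looks like end to end: forward through every layer, accumulate the error, then push the gradient back through the layers in reverse, plus a quick accuracy check on the test set. The structure follows the usual pattern from the video; the details may differ from my exact script.

```python
epochs = 20
learning_rate = 0.1

for epoch in range(epochs):
    error = 0
    for x, y in zip(x_train, y_train):
        # Forward pass: feed the sample through every layer in order
        output = x
        for layer in network:
            output = layer.forward(output)

        error += binary_cross_entropy(y, output)

        # Backward pass: propagate the loss gradient through the layers in reverse
        grad = binary_cross_entropy_prime(y, output)
        for layer in reversed(network):
            grad = layer.backward(grad, learning_rate)

    error /= len(x_train)
    print(f"epoch {epoch + 1}/{epochs}, error = {error}")

# Quick check on the test set
correct = 0
for x, y in zip(x_test, y_test):
    output = x
    for layer in network:
        output = layer.forward(output)
    correct += np.argmax(output) == np.argmax(y)
print(f"accuracy: {correct / len(x_test)}")
```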
Then we just run the script:

python3 xor.py
This was super fun to build and I learned a lot. Thanks to The Independent Code and his extremely informative video, which I followed and adapted.