Neural networks are complex architectures that combine feedforward computation and backpropagation to optimize prediction: inference is performed using a set of weights, and those weights are adjusted based on backpropagation.
The weights are randomly initialized at the beginning, and the inputs are fed into the model. After a series of matrix operations, an output is obtained. This part of the model is called feedforward: it uses the weights to make a prediction, which can then be compared with the actual output in order to adjust the weights and bias accordingly.
The output of the model is compared with the actual output, and the difference between them is measured using a chosen metric. Based on this difference, the weights and bias are updated using what is known as Gradient Descent, which aims to make the model weights converge to the values that give the least possible error in the metric used to compare model predictions with the desired output.
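As a concrete form of the update (the article does not spell it out, so this is the standard formulation, with α an assumed learning rate), gradient descent repeatedly nudges each parameter against its gradient:

w_j \leftarrow w_j - \alpha \frac{\partial C}{\partial w_j}, \qquad b \leftarrow b - \alpha \frac{\partial C}{\partial b}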
In a logistic regression model with n features and one output, each feature is weighted and the results are added; the weighted sum is then passed through an activation function that gives the probability of the input belonging to a specific class.
The expression for the output can be obtained as,

y_i = \sigma(w_1 x_{1i} + w_2 x_{2i} + \dots + w_n x_{ni} + b)
where the subscript 'i' is the index of the sample passed through the model, since the model has to train on several samples to learn.
The weights and the inputs can be written as matrices in order to make the expression simpler,

W = \begin{bmatrix} w_1 & w_2 & \dots & w_n \end{bmatrix}, \qquad X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}

where W has dimensions [1 x n] and X has dimensions [n x 1].
The bias is a single scalar, since there is only one output. The matrix multiplication of the weights and inputs matrices gives a matrix with a single element, which is added to the bias and then passed through the activation function. The output can therefore be expressed as,

y = \sigma(WX + b)
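As a minimal sketch of this single-sample feedforward step in NumPy (the sigmoid is used here as an assumed example of σ, and all variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    # Logistic activation: maps the weighted sum to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

n = 3                       # number of features
W = np.random.randn(1, n)   # weights matrix, shape [1 x n]
X = np.random.randn(n, 1)   # one input sample, shape [n x 1]
b = 0.5                     # scalar bias

y = sigmoid(W @ X + b)      # y = sigma(WX + b), shape [1 x 1]
```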
If several samples of data are passed through the model, the input matrix needs to be extended to include all the samples, and the output can be represented as a matrix consisting of the outputs for the different samples. If the number of samples is r, the input matrix can be represented as,

X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1r} \\ x_{21} & x_{22} & \dots & x_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nr} \end{bmatrix}
which is a matrix with dimensions [n x r], with each column representing one sample of data and each row representing the index of the input. Similarly, the output matrix can be represented as,

Y = \begin{bmatrix} y_1 & y_2 & \dots & y_r \end{bmatrix}
which is a matrix with dimensions [1 x r], with each column representing the output of one sample of data.
The weight matrix remains unchanged, as the same set of weights is used for all samples. The bias matrix, however, is broadcast in order to have the same dimensions as the output,

B = \begin{bmatrix} b & b & \dots & b \end{bmatrix}
which is a matrix with dimensions [1 x r], with every column having the same value b, since the same bias is used for all samples. Now, the output matrix can be expressed in terms of the weights matrix, the inputs matrix, and the bias matrix as,

Y = \sigma(WX + B)
Dimensionally speaking, the matrix multiplication of the weights matrix [1 x n] and the inputs matrix [n x r] gives a matrix of dimensions [1 x r]; the bias matrix of the same dimensions is added to it, and the activation function is applied to each element individually to give the output matrix of dimensions [1 x r].
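A minimal NumPy sketch of this vectorized forward pass over r samples; NumPy broadcasts the scalar bias across the [1 x r] weighted-sum matrix automatically, so the explicit B matrix never needs to be built (sigmoid again stands in for σ, and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, r = 3, 5                 # 3 features, 5 samples
W = np.random.randn(1, n)   # weights matrix, shape [1 x n]
X = np.random.randn(n, r)   # inputs, one sample per column, shape [n x r]
b = 0.5                     # scalar bias, broadcast to [1 x r] on addition

Y = sigmoid(W @ X + b)      # Y = sigma(WX + B), shape [1 x r]
print(Y.shape)              # (1, 5)
```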
This is the vectorization of the model when the number of outputs is 1. When the number of outputs is more than 1, the outputs matrix has to be extended by increasing the number of rows, with each row corresponding to a distinct output of the model.
The output matrix when the number of outputs of the model is more than 1, say m, is given by,

Y = \begin{bmatrix} y_{11} & y_{12} & \dots & y_{1r} \\ y_{21} & y_{22} & \dots & y_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ y_{m1} & y_{m2} & \dots & y_{mr} \end{bmatrix}
where each column represents the outputs for a distinct sample of data, and each row represents the index of the output of the model. The weights matrix also needs to be extended, as a different set of weights is used for each output, and the bias matrix needs to be extended as well, since a different bias is used for each output; the inputs matrix remains the same.
The weights matrix is a matrix with dimensions [m x n], where each row holds the set of weights used to obtain a distinct output, and each column holds the weights applied to a specific input index,

W = \begin{bmatrix} w_{11} & w_{12} & \dots & w_{1n} \\ w_{21} & w_{22} & \dots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \dots & w_{mn} \end{bmatrix}
The bias matrix is a matrix with dimensions [m x r], where each row holds the bias used for a distinct output, repeated across the columns because the same bias applies to every sample,

B = \begin{bmatrix} b_1 & b_1 & \dots & b_1 \\ b_2 & b_2 & \dots & b_2 \\ \vdots & \vdots & \ddots & \vdots \\ b_m & b_m & \dots & b_m \end{bmatrix}
The output matrix of the model can be expressed in terms of the weights matrix, the inputs matrix, and the bias matrix as,

Y = \sigma(WX + B)
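Extending the same sketch to m outputs only changes the shapes: W becomes [m x n], and the bias becomes a column of m values that NumPy broadcasts across the r sample columns (again an illustrative sketch, not the article's own code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n, r = 2, 3, 5           # 2 outputs, 3 features, 5 samples
W = np.random.randn(m, n)   # one row of weights per output, shape [m x n]
X = np.random.randn(n, r)   # inputs, one sample per column, shape [n x r]
b = np.random.randn(m, 1)   # one bias per output, broadcast to [m x r] on addition

Y = sigmoid(W @ X + b)      # Y = sigma(WX + B), shape [m x r]
print(Y.shape)              # (2, 5)
```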
Let the cost function be denoted by C. The objective is to find the variation of the cost with respect to the weights. The cost is calculated from the actual value of the output to be predicted and the value predicted by the model.
Let the weighted sum of the inputs be denoted by z, and the activated value of z by y; the output of the model is therefore y. The derivative of the cost with respect to a specific weight w_j can be written as,

\frac{\partial C}{\partial w_j} = \frac{\partial C}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w_j}
The above equation follows from the chain rule of derivatives. This expression has three terms,
- Derivative of the cost with respect to the output
- Derivative of the output with respect to the weighted sum of the inputs
- Derivative of the weighted sum of the inputs with respect to the weight
If the cost function taken is the Mean Square Error function,

C = \frac{1}{r} \sum_{i=1}^{r} \left( y_i - \hat{y}_i \right)^2

where y_i is the value predicted by the model for sample i and \hat{y}_i is the corresponding actual value,
which is the sum of the squares of the differences between the actual and the predicted values, averaged over all the samples.
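As a sketch, the same cost in NumPy, assuming y_pred and y_actual are [1 x r] arrays of predicted and actual values:

```python
import numpy as np

def mse(y_pred, y_actual):
    # Mean Square Error: squared differences averaged over all r samples
    return np.mean((y_pred - y_actual) ** 2)
```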
The derivative of the cost function with respect to the output, for a single sample (the averaging over all samples is deferred to the final step), is given by,

\frac{\partial C}{\partial y} = 2\,(y - \hat{y})
If the activation function is represented by σ, the value of y in terms of z is given by,

y = \sigma(z)
The derivative of the output (y) with respect to the weighted sum of the inputs (z) is given by,

\frac{\partial y}{\partial z} = \sigma'(z)
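For instance, if σ is taken to be the sigmoid function (a common choice, though the derivation above keeps σ generic), its derivative has a convenient closed form:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\,\big(1 - \sigma(z)\big)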
The weighted sum of the inputs (z) in terms of the weights (w) is given by,

z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
The derivative of the weighted sum of the inputs (z) with respect to a specific weight w_j is given by,

\frac{\partial z}{\partial w_j} = x_j
Therefore, the derivative of the weighted sum of the inputs with respect to a specific weight is the input data point that is multiplied by that specific weight in the expression for the weighted sum.
Combining the three terms, and averaging over all the samples,

\frac{\partial C}{\partial w_j} = \frac{1}{r} \sum_{i=1}^{r} 2\,(y_i - \hat{y}_i)\,\sigma'(z_i)\,x_{ji}
In order to find the variation of the cost with respect to the bias,

\frac{\partial C}{\partial b} = \frac{\partial C}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial b}
The first two terms are the same as those in the derivative of the cost with respect to the weights. For the third term,

\frac{\partial z}{\partial b} = 1
Combining the three terms, and averaging over all the samples,

\frac{\partial C}{\partial b} = \frac{1}{r} \sum_{i=1}^{r} 2\,(y_i - \hat{y}_i)\,\sigma'(z_i)
The variation of the cost with respect to the weights and the bias is therefore given by,

\frac{\partial C}{\partial w_j} = \frac{1}{r} \sum_{i=1}^{r} 2\,(y_i - \hat{y}_i)\,\sigma'(z_i)\,x_{ji}, \qquad \frac{\partial C}{\partial b} = \frac{1}{r} \sum_{i=1}^{r} 2\,(y_i - \hat{y}_i)\,\sigma'(z_i)
Using these two equations, the gradient matrix of the weights and the bias can be found.
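Putting both gradients to work, the following is a minimal NumPy sketch of one full feedforward-plus-gradient-descent step for the single-output model, with the sigmoid as an assumed σ and alpha as an assumed learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

n, r = 3, 5
alpha = 0.1                        # learning rate (assumed hyperparameter)
W = np.random.randn(1, n)          # weights, shape [1 x n]
b = 0.0                            # scalar bias
X = np.random.randn(n, r)          # inputs, one sample per column, shape [n x r]
Y_actual = np.random.rand(1, r)    # actual values, shape [1 x r]

# Feedforward
Z = W @ X + b                      # weighted sums, shape [1 x r]
Y = sigmoid(Z)                     # predictions, shape [1 x r]

# Gradients: dC/dw_j = (1/r) sum_i 2 (y_i - yhat_i) sigma'(z_i) x_ji
delta = 2.0 * (Y - Y_actual) * sigmoid_prime(Z)   # per-sample terms, shape [1 x r]
dW = (delta @ X.T) / r             # gradient w.r.t. weights, shape [1 x n]
db = np.mean(delta)                # gradient w.r.t. bias, since dz/db = 1

# Gradient-descent update
W = W - alpha * dW
b = b - alpha * db
```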