Tutorial2

Introduction

This is the second part of the CNTK tutorial, where we will start using CNTK closer to its full potential. We will go deep! This tutorial assumes that you have already gone through the first part, so you are familiar with basic CNTK/ML concepts such as logistic regression and softmax. In the first tutorial we built models to solve simple binary and multi-class classification problems. Though those models achieved good accuracy, they will not perform as well on harder real-world problems. One principal reason is that the decision boundaries between classes are not always linear. In this tutorial, we will learn to build more complex models, namely neural networks and convolutional neural networks, and use them to build an image classification system with the MNIST dataset as our benchmark.

The MNIST Data

MNIST is an image dataset of handwritten digits. It is divided into a training set of 60,000 examples and a test set of 10,000 examples. This dataset is a subset of the original data from NIST, pre-processed and published by LeCun et al. For more details please refer to this page. The MNIST dataset has become a standard benchmark for machine learning methods because it is real-world data, yet it is simple and requires minimal effort in pre-processing and formatting. Each instance of the data consists of a 28x28 pixel image representing one digit, with a label between 0 and 9. Here are some examples:

MNIST Examples

Getting the Data

We have put up a Python script that easily fetches and prepares the data for CNTK consumption. Download it here: mnist_convert.py.

Some Important CNTK Concepts

Before building the neural networks for handwritten digit image recognition using the MNIST dataset, we will go through two important concepts of CNTK: (1) Macros; and (2) the Model Editing Language (MEL). If you are interested in additional information on either of these topics, please refer to the CNTK book for more details.

CNTK Macros

Macros cut down on the verbosity of defining networks, increase code re-use, and help to prevent errors. Plus, defining macros in CNTK is easy! They can be defined in-line or as a block. Macros can have parameters, and can be shared across multiple configuration files. The variables and parameters used inside the macros are local to the macro but are accessible externally with dot operators (e.g. macro.V1). The return value of a block macro is defined by a local macro variable that has the same name as the macro. If no variable matches, the last variable in the macro will be returned. Other macros can be called from within a macro, but recursion is not supported. Here is an example:

# Block macro to define a layer with a sigmoid activation function
DNNSigmoidLayer(inDim, outDim, x, parmScale) = [
    W = LearnableParameter(outDim, inDim, init="uniform", initValueScale=parmScale) 
    b = LearnableParameter(outDim, 1,     init="uniform", initValueScale=parmScale) 
    t = Times(W, x)
    z = Plus(t, b)
    y = Sigmoid(z)
]

# In-line macro for a ReLU unit
RFF(x1, w1, b1) = RectifiedLinear(Plus(Times(w1, x1), b1))

Model Editing Language

The Model Editing Language (MEL) of CNTK provides a means to modify both the structure and the model parameters of an existing trained network using a set of commands. It offers a number of functions to modify the network and can use the Network Description Language (NDL) to define new elements. MEL allows users to, for example, train a network with one configuration and later use it as part of another network designed for a different purpose. To use MEL, you add an edit command to the configuration file (we will elaborate on that later on). Below is a MEL script that loads an existing network, normalizes the input, and then saves the new model.

# Load a trained model
model1 = LoadModel("c:\models\mymodel.dnn", format=cntk) 

# Set it as the default model
SetDefaultModel(model1)

# Create an additional hidden layer 
Copy(L3.*, L4.*, copy="all")

# Hook up the new layer
SetInputValue(L4.*.T, 1, L3.RL) # Layer 3 output to Layer 4 input 
SetInputValue(CE.*.T, 1, L4.RL) # Layer 4 output to Top layer input

# Add mean variance normalization using in-line NDL 
meanVal = Mean(features)
invstdVal = InvStdDev(features) 
inputVal = PerDimMeanVarNormalization(features,meanVal, invstdVal)

# Make the features input now take the normalized input 
SetInputValue(L1.BFF.FF.T, 1, inputVal)

# Save the model 
SaveModel("c:\models\mymodel4HiddenWithMeanVarNorm.cn")

Start Shallow: One Hidden Layer Neural Network

Let's get back to the task at hand: classifying images of hand-written digits. To do so, we will build our first neural network with CNTK. Starting simple, our network will only have one hidden layer.

Neural Network vs. Softmax Regression

We saw in the previous tutorial that softmax regression can learn to separate data with more than two classes. However, the separation boundaries are linear. What if those boundaries were trickier? In that case, we could distort the feature space in a way to make the data linearly separable, and this is what a hidden layer can do for us. So basically, we take our softmax regression solution and plug in a hidden layer connected to the network’s inputs. Such a layer will learn to apply a feature mapping that projects the data into a space where it is (hopefully) linearly separable. Then, the next layer will receive an easier problem to deal with using its linear decision boundaries.
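
In equations, this is a sketch of what the model we are about to build computes, based on the sigmoid hidden layer and softmax output used in the network definition below (x is the 784-dimensional pixel vector, h the 200-dimensional hidden representation, and p the distribution over the 10 digit classes):

h = \mathrm{sigmoid}(W_1 x + b_1)
p = \mathrm{softmax}(W_2 h + b_2)

The non-linearity in the hidden layer is what makes the learned feature mapping non-linear; without it, the two layers would collapse into a single linear map followed by a softmax.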

If we run softmax regression (from the previous tutorial) on the MNIST problem, we get an error rate of 7.5%. We will show that with one hidden layer the error can go down to 2.39%.

The Network Definition

First, we define our features and labels. Note that we apply a scaling of (1.0 / 256.0) to the features in order to bring their values into the range [0-1]. Normalization helps SGD converge faster and reach better predictive performance. Then, we specify the topology of our network, which looks similar to the one we used in Part 1 of the tutorial, except that it has an additional layer: the hidden layer built with DNNSigmoidLayer. The layers are defined in a separate macro file, Macros.ndl.

# macros to include
load = ndlMnistMacros

# the actual NDL that defines the network
run = DNN

ndlMnistMacros = [
    featDim = 784							# 28x28 pixel images
    labelDim = 10

    features = InputValue(featDim)
    featScale = Constant(0.00390625)		# 1.0 / 256.0
    featScaled = Scale(featScale, features)
    labels = InputValue(labelDim)
]

DNN = [
    hiddenDim = 200

    # DNNSigmoidLayer and DNNLayer are defined in Macros.ndl
    h1 = DNNSigmoidLayer(featDim, hiddenDim, featScaled, 1)
    ol = DNNLayer(hiddenDim, labelDim, h1, 1)
    
    ce = CrossEntropyWithSoftmax(labels, ol)
    err = ErrorPrediction(labels, ol)
    
    # Special Nodes
    FeatureNodes = (features)
    LabelNodes = (labels)
    CriterionNodes = (ce)
    EvalNodes = (err)
    OutputNodes = (ol)
]

SGD Parameters

The SGD (Stochastic Gradient Descent) block tells CNTK how to optimize the network to find the best parameters. This includes information about mini-batch size (so the computation is more efficient), the learning rate, and how many epochs to train. Here is the SGD block for our example:

    SGD = [
        epochSize = 60000
        minibatchSize = 32
        learningRatesPerMB = 0.1
        momentumPerMB = 0
        maxEpochs = 30
    ]

Below is a list of the most common parameters used with the SGD block; a small example of the schedule syntax follows the list:

  • epochSize: the number of samples to use in each epoch. An intermediate model and other checkpoint information are saved for each epoch. When set to 0, the size of the whole dataset is used.
  • minibatchSize: the minibatch size. The default value is 256. You can use different values for different epochs, e.g., 128*2:1024 means using minibatch size of 128 for the first two epochs and then 1024 for the rest.
  • learningRatesPerMB: the learning rates per epoch. You can use different values for different epochs, e.g., 0.8*10:0.2 means use the learning rate 0.8 for the first 10 epochs and then 0.2 for the rest.
  • learningRatesPerSample: the learning rates per sample per epoch. If you want your learning rate to vary as a function of the minibatch size, you can use learningRatesPerSample instead of learningRatesPerMB. This automatically increases the learning rate for the minibatch when the minibatch size increases, that is, the value defined here is multiplied by the minibatch size to get the learning rate. You can use different values for different epochs, e.g., 0.008*10:0.002 means use the learning rate 0.008 for the first 10 epochs and then 0.002 for the rest.
  • momentumPerMB: The default value is 0.9. Different values can be given to different epochs. It is important to note that CNTK has a particular behaviour when dealing with momentum: the learning rate is automatically scaled further by a factor of (1 – momentum).
  • momentumPerSample: like the learning rate, momentum can also be defined at the sample level, and different values can be given to different epochs.
  • maxEpochs: the maximum number of epochs to run.
  • dropoutRate: the dropout rate per epoch. The default value is 0.
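
For illustration only, here is a hypothetical SGD block (not the one used in this tutorial; the values are arbitrary) that combines several of the schedules described above to show the syntax:

SGD = [
    epochSize = 0                     # 0 means use the whole training set in each epoch
    minibatchSize = 128*2:1024        # 128 for the first two epochs, then 1024
    learningRatesPerMB = 0.8*10:0.2   # 0.8 for the first 10 epochs, then 0.2
    momentumPerMB = 0.9               # the learning rate is further scaled by (1 - momentum)
    maxEpochs = 30
]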

Putting it all Together

Here are the configuration files you need to run this example:

Once you have cloned the CNTK repository you will have all the required files. You can then run the example from the Image/MNIST/Data folder using:

cntk configFile=../Config/01_OneHidden.cntk

or you can run it from any folder and specify the Data folder as the currentDirectory, e.g. running from the Image/MNIST folder using:

cntk configFile=Config/01_OneHidden.cntk currentDirectory=Data

The output folder will be created inside Image/MNIST/, and we get the test results on the console:

Final Results: Minibatch[1-625]: Samples Seen = 10000    
err: ErrorPrediction/Sample = 0.0239    ce: CrossEntropyWithSoftmax/Sample = 0.076812531

This model has an error of 2.39%. Not too bad for image recognition! But now let's build an even better model by using Convolutional Neural Networks...

Go Deep: Convolutional Neural Networks (CNNs)

As we have seen, a simple neural network can achieve impressive results on this task. However, these results are modest compared to what has been reported in the literature. We can do much better by borrowing a few ideas from convolutional neural networks. So, let's start with the main building blocks of CNNs, and then build such a network with CNTK.

CNNs: The Ingredients

A convolutional neural network is formed by stacking different types of layers in a certain order. Here we describe each of these layers briefly.

Convolutional Layer

A convolutional layer can capture local patterns thanks to the local connectivity of its units. Each layer has one or more feature maps (a.k.a. kernels). A feature map is essentially a sliding window over sub-regions of the layer's inputs, where each application of the window produces one output. That output is computed as a dot product between the feature map's parameters and the corresponding inputs. The output produced by one feature map is called a depth slice. Here is a simple example of how a feature map is applied.

Conv Layer Example
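
In formula form, a sketch for a single feature map k applied with stride 1 and no padding (c ranges over the input channels, and b_k is the per-map bias that the ConvReLULayer macro shown later adds to the result):

y_k(i, j) = b_k + \sum_{c} \sum_{u=0}^{kH-1} \sum_{v=0}^{kW-1} w_{k,c,u,v} \, x_{c,\, i+u,\, j+v}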

Activation Function Layer

This layer applies a non-linear activation function to each unit's output and comes right after the convolutional layer. One of the most commonly used functions is the Rectified Linear Unit (ReLU), which is simply max(0, x). Its advantage over a sigmoid function is that it does not suffer from the vanishing gradient problem, so learning can be more efficient. Note that the ReLU gradient at zero is taken to be zero.
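
Written out, with the zero-at-zero gradient convention mentioned above:

\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}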

MaxPooling Layer

MaxPooling is a layer placed after the activation function. It divides the input into a set of non-overlapping regions and, for each region, outputs the maximum activation value. The advantage of this is two-fold: (1) it reduces the number of parameters and thus helps control overfitting; and (2) it selects the salient activation values regardless of their location within the region, which helps train models that are more robust to small rotations and translations. Note that MaxPooling operates independently on each depth slice of the input. Here is an example (from Wikipedia) of max pooling with a window size of 2x2.

Attribution: By Aphex34 (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons
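
To make the operation concrete, here is a small worked example of 2x2 max pooling with stride 2 (input values chosen purely for illustration): each 2x2 block of the input is replaced by its maximum.

\begin{pmatrix} 1 & 0 & 2 & 3 \\ 4 & 6 & 6 & 8 \\ 3 & 1 & 1 & 0 \\ 1 & 2 & 2 & 4 \end{pmatrix} \;\longrightarrow\; \begin{pmatrix} 6 & 8 \\ 3 & 4 \end{pmatrix}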

Dense Layer

Finally, after cascading several convolutional, activation function, and MaxPooling layers, a CNN will have one or more dense layers. Units in a dense layer have full connections to all activations in the previous layer, similar to regular neural networks.

Softmax Layer

We know this from the first part of the tutorial. The output is a probability distribution over the possible classes. Below is a chart of a CNN with two alternating convolution / activation and MaxPooling layers, one dense layer, and one softmax layer.

CNN

The Network Definition

Our CNN has a somewhat more complex definition than our previous networks. Starting from the features, you will notice that we define each sample as a 28 x 28 matrix rather than a vector. This is because a CNN exploits local correlations in the image, so we need to preserve this spatial information. Second, in addition to the layers we saw in the previous network, we define a cascade of convolutional and max-pooling layers, two of each type. The core layer is ConvReLULayer, which is defined as a macro in Macros.ndl. Here is what this macro looks like:

ConvReLULayer(inp, outMap, inWCount, kW, kH, hStride, vStride, wScale, bValue) = [
    convW = LearnableParameter(outMap, inWCount, init="uniform", initValueScale=wScale)
    convB = ImageParameter(1, 1, outMap, init="fixedValue", value=bValue, imageLayout=$imageLayout$)
    conv = Convolution(convW, inp, kW, kH, outMap, hStride, vStride, zeroPadding=false, imageLayout=$imageLayout$)
    convPlusB = Plus(conv, convB)
    act = RectifiedLinear(convPlusB)
]

The Convolution node takes the following parameters:

Convolution(w, image, kernelWidth, kernelHeight, outputChannels, 
			horizontalSubsample, verticalSubsample, 
			[zeroPadding=false, maxTempMemSizeInSamples=0, 
			imageLayout="HWC|cudnn"])
  • w: convolution weight matrix. It has the dimensions of [outputChannels, kernelWidth * kernelHeight * inputChannels]. If w’s dimensions are not specified (i.e., all are zeroes), they will be automatically set by CNTK using a depth-first traversing pass.
  • image: the input image.
  • kernelWidth: width of the kernel
  • kernelHeight: height of the kernel
  • outputChannels: number of output channels
  • horizontalSubsample: subsamples (or stride) in the horizontal direction. In most cases this should be set to 1.
  • verticalSubsample: subsamples (or stride) in the vertical direction. In most cases this should be set to 1.
  • zeroPadding: [optional] specify whether the sides of the image should be padded with zeros. Default is false. When it’s true, the convolution window can move out of the image.
  • maxTempMemSizeInSamples: [optional] maximum amount of memory (in samples) that should be reserved to do matrix packing. Default is 0 which means the same as the input samples.
  • imageLayout: [optional] the storage format of each image. By default it’s HWC, which means each image is stored as [channel, width, height] in column major. If you use cuDNN to speed up training, you should set it to cudnn, which means each image is stored as [width, height, channel].

MaxPooling is a CNTK node type which takes the following parameters:

MaxPooling(m, windowWidth, windowHeight, stepW, stepH, imageLayout="HWC|cudnn")
  • m: input matrix
  • windowWidth: width of the pooling window
  • windowHeight: height of the pooling window
  • stepW: step (or stride) used in the width direction
  • stepH: step (or stride) used in the height direction
  • imageLayout: [optional] the storage format of each image. By default it’s HWC, which means each image is stored as [channel, width, height] in column major. If you use cuDNN to speed up training, you should set it to cudnn, which means each image is stored as [width, height, channel].

Finally, here is the full definition of our CNN that will learn to classify images of hand-written digits:

# macros to include
load = ndlMnistMacros

# the actual NDL that defines the network
run = DNN

ndlMnistMacros = [
    imageW = 28
    imageH = 28
    labelDim = 10

    features = ImageInput(imageW, imageH, 1, imageLayout=$imageLayout$)
    featScale = Constant(0.00390625)
    featScaled = Scale(featScale, features)
    labels = InputValue(labelDim)
]

DNN=[
    # conv1
    kW1 = 5
    kH1 = 5
    cMap1 = 16
    hStride1 = 1
    vStride1 = 1
    # weight[cMap1, kW1 * kH1 * inputChannels]
    # ConvReLULayer is defined in Macros.ndl
    conv1_act = ConvReLULayer(featScaled, cMap1, 25, kW1, kH1, hStride1, vStride1, 10, 1)

    # pool1
    pool1W = 2
    pool1H = 2
    pool1hStride = 2
    pool1vStride = 2
    pool1 = MaxPooling(conv1_act, pool1W, pool1H, pool1hStride, pool1vStride, imageLayout=$imageLayout$)

    # conv2
    kW2 = 5
    kH2 = 5
    cMap2 = 32
    hStride2 = 1
    vStride2 = 1
    # weight[cMap2, kW2 * kH2 * cMap1]
    # ConvReLULayer is defined in Macros.ndl
    conv2_act = ConvReLULayer(pool1, cMap2, 400, kW2, kH2, hStride2, vStride2, 10, 1)

    # pool2
    pool2W = 2
    pool2H = 2
    pool2hStride = 2
    pool2vStride = 2
    pool2 = MaxPooling(conv2_act, pool2W, pool2H, pool2hStride, pool2vStride, imageLayout=$imageLayout$)

    h1Dim = 128
    # DNNSigmoidLayer and DNNLayer are defined in Macros.ndl
    h1 = DNNSigmoidLayer(512, h1Dim, pool2, 1)
    ol = DNNLayer(h1Dim, labelDim, h1, 1)
    
    ce = CrossEntropyWithSoftmax(labels, ol)
    err = ErrorPrediction(labels, ol)

    # Special Nodes
    FeatureNodes = (features)
    LabelNodes = (labels)
    CriterionNodes = (ce)
    EvalNodes = (err)
    OutputNodes = (ol)
]
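
As a sanity check on the hard-coded numbers in this definition, here is the arithmetic behind them, assuming the standard output-size relation for a convolution or pooling window without zero padding, out = (in - window) / stride + 1:

conv1: (28 - 5)/1 + 1 = 24, giving a 24 x 24 x 16 output; pool1 (2x2, stride 2) reduces it to 12 x 12 x 16
conv2: (12 - 5)/1 + 1 = 8, giving an 8 x 8 x 32 output; pool2 (2x2, stride 2) reduces it to 4 x 4 x 32
inWCount for conv1 = 5 * 5 * 1 = 25; inWCount for conv2 = 5 * 5 * 16 = 400
input dimension of the dense layer = 4 * 4 * 32 = 512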

Putting it all Together

Here are the configuration files you will need to run this example:

Similar to the previous example, it runs using the following command:

cntk configFile=../Config/02_Convolution.cntk

The output folder will be created inside Image/MNIST/, and we get the test results on the console:

Final Results: Minibatch[1-625]: Samples Seen = 10000
err: ErrorPrediction/Sample = 0.0088    ce: CrossEntropyWithSoftmax/Sample = 0.029200574

With an error of 0.88%, this model (unsurprisingly) greatly outperforms the previous one. It took only 2-3 minutes and 15 epochs to train on a single GPU. It is worth mentioning that with a GPU the results can vary slightly from run to run due to some non-deterministic CUDA functions. Let us compare the loss per epoch for the three models: softmax regression, the one-hidden-layer network, and the CNN. As expected, the loss for the CNN drops at a faster rate and reaches a lower level than the loss of the other two models.

Training Loss

Now let's check out a technique that helps us to reduce the training time without significantly compromising the predictive performance of the model.

Improve Training: CNN with Batch Normalization

In this section we will add to our network a widely used technique called Batch Normalization. It helps to make training much more efficient. We will give a brief introduction before demonstrating how to add it to our CNN.

Batch Normalization in a Nutshell

One problem with training deep neural networks is that the distribution of each layer's inputs changes during training, because the parameters of the previous layers change. This slows down training. One common technique to address the problem is Batch Normalization (BN). BN normalizes the inputs of a layer, computing the normalization statistics over each training mini-batch. BN makes it possible to use much higher learning rates and allows us to be less careful about initialization. For more details, please refer to the paper here.
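
Concretely, for a mini-batch B = {x_1, ..., x_m} of a single activation, the transformation described in the referenced paper is sketched below (\gamma and \beta are learned scale and shift parameters, and \epsilon is a small constant for numerical stability):

\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta

At test time the mini-batch statistics are replaced by running estimates collected during training, which is exactly the switch the batchNormEvalMode edit performs later in this section.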

Disclaimer: Batch Normalization is currently only supported on GPU.

The Network Definition

We start from the CNN we just built and add a batch normalization node to each layer's macro. Also, we change the sigmoid layer to a ReLU layer as it gives a slight bump to the model's accuracy.

# macros to include
load = ndlMnistMacros

# the actual NDL that defines the network
run = DNN

ndlMnistMacros = [
    imageW = 28
    imageH = 28
    labelDim = 10

    features = ImageInput(imageW, imageH, 1, imageLayout=$imageLayout$)
    featScale = Constant(0.00390625)
    featScaled = Scale(featScale, features)
    labels = InputValue(labelDim)

    scValue = 1
    expAvg = 1
    
    convWScale = 10
    convBValue = 0
]

DNN = [
    # conv1
    kW1 = 5
    kH1 = 5
    cMap1 = 16
    hStride1 = 1
    vStride1 = 1
    # weight[cMap1, kW1 * kH1 * inputChannels]
    # ConvBNReLULayer is defined in Macros.ndl
    conv1 = ConvBNReLULayer(featScaled, cMap1, 25, kW1, kH1, hStride1, vStride1, convWScale, convBValue, scValue, expAvg)

    # pool1
    pool1W = 2
    pool1H = 2
    pool1hStride = 2
    pool1vStride = 2
    pool1 = MaxPooling(conv1, pool1W, pool1H, pool1hStride, pool1vStride, imageLayout=$imageLayout$)

    # conv2
    kW2 = 5
    kH2 = 5
    cMap2 = 32
    hStride2 = 1
    vStride2 = 1
    # weight[cMap2, kW2 * kH2 * cMap1]
    conv2 = ConvBNReLULayer(pool1, cMap2, 400, kW2, kH2, hStride2, vStride2, convWScale, convBValue, scValue, expAvg)

    # pool2
    pool2W = 2
    pool2H = 2
    pool2hStride = 2
    pool2vStride = 2
    pool2 = MaxPooling(conv2, pool2W, pool2H, pool2hStride, pool2vStride, imageLayout=$imageLayout$)

    h1Dim = 128
    fcWScale = 1
    fcBValue = 0
    # DnnBNReLULayer is defined in Macros.ndl
    h1 = DnnBNReLULayer(1568, h1Dim, pool2, fcWScale, fcBValue, scValue, expAvg)
    ol = DNNLayer(h1Dim, labelDim, h1, 1)
    
    ce = CrossEntropyWithSoftmax(labels, ol)
    err = ErrorPrediction(labels, ol)
    
    # Special Nodes
    FeatureNodes = (features)
    LabelNodes = (labels)
    CriterionNodes = (ce)
    EvalNodes = (err)
    OutputNodes = (ol)
]

The behaviour of a batch normalization node changes at test time. To apply this change in CNTK, we use MEL to edit our trained model. Since your network may contain many BN nodes, the easiest way to switch all of them is to set the batchNormEvalMode property to true for every applicable node in the subtree rooted at the CE node: the command updates every node that supports this property and leaves all other nodes unchanged. This is done in the file 03_ConvBatchNorm.mel as follows:

SetPropertyForSubTree(CE, batchNormEvalMode, true)

In the main configuration file we add an edit command to execute the model update:

CreateEvalModel=[    
    action=edit
    CurModel=$ModelDir$/03_ConvBatchNorm
    NewModel=$ModelDir$/03_ConvBatchNorm.Eval
    editPath=$ConfigDir$/03_ConvBatchNorm.mel
]

Putting it all Together

Here are the configuration files you need to run this example:

Similar to the previous example, it runs using the following command:

cntk configFile=../Config/03_ConvBatchNorm.cntk

The output folder will be created inside Image/MNIST/, and we get the test results on the console:

Final Results: Minibatch[1-313]: Samples Seen = 10000    
err: ErrorPrediction/Sample = 0.0096    ce: CrossEntropyWithSoftmax/Sample = 0.029642786

With Batch Normalization the network achieves a tiny 0.96% error rate after training for just 2 epochs (and less than 30 seconds)!
