
BatchNormalization

BatchNormalization(input, scale, bias, runMean, runVariance, spatial,
                   normalizationTimeConstant = 0, blendTimeConstant = 0,
                   epsilon = 0.00001,
                   useCntkEngine = true, tag='')

Note: CNTK's implementation of batch normalization relies on cuDNN and is fully implemented only for the GPU. On the CPU, only inference is supported.

BatchNormalization implements the technique described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Sergey Ioffe, Christian Szegedy). In short, it normalizes layer outputs for every minibatch, for each output (feature) independently, and applies an affine transformation to preserve the representational power of the layer. That is, for layer input:

m = mean(input)
var = variance(input)
input_norm = (input - m)/sqrt(epsilon + var)
output = gamma * input_norm + beta

where gamma and beta are trainable parameters (represented as Parameter).

mean and variance are estimated from the training data. In the simplest case, they are the mean and variance of the current minibatch during training. During inference, a long-term estimate is used instead. The long-term estimates are a low-pass-filtered version of the minibatch statistics, with the time constant (in samples) given by the normalizationTimeConstant parameter. (CNTK also allows using a MAP estimate during training, taking the running long-term estimate as a prior. The weight of the prior is controlled by the blendTimeConstant parameter. However, this has not been found useful so far in our experiments.)
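
To make the role of the time constant concrete: the running estimates behave like an exponential moving average of the minibatch statistics, with a smoothing factor derived from the time constant. The sketch below uses hypothetical names (minibatchSize, minibatchMean, minibatchVariance) and is only meant to convey the idea, not CNTK's literal update code:

alpha = exp(-minibatchSize / normalizationTimeConstant)
runMean = alpha * runMean + (1 - alpha) * minibatchMean
runVariance = alpha * runVariance + (1 - alpha) * minibatchVariance

As normalizationTimeConstant grows, alpha approaches 1 and the running estimates freeze; for a time constant approaching 0, alpha approaches 0 and the running estimates simply track the last seen minibatch, consistent with the description of normalizationTimeConstant below.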

BatchNormalization() has the following parameters (a usage sketch follows the list):

  • input is the input of the batch normalization node.
  • scale is a Parameter that stores the scale vector (the gamma term in the equation above).
  • bias is a Parameter that stores the bias vector (the beta term). scale and bias must have the same dimensions, which must equal the input dimensions when spatial = false, or the number of output convolution feature maps when spatial = true.
  • runMean is the running mean, which is used during the evaluation phase and may also be used during training. It is represented as a Parameter with the same dimensions as scale and bias.
  • runVariance is the running variance. It is represented as a Parameter with the same dimensions as scale and bias.
  • spatial is a flag that specifies whether to compute the mean/variance for each feature independently or, in the case of convolutional layers, per feature map.
  • normalizationTimeConstant is the time constant used to compute the running average of mean and variance. Value 0 (default) means there is no exponential smoothing and the running mean/variance always holds the values computed for the last seen minibatch. Value 1#INF (infinity) means the running values are "frozen" (i.e. they will not be updated). Depending on the data set and network configuration, different values can be used; for example, for the MNIST data set you can set it to 1024, and for speech data sets to the number of frames corresponding to a 24-hour period. The constant can also be set globally (in the .cntk config file) using the batchNormalizationTimeConstant parameter, for example: batchNormalizationTimeConstant=0:1024
  • blendTimeConstant is the time constant that specifies how much of the running mean/variance should be "blended" into the mean/variance of the current minibatch. Value 0 (default) means no blending happens and only the current minibatch statistics are used. Value 1#INF (infinity) means only the running mean/variance is used (this is the case, for example, during the evaluation phase). For example, you can start with 0, then set it to half of the minibatch size, and then set it to infinity after several epochs. This can be done with the .cntk (config) file option batchNormalizationBlendTimeConstant, for example: batchNormalizationBlendTimeConstant=0:32*10:1#INF. Note that the cuDNN engine (cf. useCntkEngine) only supports the values 0 and 1#INF.
  • epsilon is a small conditioning constant added to the variance when computing the inverse standard deviation, for numerical stability.
  • useCntkEngine is a boolean flag that specifies which batch normalization implementation to use: the CNTK-based one or the cuDNN-based one. Certain options are only supported by the CNTK engine, but the cuDNN engine is more performant.
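
To illustrate how these parameters fit together, here is a BrainScript sketch of non-spatial batch normalization applied to a layer output z of dimension outDim (z and outDim are placeholder names). The ParameterTensor declarations, initial values, and learningRateMultiplier = 0 (to keep the running statistics out of the gradient update) reflect typical usage and are assumptions, not required settings:

sc = ParameterTensor {outDim, initValue = 1}                             # scale (gamma)
b  = ParameterTensor {outDim, initValue = 0}                             # bias (beta)
m  = ParameterTensor {outDim, initValue = 0, learningRateMultiplier = 0} # running mean
v  = ParameterTensor {outDim, initValue = 0, learningRateMultiplier = 0} # running variance
y  = BatchNormalization (z, sc, b, m, v, false, normalizationTimeConstant = 1024)  # false = non-spatial

For a convolutional layer you would pass true for spatial instead, and the dimension of sc, b, m, and v would be the number of output feature maps rather than the full input dimension, as described above.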

For more information about time constants and exponential smoothing, see https://en.wikipedia.org/wiki/Exponential_smoothing#Time_Constant

Note that for the evaluation stage CNTK sets the time constants automatically; users do not have to change anything to switch between training and evaluation.
