BatchNormalization

Frank Seide edited this page Aug 28, 2016 · 18 revisions
BatchNormalization(input, scale, bias, runMean, runVariance, spatial,
                   normalizationTimeConstant = 0, blendTimeConstant = 0,
                   epsilon = 0.00001,
                   useCntkEngine = true, tag='')

Parameters

  • input is the input of the batch normalization node
  • scale is a Parameter that stores the scale vector (the gamma term in the equation below).
  • bias is a Parameter that stores the bias vector (the beta term). scale and bias must have the same dimensions, which must equal the input dimensions if spatial = false, or the number of output convolution feature maps if spatial = true.
  • runMean is the running mean which is used during evaluation phase and might be used during training as well. It is represented as a Parameter with the same dimensions as scale and bias.
  • runVariance is the running variance. It is represented as a Parameter with the same dimensions as scale and bias.
  • spatial is a flag that selects how mean and variance are computed: independently for each feature in a minibatch (spatial = false) or, for convolutional layers, per feature map (spatial = true).
  • normalizationTimeConstant (default 0): time constant for computing running average of mean and variance as a low-pass filtered version of the batch statistics. Note: The default is not typically what you want.
  • blendTimeConstant (default 0): allows smoothing the batch estimates with the running statistics.
  • epsilon is a conditioner constant added to the variance when computing the inverse standard deviation.
  • useCntkEngine (default: true): set this to false to select the GPU-only CuDNN implementation
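
To make the effect of the spatial flag on parameter dimensions concrete, here is a minimal plain-Python sketch of how the means would be pooled in each mode. The nested-list layout (samples, then feature maps, then positions) and the helper name batch_means are illustrative assumptions, not CNTK's tensor layout or API.

```python
def batch_means(batch, spatial):
    """Compute minibatch means for either setting of the spatial flag.

    batch: list of samples; each sample is a list of feature maps;
           each feature map is a flat list of values (assumed layout).
    """
    n = len(batch)
    num_maps = len(batch[0])
    map_size = len(batch[0][0])
    if spatial:
        # One mean per feature map, pooled over all samples and positions.
        return [
            sum(sum(sample[c]) for sample in batch) / (n * map_size)
            for c in range(num_maps)
        ]
    # One mean per individual input element, pooled over samples only.
    return [
        [sum(sample[c][p] for sample in batch) / n for p in range(map_size)]
        for c in range(num_maps)
    ]
```

With spatial = true the result has one entry per feature map, which is why scale, bias, runMean and runVariance have that many elements; with spatial = false it has one entry per input element.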

Return value

The batch-normalized input.

Description

BatchNormalization implements the technique described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Sergey Ioffe, Christian Szegedy). In short, it normalizes the layer outputs of every minibatch, independently for each output (feature), and then applies an affine transformation to preserve the representation of the layer. That is, for layer input:

m = mean(input)
var = variance(input)
input_norm = (input - m)/sqrt(epsilon + var)
output = gamma * input_norm + beta

where gamma and beta are trainable parameters (represented as Parameter).
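
The four equations above can be sketched in plain Python for the non-spatial case. This is an illustration under stated assumptions, not CNTK's implementation: the function name is hypothetical, the batch is a list of per-sample feature lists, and the variance is taken as the population (divide-by-N) variance.

```python
import math

def batch_norm_forward(batch, gamma, beta, epsilon=0.00001):
    """Normalize each feature over the minibatch, then scale and shift.

    batch: list of samples, each a list of feature values (assumed layout).
    gamma, beta: per-feature scale and bias lists.
    """
    n = len(batch)
    num_features = len(batch[0])
    output = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [sample[j] for sample in batch]
        m = sum(column) / n
        # Population variance of this feature (an assumption of this sketch).
        var = sum((x - m) ** 2 for x in column) / n
        inv_std = 1.0 / math.sqrt(epsilon + var)
        for i in range(n):
            input_norm = (column[i] - m) * inv_std
            output[i][j] = gamma[j] * input_norm + beta[j]
    return output
```

For example, with gamma = 1 and beta = 0, a feature that takes the values 1 and 3 in a two-sample minibatch is mapped to approximately -1 and 1.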

mean and variance are estimated from training data. In the simplest case, they are the mean and variance of the current minibatch during training. In inference, a long-term estimate is used instead.

The long-term estimates are a low-pass-filtered version of the minibatch statistics, with the time constant (in samples) given by the normalizationTimeConstant parameter. A value of 0 means there will be no exponential smoothing and running mean/variance will always be equal to those of the last seen minibatch. This is often undesirable. Instead, it is recommended to use a value of a few thousand here. The BatchNormalizationLayer{} wrapper has a default of 5000.
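
The low-pass filter can be sketched as follows. This page does not state how the time constant is converted into a per-minibatch weight; the sketch assumes the common form 1 - exp(-N / T) for a minibatch of N samples and time constant T, and both function names are hypothetical. Negative time constants are outside the scope of this sketch.

```python
import math

def exp_avg_factor(num_samples, normalization_time_constant):
    """Weight of the current minibatch statistic in the running estimate."""
    if normalization_time_constant < 0:
        raise ValueError("negative time constants are not covered by this sketch")
    if normalization_time_constant == 0:
        # No smoothing: the running value becomes the last minibatch value.
        return 1.0
    # Assumed conversion from a time constant in samples to a per-minibatch weight.
    return 1.0 - math.exp(-num_samples / normalization_time_constant)

def update_running(running, batch_stat, num_samples, normalization_time_constant):
    """Blend a minibatch statistic into the running (long-term) estimate."""
    a = exp_avg_factor(num_samples, normalization_time_constant)
    return (1.0 - a) * running + a * batch_stat
```

Under this assumption, a time constant of 5000 with minibatches of 64 samples gives each minibatch a weight of roughly 1.3%, so the running estimate changes slowly and averages over many minibatches.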

For more information about time constants and exponential smoothing: https://en.wikipedia.org/wiki/Exponential_smoothing#Time_Constant

Because minibatch statistics can be noisy, CNTK also allows using a MAP (maximum-a-posteriori) estimate during training, where the running long-term estimate is taken as the prior. The weight of the prior is controlled by the blendTimeConstant parameter. However, this has not been found useful so far in our experiments.
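
One way to picture this blending is as a weighted average of the running estimate (the prior) and the minibatch statistic. The weighting T / (T + N) used below, with T the blend time constant and N the minibatch size, is an assumption of this sketch and is not specified on this page; the function names are hypothetical.

```python
def blend_factor(num_samples, blend_time_constant):
    """Weight given to the running estimate, which acts as the prior."""
    if blend_time_constant == 0:
        # Default: use the minibatch statistics alone.
        return 0.0
    # Assumed weighting: the prior counts as blend_time_constant "virtual" samples.
    return blend_time_constant / (blend_time_constant + num_samples)

def blended_stat(running, batch_stat, num_samples, blend_time_constant):
    """Statistic used for normalization during training with blending."""
    w = blend_factor(num_samples, blend_time_constant)
    return w * running + (1.0 - w) * batch_stat
```

With a minibatch of 100 samples and a blend time constant of 300, the running estimate would receive weight 0.75 and the minibatch statistic 0.25.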

Note that during inference, CNTK sets both time constants automatically so that only the existing running mean and variance are used, and they are not updated. No explicit action is needed by the user.
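
In other words, inference applies the same normalization equations with the stored running statistics in place of minibatch statistics. A minimal sketch for a single sample in the non-spatial case, with a hypothetical function name:

```python
import math

def batch_norm_inference(sample, gamma, beta, run_mean, run_variance, epsilon=0.00001):
    """Normalize one sample with fixed running statistics (nothing is updated)."""
    return [
        g * (x - m) / math.sqrt(epsilon + v) + b
        for x, g, b, m, v in zip(sample, gamma, beta, run_mean, run_variance)
    ]
```

Because no minibatch statistics are involved, the output for a given sample does not depend on which other samples are evaluated alongside it.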

CuDNN implementation

By default, this primitive uses a CNTK implementation which works with both GPUs and CPUs. You can choose to use the CuDNN implementation, which is more performant. Note, however, that the CuDNN implementation does not support all options, and cannot run without GPUs.