BatchNormalization
BatchNormalization(input, scale, bias, runMean, runVariance, spatial,
normalizationTimeConstant = 0, blendTimeConstant = 0,
epsilon = 0.00001,
useCntkEngine = true, tag='')
- input is the input of the batch normalization node
- scale is a Parameter that stores the scale vector (the gamma term in the equation below).
- bias is a Parameter that stores the bias vector (the beta term). scale and bias must have the same dimensions, which must equal the input dimensions in case of spatial = false, or the number of output convolution feature maps in case of spatial = true.
- runMean is the running mean, which is used during the evaluation phase and might be used during training as well. It is represented as a Parameter with the same dimensions as scale and bias.
- runVariance is the running variance. It is represented as a Parameter with the same dimensions as scale and bias.
- spatial is a flag that specifies whether to compute the mean/variance for each feature in a minibatch independently or, in the case of convolutional layers, per feature map (see the sketch below).
- normalizationTimeConstant (default 0): time constant (in samples) for computing the running average of mean and variance as a low-pass-filtered version of the batch statistics. Note: the default is not typically what you want.
- blendTimeConstant (default 0): allows smoothing the batch estimates with the running statistics
- epsilon is a conditioner constant added to the variance when computing the inverse standard deviation.
- useCntkEngine (default: true): set this to false to select the GPU-only cuDNN implementation
Return value: the batch-normalized input.
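To make the spatial flag concrete, the following NumPy sketch shows which axes the statistics are pooled over in each case. The (batch, feature maps, height, width) layout and the variable names are illustrative assumptions, not CNTK conventions.

import numpy as np

# hypothetical convolution output: (batch, feature maps, height, width)
x = np.random.randn(8, 16, 32, 32)

# spatial = false: one mean/variance per input element,
# computed over the minibatch dimension only
mean_nonspatial = x.mean(axis=0)        # shape (16, 32, 32)
var_nonspatial  = x.var(axis=0)

# spatial = true: one mean/variance per feature map,
# pooled over the minibatch and all spatial positions
mean_spatial = x.mean(axis=(0, 2, 3))   # shape (16,)
var_spatial  = x.var(axis=(0, 2, 3))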
BatchNormalization
implements the technique described in the paper
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Sergey Ioffe, Christian Szegedy).
In short, it normalizes layer outputs for every minibatch for each output (feature) independently and applies an affine transformation to preserve the representation of the layer. That is, for layer input input:
m = mean(input)
var = variance(input)
input_norm = (input - m)/sqrt(epsilon + var)
output = gamma * input_norm + beta
where gamma
and beta
are trainable parameters (represented as Parameter).
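As a minimal, illustrative NumPy sketch of this training-time computation (non-spatial case): the function name batch_norm_forward and the (batch, features) layout are assumptions for illustration, not part of the CNTK API.

import numpy as np

def batch_norm_forward(x, gamma, beta, epsilon=1e-5):
    # x: (batch, features); statistics are computed per feature over the minibatch
    m = x.mean(axis=0)
    var = x.var(axis=0)
    input_norm = (x - m) / np.sqrt(epsilon + var)
    return gamma * input_norm + beta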
mean
and variance
are estimated from training data. In the simplest case, they are the mean and variance
of the current minibatch during training. In inference, a long-term estimate is used instead.
The long-term estimates are a low-pass-filtered version of the minibatch statistics, with the time constant
(in samples) given by the normalizationTimeConstant
parameter.
A value of 0 means there will be no exponential smoothing, and the running mean/variance
will always be equal to those of the last seen minibatch.
This is often undesirable.
Instead, it is recommended to use a value of a few thousand here.
The BatchNormalizationLayer{}
wrapper has a default of 5000.
For more information about time constants and exponential smoothing: https://en.wikipedia.org/wiki/Exponential_smoothing#Time_Constant
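The sketch below shows one way to interpret such a time constant as a per-minibatch smoothing factor of a first-order low-pass filter. The keep = exp(-minibatch_size / time_constant) factor is an illustrative assumption based on standard exponential smoothing, not necessarily the exact formula CNTK uses internally.

import numpy as np

def update_running_stats(run_mean, run_var, batch_mean, batch_var,
                         minibatch_size, normalization_time_constant):
    # time constant of 0: no smoothing, the running statistics simply
    # track the statistics of the last seen minibatch
    if normalization_time_constant == 0:
        return batch_mean, batch_var
    # fraction of the old estimate kept after seeing minibatch_size samples
    keep = np.exp(-minibatch_size / normalization_time_constant)
    run_mean = keep * run_mean + (1 - keep) * batch_mean
    run_var  = keep * run_var  + (1 - keep) * batch_var
    return run_mean, run_var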
Because minibatch statistics can be noisy,
CNTK also allows the use of a MAP (maximum-a-posteriori) estimate during training,
where the running long-term estimate is taken as the prior.
The weight of the prior is controlled by the blendTimeConstant parameter.
However, this has not been found useful so far in our experiments.
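For concreteness, here is a sketch of such a MAP-style blend. The particular weighting, which treats the running (prior) statistics as blend_time_constant samples worth of evidence, is an assumption for illustration and not necessarily the exact formula CNTK uses.

def blend_statistics(batch_mean, batch_var, run_mean, run_var,
                     minibatch_size, blend_time_constant):
    # blend time constant of 0 (the default): use the minibatch statistics alone
    if blend_time_constant == 0:
        return batch_mean, batch_var
    # hypothetical weighting: the prior counts as blend_time_constant samples
    prior_weight = blend_time_constant / (blend_time_constant + minibatch_size)
    mean = prior_weight * run_mean + (1 - prior_weight) * batch_mean
    var  = prior_weight * run_var  + (1 - prior_weight) * batch_var
    return mean, var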
Note that during inference, CNTK sets both time constants automatically such that only the existing running mean is used and it is not updated. No explicit action is needed by the user.
By default, this primitive uses a CNTK implementation that works with both GPUs and CPUs. You can choose to use the cuDNN implementation instead, which is more performant. Note, however, that the cuDNN implementation does not support all options, and it requires a GPU.