Updaters
An updater is a generalization of a gradient optimization step. Several updaters do perform gradient optimization, but others do something different: for example, multiplicative updates are used in NMF, and models based on quotients of accumulated data are updated by recomputing the quotient on moving averages. Updating is kept separate from the generation of the gradient (or quotient pair), which happens in the model's update method. Updaters support a few methods:
* init: initializes any params and working storage for the updater.
* update: the basic update, called by the learner on each minibatch.
* updateM: an update called at the end of a pass over the dataset.
* clear: clears working storage, called at the beginning of a dataset pass.
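The method set above amounts to a small interface. Below is a minimal sketch of it in Scala, using a hypothetical Updater trait and plain arrays in place of the library's matrix types; the names here are illustrative assumptions, not the actual API.

```scala
// Minimal sketch of the updater lifecycle (illustrative, not the library's API).
trait Updater {
  def init(modelSize: Int): Unit          // allocate params and working storage
  def update(grad: Array[Double]): Unit   // basic update, called once per minibatch
  def updateM(): Unit                     // called at the end of a pass over the dataset
  def clear(): Unit                       // reset working storage before a new pass
}
```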
The Batch Norm updater is used for batch-mode inference in LDA, NMF and Gibbs sampling. It accumulates update data into numerator and denominator "accumulator" matrices during a pass over the dataset. At the end of the pass, it updates the model by computing the ratio of the accumulators.
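As an illustration of this accumulate-then-ratio pattern, here is a small self-contained sketch. The class name, plain arrays and the small epsilon guard are illustrative assumptions, not the library's actual types.

```scala
// Sketch of a batch accumulate-then-ratio update (illustrative types and names).
class BatchNormSketch(n: Int) {
  val num   = Array.fill(n)(0.0)   // numerator accumulator
  val den   = Array.fill(n)(0.0)   // denominator accumulator
  val model = Array.fill(n)(0.0)

  // Accumulate a (numerator, denominator) update pair from one minibatch.
  def update(u: Array[Double], d: Array[Double]): Unit =
    for (i <- 0 until n) { num(i) += u(i); den(i) += d(i) }

  // At the end of the pass, the model becomes the ratio of the accumulators.
  def updateM(): Unit =
    for (i <- 0 until n) model(i) = num(i) / math.max(den(i), 1e-12)

  // Clear the accumulators before the next pass over the dataset.
  def clear(): Unit = {
    java.util.Arrays.fill(num, 0.0)
    java.util.Arrays.fill(den, 0.0)
  }
}
```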
The incremental version of the Batch Norm updater updates both the numerator and the denominator with moving averages:
<math>M^{(t)} = \alpha U^{(t)} + (1-\alpha) M^{(t-1)} </math>
where <math>M^{(t)}</math> is the matrix at time <math>t</math>, <math>U^{(t)}</math> is the update at time <math>t</math>, and <math>\alpha</math> is defined as
<math>\alpha = \left( \frac{1}{t}\right)^p</math>
A value of <math>p=1</math> (set with the power option to the updater) gives a uniform average of the updates up to time <math>t</math>. Smaller values of <math>p</math> weight recent data more heavily. A value of <math>p=0.5</math> mimics the temporal envelope of ADAGRAD. In practice, values smaller than 0.5 often give the best performance. The default value is 0.3, which works well with most models.
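To make the moving-average formula concrete, here is a small sketch of the incremental variant. Only the <math>\alpha = (1/t)^p</math> blending follows the formulas above; the class name, array types and epsilon guard are illustrative assumptions.

```scala
// Sketch of the incremental (moving-average) variant (illustrative types and names).
class IncNormSketch(n: Int, p: Double = 0.3) {
  val num = Array.fill(n)(0.0)
  val den = Array.fill(n)(0.0)
  var t   = 0

  // Blend the new (numerator, denominator) pair into the moving averages.
  def update(u: Array[Double], d: Array[Double]): Unit = {
    t += 1
    val alpha = math.pow(1.0 / t, p)   // p = 1: uniform average; p = 0.5: ADAGRAD-like envelope
    for (i <- 0 until n) {
      num(i) = alpha * u(i) + (1 - alpha) * num(i)
      den(i) = alpha * d(i) + (1 - alpha) * den(i)
    }
  }

  // The model can be recomputed at any time as the ratio of the moving averages.
  def model: Array[Double] = Array.tabulate(n)(i => num(i) / math.max(den(i), 1e-12))
}
```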
The plain gradient updater simply adds gradient updates to the model, using a decay schedule. It includes a few options:
* lrate:FMat(1f): the learning rate, i.e. the scale factor applied to gradients at each step.
* texp:FMat(0.5f): the decay exponent.
* waitsteps = 2: the number of steps to wait before starting decay.
* mask:FMat = null: a mask used to prevent updates to some values (e.g. constants in the model).
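A rough sketch of such a decayed gradient step follows. The exact decay schedule and mask semantics are assumptions (a step of size lrate * t^(-texp) after waitsteps minibatches, and a 0/1 mask multiplying the update); only the option names come from the list above.

```scala
// Sketch of a gradient step with a decaying learning rate (illustrative, assumed schedule).
class GradSketch(n: Int,
                 lrate: Double = 1.0,
                 texp: Double = 0.5,
                 waitsteps: Int = 2,
                 mask: Option[Array[Double]] = None) {
  val model = Array.fill(n)(0.0)
  var t = 0

  def update(grad: Array[Double]): Unit = {
    t += 1
    // Hold the learning rate constant for the first waitsteps steps, then decay as t^(-texp).
    val scale = if (t <= waitsteps) lrate else lrate * math.pow(t, -texp)
    for (i <- 0 until n) {
      val m = mask.map(_(i)).getOrElse(1.0)   // a 0 entry keeps that model value constant
      model(i) += scale * m * grad(i)
    }
  }
}
```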