Updaters

Overview

An updater is a generalization of a gradient optimization step. Several updaters do perform gradient optimization, but others do something different: multiplicative updates are used in NMF, for example, and models based on quotients of accumulated data are updated by recomputing the quotient from moving averages. Updating is broken out from generation of the gradient (or quotient pair), which happens in the model's update method. Updaters support a few methods, sketched after the list below:

init: initializes any parameters and working storage for the updater.
update: the basic update, called by the learner on each minibatch.
updateM: an update called at the end of a pass over the dataset.
clear: clears working storage; called at the beginning of a pass over the dataset.
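A minimal sketch of this interface in plain Scala is shown here; the method signatures and the Array[Double] model representation are illustrative assumptions, not BIDMach's actual API.

```scala
// Illustrative sketch of the updater interface described above.
// Signatures and types are assumptions for exposition only.
trait UpdaterSketch {
  def init(model: Array[Double]): Unit  // set up parameters and working storage
  def update(step: Int): Unit           // per-minibatch update, called by the learner
  def updateM(pass: Int): Unit          // update applied at the end of a dataset pass
  def clear(): Unit                     // reset accumulators at the start of a pass
}
```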

BatchNorm Updater

Used for batch-mode inference in LDA, NMF, and Gibbs sampling. It accumulates update data into numerator and denominator "accumulator" matrices during a pass over the dataset. At the end of the pass, it updates the model by computing the elementwise ratio of the accumulators.
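The sketch below illustrates this accumulate-then-divide scheme in plain Scala; the class name, flat-array representation, and eps smoothing term are assumptions for illustration, not BIDMach's implementation.

```scala
// Sketch of the BatchNorm scheme: sum numerator/denominator contributions
// per minibatch, then set the model to their elementwise ratio at end of pass.
class BatchNormSketch(size: Int) {
  val model = Array.fill(size)(0.0)
  private val num = Array.fill(size)(0.0)
  private val den = Array.fill(size)(0.0)

  // called at the start of each pass over the dataset
  def clear(): Unit = {
    java.util.Arrays.fill(num, 0.0)
    java.util.Arrays.fill(den, 0.0)
  }

  // called once per minibatch with that minibatch's contributions
  def update(u: Array[Double], d: Array[Double]): Unit = {
    var i = 0
    while (i < size) { num(i) += u(i); den(i) += d(i); i += 1 }
  }

  // called at the end of the pass: model = accumulated numerator / denominator
  def updateM(eps: Double = 1e-12): Unit = {
    var i = 0
    while (i < size) { model(i) = num(i) / (den(i) + eps); i += 1 }
  }
}
```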

IncNorm Updater

The incremental version of the BatchNorm updater. Both the numerator and the denominator are updated using moving averages:

<math>M^{(t)} = \alpha U^{(t)} + (1-\alpha) M^{(t-1)} </math>

where <math>M^{(t)}</math> is the accumulator matrix at time <math>t</math>, <math>U^{(t)}</math> is the update at time <math>t</math>, and <math>\alpha</math> is defined as

<math>\alpha = \left( \frac{1}{t}\right)^p</math>

A value of <math>p=1</math> (set with the power option to the updater) gives a uniform average of the updates up to time <math>t</math>. Smaller values of <math>p</math> weight recent data more heavily. A value of <math>p=0.5</math> mimics the temporal envelope of ADAGRAD. In practice, values smaller than 0.5 often give the best performance. The default is 0.3, which works well with most models.
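The sketch below shows the moving-average rule with <math>\alpha = (1/t)^p</math> in plain Scala. The default power of 0.3 follows the text above; the class name, flat-array representation, and the small floor on the denominator are illustrative assumptions.

```scala
// Sketch of the IncNorm scheme: blend each minibatch's numerator/denominator
// into moving averages with alpha = (1/t)^p, then recompute the model ratio.
class IncNormSketch(size: Int, power: Double = 0.3) {
  val model = Array.fill(size)(0.0)
  private val num = Array.fill(size)(0.0)
  private val den = Array.fill(size)(0.0)
  private var t = 0

  // per-minibatch update
  def update(u: Array[Double], d: Array[Double]): Unit = {
    t += 1
    // p = 1 gives a uniform average of all updates; smaller p favors recent data
    val alpha = math.pow(1.0 / t, power)
    var i = 0
    while (i < size) {
      num(i) = alpha * u(i) + (1 - alpha) * num(i)
      den(i) = alpha * d(i) + (1 - alpha) * den(i)
      model(i) = num(i) / math.max(den(i), 1e-12)
      i += 1
    }
  }
}
```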
