Changes to gradients #417

Open · wants to merge 24 commits into master
3 changes: 3 additions & 0 deletions .travis.yml
@@ -6,6 +6,9 @@ julia:
- 1.4
- 1.5
- 1.6
+env:
+  - JULIA_NUM_THREADS=1
+  - JULIA_NUM_THREADS=2

services:
- docker
1 change: 0 additions & 1 deletion docs/make.jl
@@ -29,7 +29,6 @@ makedocs(
"Learning Generative Functions" => "ref/learning.md"
],
"Internals" => [
"Optimizing Trainable Parameters" => "ref/internals/parameter_optimization.md",
"Modeling Language Implementation" => "ref/internals/language_implementation.md"
]
]
6 changes: 3 additions & 3 deletions docs/src/index.md
@@ -6,7 +6,7 @@
Pages = [
"getting_started.md",
"tutorials.md",
"guide.md",
"guide.md"
]
Depth = 2
```
@@ -21,8 +21,8 @@ Pages = [
"ref/parameter_optimization.md",
"ref/inference.md",
"ref/gfi.md",
"ref/distributions.md"
"ref/extending.md",
"ref/distributions.md",
"ref/extending.md"
]
Depth = 2
```
3 changes: 2 additions & 1 deletion docs/src/ref/gfi.md
@@ -344,6 +344,7 @@
The **trainable parameters** of a generative function are (unlike arguments and random choices) *state* of the generative function itself, and are not contained in the trace.
Generative functions that have trainable parameters maintain *gradient accumulators* for these parameters, which get incremented by the gradient induced by the given trace by a call to [`accumulate_param_gradients!`](@ref).
Users then use these accumulated gradients to update to the values of the trainable parameters.
+Use [`get_parameters`](@ref) to obtain the full set of trainable parameters that a generative function uses (see [Optimizing Trainable Parameters](@ref) for more details).

### Return value gradient
The set of elements (either arguments, random choices, or trainable parameters) for which gradients are available is called the **gradient source set**.
@@ -371,5 +372,5 @@ has_argument_grads
accepts_output_grad
accumulate_param_gradients!
choice_gradients
-get_params
+get_parameters
```
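To make the accumulate-then-update cycle described in this hunk concrete, here is a minimal sketch (not part of the diff) using the API introduced in this PR. It assumes `get_parameters` takes the generative function and a parameter context and returns the parameters grouped by parameter store; check the docstrings for the exact signature.

```julia
using Gen

@gen function model(n::Int)
    @param theta::Float64
    for i in 1:n
        {(:y, i)} ~ normal(theta, 1.0)
    end
end

# New in this PR: parameters of dynamic DSL functions must be registered.
register_parameters!(model, [:theta])
init_parameter!((model, :theta), 0.0)

trace = simulate(model, (10,))
accumulate_param_gradients!(trace)  # increments the gradient accumulator for :theta

# Assumed signature: returns the parameters grouped by parameter store.
params = get_parameters(model, default_parameter_context)
```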
7 changes: 0 additions & 7 deletions docs/src/ref/internals/parameter_optimization.md
@@ -1,7 +0,0 @@
-# [Optimizing Trainable Parameters](@id optimizing-internal)
-
-To add support for a new type of gradient-based parameter update, create a new type with the following methods defined for the types of generative functions that are to be supported.
-```@docs
-Gen.init_update_state
-Gen.apply_update!
-```
30 changes: 15 additions & 15 deletions docs/src/ref/learning.md
@@ -51,10 +51,10 @@ end
```

Let's suppose we are training the generative model.
-The first step is to initialize the values of the trainable parameters, which for generative functions constructed using the built-in modeling languages, we do with [`init_param!`](@ref):
+The first step is to initialize the values of the trainable parameters; for generative functions constructed using the built-in modeling languages, we do this with [`init_parameter!`](@ref):
```julia
-init_param!(model, :a, 0.)
-init_param!(model, :b, 0.)
+init_parameter!((model, :a), 0.0)
+init_parameter!((model, :b), 0.0)
```
Each trace in the collection contains the observed data from an independent draw from our model.
We can populate each trace with its observed data using [`generate`](@ref):
@@ -76,29 +76,29 @@ for trace in traces
accumulate_param_gradients!(trace)
end
```
-Finally, we can construct and gradient-based update with [`ParamUpdate`](@ref) and apply it with [`apply!`](@ref).
+Finally, we can construct a gradient-based optimizer with [`init_optimizer`](@ref) and apply its update with [`apply_update!`](@ref).
We can put this all together into a function:
```julia
function train_model(data::Vector{ChoiceMap})
-init_param!(model, :theta, 0.1)
+init_parameter!((model, :theta), 0.1)
traces = []
for observations in data
trace, = generate(model, model_args, observations)
push!(traces, trace)
end
-update = ParamUpdate(FixedStepSizeGradientDescent(0.001), model)
+optimizer = init_optimizer(FixedStepGradientDescent(0.001), model)
for iter=1:max_iter
objective = sum([get_score(trace) for trace in traces])
println("objective: $objective")
for trace in traces
accumulate_param_gradients!(trace)
end
-apply!(update)
+apply_update!(optimizer)
end
end
```

-Note that using the same primitives ([`generate`](@ref) and [`accumulate_param_gradients!`](@ref)), you can compose various more sophisticated learning algorithms involving e.g. stochastic gradient descent and minibatches, and more sophisticated stochastic gradient optimizers like [`ADAM`](@ref).
+Note that using the same primitives ([`generate`](@ref) and [`accumulate_param_gradients!`](@ref)), you can compose more sophisticated learning algorithms, e.g. stochastic gradient descent with minibatches, as well as more sophisticated stochastic gradient optimizers.
For example, [`train!`](@ref) trains a generative function from complete data with minibatches.

## Learning from Incomplete Data
@@ -139,14 +139,14 @@ There are many variants possible, based on which Monte Carlo inference algorithm
For example:
```julia
function train_model(data::Vector{ChoiceMap})
-init_param!(model, :theta, 0.1)
-update = ParamUpdate(FixedStepSizeGradientDescent(0.001), model)
+init_parameter!((model, :theta), 0.1)
+optimizer = init_optimizer(FixedStepGradientDescent(0.001), model)
for iter=1:max_iter
traces = do_monte_carlo_inference(data)
for trace in traces
accumulate_param_gradients!(trace)
end
-apply!(update)
+apply_update!(optimizer)
end
end

@@ -160,14 +160,14 @@ end
Note that it is also possible to use a weighted collection of traces directly without resampling:
```julia
function train_model(data::Vector{ChoiceMap})
-init_param!(model, :theta, 0.1)
-update = ParamUpdate(FixedStepSizeGradientDescent(0.001), model)
+init_parameter!((model, :theta), 0.1)
+optimizer = init_optimizer(FixedStepGradientDescent(0.001), model)
for iter=1:max_iter
traces, weights = do_monte_carlo_inference_with_weights(data)
for (trace, weight) in zip(traces, weights)
accumulate_param_gradients!(trace, nothing, weight)
end
-apply!(update)
+apply_update!(optimizer)
end
end
```
@@ -209,7 +209,7 @@ Then, the traces of the model can be obtained by simulating from the variational
Instead of fitting the variational approximation from scratch for each observation, it is possible to fit an *inference model* instead, that takes as input the observation, and generates a distribution on latent variables as output (as in the wake sleep algorithm).
When we train the variational approximation by minimizing the evidence lower bound (ELBO) this is called amortized variational inference.
Variational autoencoders are an example.
-It is possible to perform amortized variational inference using [`black_box_vi`](@ref) or [`black_box_vimco!`](@ref).
+It is possible to perform amortized variational inference using [`black_box_vi!`](@ref) or [`black_box_vimco!`](@ref).

## References

24 changes: 12 additions & 12 deletions docs/src/ref/modeling.md
@@ -254,6 +254,7 @@ See [Generative Function Interface](@ref) for more information about traces.

A `@gen` function may begin with an optional block of *trainable parameter declarations*.
The block consists of a sequence of statements, beginning with `@param`, that declare the name and Julia type for each trainable parameter.
+The Julia type must be either a subtype of `Real` or a subtype of `Array{<:Real}`.
The function below has a single trainable parameter `theta` with type `Float64`:
```julia
@gen function foo(prob::Float64)
@@ -264,23 +265,22 @@ The function below has a single trainable parameter `theta` with type `Float64`:
end
```
Trainable parameters obey the same scoping rules as Julia local variables defined at the beginning of the function body.
-The value of a trainable parameter is undefined until it is initialized using [`init_param!`](@ref).
+After the definition of the generative function, you must register all of the parameters used by the generative function using [`register_parameters!`](@ref) (this is not required if you instead use the [Static Modeling Language](@ref)):
+```julia
+register_parameters!(foo, [:theta])
+```
+The value of a trainable parameter is undefined until it is initialized using [`init_parameter!`](@ref):
+```julia
+init_parameter!((foo, :theta), 0.0)
+```
In addition to the current value, each trainable parameter has a current **gradient accumulator** value.
The gradient accumulator value has the same shape (e.g. array dimension) as the parameter value.
-It is initialized to all zeros, and is incremented by [`accumulate_param_gradients!`](@ref).

-The following methods are exported for the trainable parameters of `@gen` functions:
+It is initialized to all zeros, and is incremented by calling [`accumulate_param_gradients!`](@ref) on a trace.
+Additional functions for retrieving and manipulating the values of trainable parameters and their gradient accumulators are described in [Optimizing Trainable Parameters](@ref).
```@docs
-init_param!
-get_param
-get_param_grad
-set_param!
-zero_param_grad!
+register_parameters!
```
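As noted above, the gradient accumulator shares the shape of the parameter value. A small sketch (a hypothetical model, not from the diff) with an array-valued parameter:

```julia
@gen function regression(xs::Vector{Float64})
    @param coeffs::Vector{Float64}  # array-valued trainable parameter
    for (i, x) in enumerate(xs)
        {(:y, i)} ~ normal(coeffs[1] + coeffs[2] * x, 1.0)
    end
end
register_parameters!(regression, [:coeffs])

# The value and its gradient accumulator both have shape (2,).
init_parameter!((regression, :coeffs), zeros(2))
```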

Trainable parameters are designed to be trained using gradient-based methods.
This is discussed in the next section.

## Differentiable programming

Given a trace of a `@gen` function, Gen supports automatic differentiation of the log probability (density) of all of the random choices made in the trace with respect to the following types of inputs:
85 changes: 67 additions & 18 deletions docs/src/ref/parameter_optimization.md
@@ -1,33 +1,82 @@
# Optimizing Trainable Parameters

-Trainable parameters of generative functions are initialized differently depending on the type of generative function.
-Trainable parameters of the built-in modeling language are initialized with [`init_param!`](@ref).
+## Parameter stores

-Gradient-based optimization of the trainable parameters of generative functions is based on interleaving two steps:
+Multiple traces of a generative function typically reference the same trainable parameters of the generative function, which are stored outside of the trace in a **parameter store**.
+Different types of generative functions may use different types of parameter stores.
+For example, the [`JuliaParameterStore`](@ref) (discussed below) stores parameters as Julia values in the memory of the Julia runtime process.
+Other types of parameter stores may store parameters in GPU memory, in a filesystem, or even remotely.

-- Incrementing gradient accumulators for trainable parameters by calling [`accumulate_param_gradients!`](@ref) on one or more traces.
+When generating a trace of a generative function with [`simulate`](@ref) or [`generate`](@ref), we may pass in an optional **parameter context**, which is a `Dict` indicating which parameter store(s) to look up parameter values in.
+A generative function obtains a reference to a specific type of parameter store by looking up its key in the parameter context.

-- Updating the value of trainable parameters and resetting the gradient accumulators to zero, by calling [`apply!`](@ref) on a *parameter update*, as described below.
+If you are just learning Gen, and are only using the built-in modeling language to write generative functions, you can ignore this complexity: there is a [`default_julia_parameter_store`](@ref) and a default parameter context [`default_parameter_context`](@ref) pointing to it, which are used whenever no parameter context is provided in a call to `simulate` or `generate`.
+```@docs
+default_parameter_context
+default_julia_parameter_store
+```
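As a sketch of how a non-default parameter context might be used (the zero-argument store constructor, the trailing store argument to `init_parameter!`, and the `parameter_context` keyword are all assumptions based on the prose above; verify against the docstrings):

```julia
using Gen

# A separate parameter store, e.g. for an experiment that should not
# touch the default store (assumes a zero-argument constructor).
store = JuliaParameterStore()
init_parameter!((model, :theta), 0.0, store)

# A parameter context mapping the Julia store key to the custom store:
context = Dict(JULIA_PARAMETER_STORE_KEY => store)

# Pass the context when generating traces:
trace = simulate(model, (10,); parameter_context=context)
```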

+## Julia parameter store

+Parameters declared using the `@param` keyword in the built-in modeling language are stored in a type of parameter store called a [`JuliaParameterStore`](@ref).
+A generative function can obtain a reference to a `JuliaParameterStore` by looking up the key [`JULIA_PARAMETER_STORE_KEY`](@ref) in a parameter context.
+This is how the built-in modeling language implementation finds the parameter stores to use for `@param`-declared parameters.
+Note that if you are defining your own [custom generative functions](@ref #Custom-generative-functions), you can also use a [`JuliaParameterStore`](@ref) (including the same parameter store used for parameters of built-in modeling language generative functions) to store and optimize your trainable parameters.

-## Parameter update
+Different types of parameter stores provide different APIs for reading, writing, and updating the values of parameters and their gradient accumulators.
+The `JuliaParameterStore` API is given below.
+The API uses tuples of the form `(gen_fn::GenerativeFunction, name::Symbol)` to identify parameters.
+(Note that most user learning code only needs [`init_parameter!`](@ref); the other functions are called by [Optimizers](@ref), which are discussed below.)

-A *parameter update* reads from the gradient accumulators for certain trainable parameters, updates the values of those parameters, and resets the gradient accumulators to zero.
-A paramter update is constructed by combining an *update configuration* with the set of trainable parameters to which the update should be applied:
```@docs
-ParamUpdate
+JuliaParameterStore
+init_parameter!
+increment_gradient!
+reset_gradient!
+get_parameter_value
+get_gradient
+JULIA_PARAMETER_STORE_KEY
```
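A short sketch of the store-level API listed above, using the `(gen_fn, name)` parameter-ID convention (the exact argument orders are assumptions; check the docstrings):

```julia
id = (model, :theta)           # identify the parameter by (generative function, name)

init_parameter!(id, 0.5)       # set the value and zero the gradient accumulator
increment_gradient!(id, 1.25)  # e.g. a manually computed gradient contribution
g = get_gradient(id)           # 1.25
v = get_parameter_value(id)    # 0.5
reset_gradient!(id)            # accumulator back to zero
```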
-The set of possible update configurations is described in [Update configurations](@ref).
-An update is applied with:

+### Multi-threaded gradient accumulation

+Note that the [`increment_gradient!`](@ref) call is thread-safe, so multiple threads can concurrently increment the gradient for the same parameters. This is useful for parallelizing gradient computation over a batch of traces within stochastic gradient descent learning algorithms, as in the sketch below.
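For example, gradient contributions from a batch of traces can be accumulated in parallel (a sketch; assumes `traces` and `optimizer` were set up as in the learning documentation above):

```julia
using Base.Threads: @threads

# Safe because gradient accumulation is thread-safe per the note above.
@threads for i in 1:length(traces)
    accumulate_param_gradients!(traces[i])
end
apply_update!(optimizer)  # one update for the whole batch
```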

+## Optimizers

+Gradient-based optimization typically involves iterating between two steps:
+(i) computing gradients or estimates of gradients with respect to parameters, and
+(ii) updating the value of the parameters based on the gradient estimates according to some mathematical rule.
+Sometimes the optimization algorithm also has its own state that is separate from the value of the parameters and the gradient estimates.
+Gradient-based optimization algorithms in Gen are implemented by **optimizers**.
+Each type of parameter store provides implementations of optimizers for standard mathematical update rules.

+The mathematical rules are defined in **optimizer configuration** objects.
+The currently supported optimizer configurations are:
```@docs
-apply!
+FixedStepGradientDescent
+DecayStepGradientDescent
```

+The most common way to construct an optimizer is via:
+```julia
+optimizer = init_optimizer(conf, gen_fn)
+```
+which returns an optimizer that applies the mathematical rule defined by `conf` to all parameters used by `gen_fn` (even when the generative function uses parameters that are housed in multiple parameter stores).
+You can also pass a parameter context keyword argument to customize the parameter store(s) that the optimizer should use.
+Then, after accumulating gradients with [`accumulate_param_gradients!`](@ref), you can apply the update with:
+```julia
+apply_update!(optimizer)
+```

+The `init_optimizer` method described above constructs an optimizer that actually invokes multiple optimizers, one for each parameter store.
+To add support to a parameter store type for a new optimizer configuration type, you must implement the per-parameter-store optimizer methods:

-## Update configurations
+- `init_optimizer(conf, parameter_ids, store)`, which takes an optimizer configuration object, a list of parameter IDs, and the parameter store in which to apply the updates, and returns an optimizer that mutates the given parameter store (see the sketch after this list).

+- `apply_update!(optimizer)`, which takes a single argument (the optimizer) and applies its update rule, mutating the values of the parameters in its parameter store (and typically also resetting the gradient accumulators to zero).
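A hypothetical sketch of these two methods for a new configuration type on a `JuliaParameterStore` (`ClippedGradientDescent` and the exact signatures are invented for illustration; in particular, it assumes the store can be passed as a trailing argument to the accessor functions, and that `init_parameter!` both sets the value and zeroes the accumulator):

```julia
struct ClippedGradientDescent
    step_size::Float64
    clip::Float64
end

struct ClippedGradientDescentOptimizer
    conf::ClippedGradientDescent
    parameter_ids::Vector
    store::JuliaParameterStore
end

function Gen.init_optimizer(
        conf::ClippedGradientDescent, parameter_ids, store::JuliaParameterStore)
    return ClippedGradientDescentOptimizer(conf, parameter_ids, store)
end

function Gen.apply_update!(opt::ClippedGradientDescentOptimizer)
    for id in opt.parameter_ids
        grad = clamp.(get_gradient(id, opt.store), -opt.conf.clip, opt.conf.clip)
        value = get_parameter_value(id, opt.store)
        # Increment in the direction of the gradient, following the sign
        # convention of the built-in configurations, and reset the accumulator.
        init_parameter!(id, value .+ opt.conf.step_size .* grad, opt.store)
    end
end
```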

-Gen has built-in support for the following types of update configurations.
```@docs
-FixedStepGradientDescent
-GradientDescent
-ADAM
+init_optimizer
+apply_update!
```
-For adding new types of update configurations, see [Optimizing Trainable Parameters (Internal)](@ref optimizing-internal).
1 change: 0 additions & 1 deletion docs/src/ref/selections.md
@@ -55,5 +55,4 @@ AllSelection
HierarchicalSelection
DynamicSelection
StaticSelection
-ComplementSelection
```
3 changes: 0 additions & 3 deletions src/Gen.jl
@@ -64,9 +64,6 @@ include("dynamic/dynamic.jl")
# static IR generative function
include("static_ir/static_ir.jl")

-# optimization for built-in generative functions (dynamic and static IR)
-include("builtin_optimization.jl")
-
# DSLs for defining dynamic embedded and static IR generative functions
# 'Dynamic DSL' and 'Static DSL'
include("dsl/dsl.jl")