Changes to gradients #417
base: master
Conversation
Some of the changes in this PR (in particular the added support for multi-threaded gradient accumulation, and use of more in-place array operations) were motivated by this example, which applies reweighted wake-sleep to learn a generative model of binarized MNIST digits. The other changes aim to reduce technical debt and improve extensibility.
@marcoct This is awesome! It'll be really great to have multithreaded SGD. LGTM, for the next breaking-changes release.
Some questions:
- It seems that dynamic DSL traces store both a parameter store and a parameter context. Why? Isn't the parameter store just a particular field of the parameter context? Is this just to avoid looking it up each time you want to use it?
- What do the keys of the parameter context represent, intuitively? It seems like maybe the answer is "parameter groups" -- a GF expects to be able to find a store for each parameter group it uses, and for each parameter group, it expects the store to support certain functionality. Is that right?
- Is there some way of factoring an optimizer's logic (the OptimConf) from its implementation for each parameter store? I suppose an optimizer could be parametric in the type of parameter store -- but this would require some general interface for parameter stores, which it seems is not the goal here.
- If a GF at some point uses parameters that are not stored in the default Julia parameter store, does the user always need to pass in a parameter context providing them? Or is there some way the default parameter context can be extended? It's currently unclear to me exactly how we'd adapt Torch generative functions to this interface. One option is to have a dummy struct TorchParameterStore that actually stores nothing, and just looks up (gen_fn, id) inside the gen_fn object itself. Or is it explicitly a goal of this PR that generative functions should never themselves store parameters? If so, an alternative is to figure out a different way of constructing the Torch models so that the parameters really do live in separate PyObjects and can be swapped out at will; presumably there is a Torch way to do this, but it may make it less natural to construct Torch models.
- Do all parameter stores need to support multithreading, or do we want a way for a parameter store to mark itself as not supporting multithreading?
- For now, I think this is fine, but I wonder about the design choice to have only one parameter store for each key. What if I want the top-level GF to use one Julia parameter store, but further down, a different one? (Not sure why I'd want this, to be fair. But I can see it being one pattern of implementation for Torch / TF GFs -- where each GF uses a different parameter store.) More generally, a parameter context is weird in that it is one of the few Gen data structures that is not hierarchically organized according to a GF's call tree. Is this intentional long-term, or a short-term simplification?
- The name register_parameters! makes me think that calling it multiple times will register all the parameters that I pass in. But in fact, it will register only the most recent set of parameters. Maybe we can either change the behavior or the name? ("set_parameters!"?)
- I wonder if there is some way to avoid users having to register all parameters that a generative function might use? Can we do something akin to AllSelection? (I guess the user has to initialize all the parameters anyway... so maybe this doesn't help much.)
Yes.
The original motivation is that a given generative function may use parameters that are managed by different runtime systems (e.g. Julia in-memory, PyTorch, etc.), and that each runtime system exposes a parameter store. I think it's possible the same functionality can be used for other purposes, but I'm not sure.
@alex-lew Do you mean, for each optimizer configuration, storing some symbolic expression representing the mathematical update and having each parameter store generate its code from that expression?
No, not all parameter stores need to support multithreading. We could define a way for a parameter store to indicate whether it supports multithreaded gradient accumulation.
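A minimal sketch of what such an opt-out could look like; the abstract supertype and the predicate name supports_multithreaded_accumulation are hypothetical illustrations, not part of this PR.

```julia
# Hypothetical sketch only: a trait-style predicate that a parameter store can
# override to opt out of multithreaded gradient accumulation.
abstract type ParameterStore end

# Default assumption: a store tolerates concurrent gradient accumulation.
supports_multithreaded_accumulation(::ParameterStore) = true

# A store backed by a runtime that is not thread-safe opts out explicitly.
struct SingleThreadedStore <: ParameterStore
    values::Dict{Symbol,Any}
end
supports_multithreaded_accumulation(::SingleThreadedStore) = false
```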
This new interface allows for a generalization of the previous behavior. Previously, each generative function was tied to a single parameter store. It is possible to reproduce the previous behavior by creating a separate parameter store for each generative function (although I don't see why one would want to do this). Note that this new, more flexible interface lets you have two copies of the same generative function that use different parameters.
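A rough sketch of that pattern, assuming a Dict-based parameter context keyed by :julia, a parameter_context keyword on simulate, an init_parameter! initializer, and @param declarations still being used in the DML; these details are assumptions about the PR's exact API, while JuliaParameterStore itself comes from the PR.

```julia
using Gen

@gen function noisy_reading()
    @param mu::Float64
    return @trace(normal(mu, 1.0), :x)
end

# Two stores holding two different values for the same parameter
# (init_parameter! and the context layout below are assumed, not confirmed).
store_a = JuliaParameterStore()
store_b = JuliaParameterStore()
init_parameter!(store_a, (noisy_reading, :mu), 0.0)
init_parameter!(store_b, (noisy_reading, :mu), 5.0)

# Two traces of the same generative function, backed by different parameters.
trace_a = simulate(noisy_reading, (); parameter_context=Dict(:julia => store_a))
trace_b = simulate(noisy_reading, (); parameter_context=Dict(:julia => store_b))
```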
Can we start with there always being a single PyTorchParameterStore instance (automatically constructed by PyTorch) for the PyTorch runtime that is running in the process, and have it store references to the parameters of the form (gen_fn, id)?
The goals of this aspect of the PR were to allow for the possibility of there being multiple versions of a single parameter value, to allow two generative functions to directly use the same parameter, and to make it easier to add support for checkpoint and load functionality for parameters. It seems like if a parameter store chooses to have each generative function store its own copy of the parameters then those goals might be hard to achieve, but it's okay if GenPyTorch doesn't support the same flexibility as JuliaParameterStore.
@alex-lew Right, thanks. I think it's reasonable for it to either (i) error if you call it more than once, or (ii) only ever add parameters and not reset the dictionary.
Right. Users already had to initialize all the parameters, so this PR doesn't introduce a significant new burden on users. But we could avoid making users register parameters at all by using a static analysis of the DML function.
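For concreteness, a sketch of the registration step being discussed; the signature of register_parameters! shown here (a generative function plus a list of parameter names) is an assumption based on the PR description rather than a confirmed API.

```julia
using Gen

@gen function regression(x::Float64)
    @param slope::Float64
    @param intercept::Float64
    return @trace(normal(slope * x + intercept, 1.0), :y)
end

# Register the parameters this DML function uses, so that
# get_parameters(regression, parameter_context) and optimizers built on it
# can find them (signature assumed).
register_parameters!(regression, [:slope, :intercept])

# Under option (ii) above, a second call would add to the registered set
# rather than replace it; under option (i), it would raise an error.
```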
@alex-lew To be clear, I think we should implement the minimal changes to GenPyTorch to make it compatible with this PR, or maybe change this PR if that's not possible for some reason, before merging this PR into master.
That could be one way of doing it -- i.e., when implementing a parameter store, you can choose to look up a list of optimizer expressions and implement the necessary methods (perhaps in a loop, using @eval). |
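A self-contained sketch of that idea, with every name (StepSizeConf, ScalarStore, UPDATE_RULES, apply_update!) hypothetical: the optimizer configuration is plain data paired with a symbolic update rule, and each store generates its concrete methods from that rule with @eval. In Gen, existing configuration types such as FixedStepGradientDescent would play the role of StepSizeConf.

```julia
struct StepSizeConf                 # stand-in for an optimizer configuration
    step_size::Float64
end

struct ScalarStore                  # stand-in for a scalar-valued parameter store
    values::Dict{Symbol,Float64}
    grads::Dict{Symbol,Float64}
end

# One "optimizer expression" per configuration type: the mathematical update,
# written once, independently of any particular store implementation.
const UPDATE_RULES = [
    (StepSizeConf, :(value + conf.step_size * grad)),   # gradient ascent step
]

# The store implements the required methods by splicing the rules in with @eval.
for (conf_type, rule) in UPDATE_RULES
    @eval function apply_update!(conf::$conf_type, store::ScalarStore, name::Symbol)
        value = store.values[name]
        grad = store.grads[name]
        store.values[name] = $rule   # the store-agnostic update rule
        store.grads[name] = 0.0      # reset the gradient accumulator
        return store.values[name]
    end
end

# Example: starting at w = 0.0 with accumulated gradient 2.5 and step size 0.1,
# the update sets w to 0.25.
store = ScalarStore(Dict(:w => 0.0), Dict(:w => 2.5))
apply_update!(StepSizeConf(0.1), store, :w)
```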
I am asking specifically about the fact that there is a default parameter context, which references a default JuliaParameterStore.
So, if called without a custom parameter context, where should a PyTorch generative function look up its parameters? Is the idea that a package like GenPyTorch should mutate the default parameter context, extending it with a new key? Or is it an error to call a GenPyTorch GF without providing a custom parameter context?
This seems plausible, but I would need to think about it. Currently, the user constructs a PyTorch model object, which stores some parameters, and this model object is stored within the generative function. Off the top of my head, it's unclear how to run the model object with 'different parameter objects' accessed inside the PyTorchParameterStore. That is, if the PyTorchParameterStore is not just a dummy struct, but actually stores references to parameters, I am not sure how to make the execution of the PyTorch GF actually depend on the references that are stored there. Presumably this is possible -- I'd just need to do some reading of the PyTorch docs.
Got it! I'm finishing some conference reviews this week and may not have time to do this until after vacation, maybe first week of August? Sorry for the delay! |
Note: This PR includes breaking API changes, and documentation is not yet complete for it. It was originally not ready for review; this PR is now ready for review.
Introduces "parameter contexts" and "parameter stores", so that different traces of the same generative function can use different sets of parameters. The built-in parameter store type is a
JuliaParameterStore
, but there can be other types of parameter stores, including those that contain the states of TensorFlow and PyTorch parameters. There is a default parameter context, which references a defaultJuliaParameterStore
. The GFI methods simulate, generate, propose, and assess take an optional parameter context argument.Makes accumulating gradients for
JuliaParameterStore
s a thread-safe operation, so that multiple traces can concurrently accumulate gradients for the same parameters in the same parameter store. This is useful for stochastic gradient training, where gradients for each a each trace in a mini-batch can be computed concurrently.Reduces amount of re-allocation of parameter-sized arrays during gradient accumulation and parameter updates by using in-place updates of gradient accumulators in various places.
It is now possible to construct optimizers that automatically apply to all parameters used by a generative function including in nested calls to other generative functions. This uses a new
get_parameters(gen_fn, parameter_context)
function, which is implemented for SML by statically walking the call tree, and for DML by user-provided list of all parameters via a new functionregister_parameters!
.Many other changes to API for updates.
Some code quality improvements.
Adds optional
multithreaded
flags to several inference functions, includingimportance_sampling
,black_box_vi!
, andblack_box_vimco!
.
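To make the description above concrete, here is a sketch of the intended workflow. JuliaParameterStore and register_parameters! come from this PR and accumulate_param_gradients! exists in Gen already, while init_parameter!, the Dict-based context keyed by :julia, the parameter_context keyword on generate, and the continued use of @param declarations are assumptions about the exact API.

```julia
using Gen

@gen function model(x::Float64)
    @param slope::Float64
    @param intercept::Float64
    return @trace(normal(slope * x + intercept, 0.1), :y)
end
register_parameters!(model, [:slope, :intercept])   # signature assumed

# A parameter store and a context that points generative functions at it
# (the Dict layout and init_parameter! are assumptions).
store = JuliaParameterStore()
context = Dict(:julia => store)
init_parameter!(store, (model, :slope), 0.0)
init_parameter!(store, (model, :intercept), 0.0)

# Thread-safe gradient accumulation: each trace in a mini-batch contributes
# its parameter gradients to the same store concurrently.
xs = randn(64)
ys = 2.0 .* xs .+ 1.0
Threads.@threads for i in eachindex(xs)
    trace, _weight = generate(model, (xs[i],), choicemap((:y, ys[i]));
                              parameter_context=context)
    accumulate_param_gradients!(trace)
end
# An optimizer constructed via the new update API would then consume the
# accumulated gradients in the store (constructor not shown here).
```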