
make EnergyModel picklable #126

Closed
felixmusil wants to merge 9 commits into develop

Conversation

felixmusil

Description

Models created with EnergyModel are currently not picklable. For instance, a deepcopy of such a model fails on the current master branch (see below for an example).

The problem comes from using activation functions directly from torch.nn.functional (which are not picklable) instead of their counterparts in torch.nn (which are torch.nn.Module subclasses and hence picklable).
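
For illustration, a minimal sketch of the kind of substitution involved (the dictionary and names here are hypothetical; the actual call sites in nequip differ):

import torch

# Before: bare functions from torch.nn.functional stored on the model,
# which (per the description above) break pickling of the whole model.
# nonlinearities = {"silu": torch.nn.functional.silu, "tanh": torch.tanh}

# After: the equivalent torch.nn.Module instances, which participate in
# the normal Module serialization machinery and keep the model picklable.
nonlinearities = {"silu": torch.nn.SiLU(), "tanh": torch.nn.Tanh()}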

from copy import deepcopy
from nequip.model import model_from_config
import yaml

conf = """
model:
  model_builders:
      - EnergyModel
      - ForceOutput
  BesselBasis_trainable: true
  PolynomialCutoff_p: 6
  chemical_symbol_to_type:
      H: 0
      O: 1
  num_types: 2
  type_names: 
      - H
      - O
  avg_num_neighbors: null
  chemical_embedding_irreps_out: 16x0e
  conv_to_output_hidden_irreps_out: 32x0e
  feature_irreps_hidden: 32x0o + 32x0e + 16x1o + 16x1e
  invariant_layers: 2
  invariant_neurons: 64
  irreps_edge_sh: 0e + 1o
  nonlinearity_gates:
    e: ssp
    o: abs
  nonlinearity_scalars:
    e: ssp
    o: tanh
  nonlinearity_type: gate
  num_basis: 8
  num_layers: 2
  r_max: 2.5
  resnet: true
  use_sc: true
"""
config = yaml.safe_load(conf)

model = model_from_config(config['model'], initialize=True)
deepcopy(model)  # fails on current master: the model is not picklable

Motivation and Context

A picklable model can be saved easily and used in multiprocessing contexts.

How Has This Been Tested?

The code above runs without errors, and the existing tests pass.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds or improves functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation improvement (updates to user guides, docstrings, or developer docs)

Checklist:

  • My code follows the code style of this project and has been formatted using black.
  • All new and existing tests passed.
  • I have added tests that cover my changes (if relevant).
  • The option documentation (docs/options) has been updated with new or changed options.
  • I have updated CHANGELOG.md.
  • I have updated the documentation (if relevant).

@Linux-cpp-lisp changed the base branch from main to develop on December 10, 2021 at 20:25
@simonbatzner
Collaborator

Thanks @felixmusil for the PR; that all looks good. Have you run explicit checks that torch.nn.functional.x and torch.nn.X give identical answers?

Since it seems like you're dealing with pretty low-level stuff in the code, I just wanted to say: if you have any other questions, please feel free to reach out to us any time.

Best,
Simon

@Linux-cpp-lisp
Collaborator

Hi @felixmusil,

Thanks for the PR. Beyond Simon's point: while I'm happy to merge this, I want to be clear that we dropped the use of pickling for models, and any commitment to support it, as of 0.4.0.

The reason for this is that it carries a lot of internal noise, unnecessarily constrains development, and introduces a strong dependence of saved models on the internals of library code even when nothing fundamental has changed. (This is particularly relevant to e3nn, whose internals often change even when the interfaces are stable.)

The new supported approach is to rebuild the model from the configuration wherever you need it, save the state_dict() of the model you want to send or save, and reload that into the rebuilt model. This is how our trainer now saves model state; it has the benefit of being unaffected by internal library details, and it is easy to open, inspect, and modify.
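
For concreteness, a minimal sketch of that workflow, reusing the config dictionary from the example above (the file name is hypothetical):

import torch
from nequip.model import model_from_config

# Build the model from its configuration and save only the weights;
# the saved file does not depend on library internals.
model = model_from_config(config["model"], initialize=True)
torch.save(model.state_dict(), "model_weights.pth")

# Elsewhere (another process or machine): rebuild the model from the
# same config and load the saved weights into the fresh instance.
model2 = model_from_config(config["model"], initialize=True)
model2.load_state_dict(torch.load("model_weights.pth"))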

@Linux-cpp-lisp
Collaborator

Actually, also, have you tested that this actually works once this fix is included? I remember there being other things broken with pickling as well...

(Another warning: the ability to pickle e3nn models is something I implemented in that library with a lot of hacks that are somewhat sensitive to what device you built the object on and what devices are available where you load it...)

@felixmusil
Author

Thanks for the quick reply.

> Have you run explicit checks that torch.nn.functional.x and torch.nn.X give identical answers?

Running this snippet shows that the results are the same within machine precision.

import torch

x = torch.linspace(-10, 10, 100000)
for func1, func2 in [(torch.nn.functional.silu, torch.nn.SiLU()),
                     (torch.nn.functional.softplus, torch.nn.Softplus()),
                     (torch.tanh, torch.nn.Tanh())]:
    y1 = func1(x)
    y2 = func2(x)
    print(torch.allclose(y1, y2))  # True for every pair

> Actually, also, have you tested that this actually works once this fix is included? I remember there being other things broken with pickling as well...

Yes: with this fix, the failing example above runs without error.

> I want to be clear that we dropped the use of pickling for models, and any commitment to support it, as of 0.4.0.

OK, I see. I understand your point about saving, and your new saving strategy is the way to go. However, pickling is essential for multiprocessing applications, e.g. multi-GPU training.
An alternative would be to override __setstate__ and __getstate__ at some strategic points and use the preferred serialization strategy there (see https://stackoverflow.com/questions/1939058/simple-example-of-use-of-setstate-and-getstate for an example), as in the sketch below.
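
A hypothetical sketch of that idea (illustrative only, not part of this PR): the pickle payload carries the config plus a state_dict, and the model is rebuilt on load.

import io
import torch
from nequip.model import model_from_config

class PicklableModel(torch.nn.Module):
    """Wrapper whose pickle payload is the config plus the weights,
    never the live (possibly unpicklable) submodule objects."""

    def __init__(self, model_config: dict):
        super().__init__()
        self._config = model_config
        self.model = model_from_config(model_config, initialize=True)

    def forward(self, data):
        return self.model(data)

    def __getstate__(self):
        # Serialize the weights to bytes via the supported state_dict path.
        buf = io.BytesIO()
        torch.save(self.model.state_dict(), buf)
        return {"config": self._config, "weights": buf.getvalue()}

    def __setstate__(self, state):
        # Rebuild the model from its config, then restore the weights.
        self.__init__(state["config"])
        self.model.load_state_dict(torch.load(io.BytesIO(state["weights"])))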

@Linux-cpp-lisp
Collaborator

Linux-cpp-lisp commented Dec 22, 2021

> Running this snippet shows that the results are the same within machine precision.

> Yes: with this fix, the failing example above runs without error.

Great, thanks for checking.

> However, pickling is essential for multiprocessing applications, e.g. multi-GPU training.

I'm not actually sure that this is true; can you elaborate on your setup? We are beginning to look into multi-GPU training, and so far it seems to me that the most natural approach is for each process to build its own copy of the model and then distribute rank 0's model.state_dict() (which is a simple Dict[str, torch.Tensor]) using the appropriate mechanism for your parallelism library. The other ranks can then load it with model.load_state_dict(...); see the sketch below.
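
A minimal sketch of that distribution step, assuming torch.distributed as the parallelism library, an already-initialized process group, and the same config available on every rank:

import torch.distributed as dist
from nequip.model import model_from_config

# Every rank builds its own copy of the model from the shared config...
model = model_from_config(config["model"], initialize=True)

# ...then rank 0's weights are broadcast tensor-by-tensor. The broadcast
# works in place because state_dict() tensors share storage with the
# model's parameters and buffers.
for tensor in model.state_dict().values():
    dist.broadcast(tensor, src=0)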

This should also involve less IPC and IO than sending fully pickled models, because you don't have to serialize/deserialize TorchScript blobs for e3nn's generated code. (The only way I was able to make e3nn's dynamically codegen'd modules pickle-compatible was to have them serialize their compiled, optimized TorchScript cores using a __getstate__/__setstate__ approach like the one you mentioned; see here.)

@Linux-cpp-lisp
Collaborator

Hi @felixmusil, just curious whether the above worked for you or whether this is still an issue 🙂

@simonbatzner
Collaborator

Hi @felixmusil, it looks like you're making edits on this branch, but the PR is still open.

We do not wrap the atoms in NequIP, since our neighborlists respect unwrapped positions, and wrapping them would prevent simulations in which we want to keep track of unwrapped positions, e.g. when computing the diffusivity.

Shall we close this PR?

@Linux-cpp-lisp
Collaborator

Hi @felixmusil ,

I'm seeing a few more commits to this PR; is this still something you are trying to do, or can we close it? (Was rebuilding the model in child processes sufficient for you?)

I'm also seeing that you made a "remove empty dim" commit; what is its purpose, and is it related to some issue with shapes you may have found?

Thanks!

@felixmusil
Author

felixmusil commented May 5, 2022

Hi @Linux-cpp-lisp,

Sorry for disappearing on you. I reverted the unnecessary commits and updated the PR branch to the current develop, so it's finally ready to be merged, unless you have additional comments or requirements.

> I'm not actually sure that this is true; can you elaborate on your setup?

You are right; it's really tied to my use of pytorch_lightning for training. It provides many nice features, but also imposes some requirements, such as picklability for saving the state of the hyperparameters (a sketch of this is below).
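
For context, a hypothetical sketch of where this requirement shows up, assuming the model is wrapped in a LightningModule (none of this is part of nequip):

import pytorch_lightning as pl
from nequip.model import model_from_config

class NequIPLightningModule(pl.LightningModule):
    def __init__(self, model_config: dict):
        super().__init__()
        # Lightning pickles the saved hyperparameters into checkpoints,
        # so model_config must be picklable; the model itself is rebuilt
        # from the config rather than pickled.
        self.save_hyperparameters()
        self.model = model_from_config(model_config, initialize=True)

    def forward(self, data):
        return self.model(data)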

> We do not wrap the atoms in NequIP, since our neighborlists respect unwrapped positions, and wrapping them would prevent simulations in which we want to keep track of unwrapped positions, e.g. when computing the diffusivity.

I understand; I have removed that commit from the PR.

> I'm also seeing that you made a "remove empty dim" commit; what is its purpose, and is it related to some issue with shapes you may have found?

For some reason, I hit a corner case where a spurious extra dimension of size 1 was present, which was not compatible with the rest of the computations.

@Linux-cpp-lisp
Collaborator

@felixmusil sounds good, I'll take a look at this again soon here for merging.

Re multi-GPU support, take a look at this recent issue: #210. Basically, unless you are really married to PyTorch Lightning, I would recommend looking into our horovod branch, which will be the "official" solution in the future.

> For some reason, I hit a corner case where a spurious extra dimension of size 1 was present, which was not compatible with the rest of the computations.

Hm, yeah, this sounds like it could be a bug on our part... are you still able to reproduce it on the latest NequIP develop, and if so, are you willing to share a reproducing example? Not a big deal; I just want to be sure we aren't missing a bug.

@felixmusil
Author

> Re multi-GPU support, take a look at this recent issue: #210. Basically, unless you are really married to PyTorch Lightning, I would recommend looking into our horovod branch, which will be the "official" solution in the future.

Thanks for the suggestion; I will have a look at how you handled the problem. I have a few other models working with PyTorch Lightning, though, so I would like to stick with it for a while longer.

> Hm, yeah, this sounds like it could be a bug on our part... are you still able to reproduce it on the latest NequIP develop, and if so, are you willing to share a reproducing example? Not a big deal; I just want to be sure we aren't missing a bug.

I will raise an issue/PR if I manage to reproduce it.

Linux-cpp-lisp added a commit that referenced this pull request Dec 20, 2022
Co-authored-by: Félix <[email protected]>
@Linux-cpp-lisp
Collaborator

Hi @felixmusil ,

I came back to this after a while and thought I would commit it: I've added a commit off of the current develop with you as a co-author (see the branch: https://github.com/mir-group/nequip/tree/pickle).

However, for reasons I'm not sure of (call stack / module stack depth?), this has a non-negligible negative effect on speed, so I will not be merging it. I'm closing this PR for that reason, but the pickle branch is there as a version of this change off the latest develop for those who absolutely need it.

(See above for discussion of our supported state_dict() approach for saving models.)
