
make EnergyModel picklable #126

Closed
felixmusil wants to merge 9 commits into develop

Conversation

felixmusil

Description

Models created with EnergyModel are currently not picklable. For instance, a deepcopy of such a model fails on the current master branch (see below for an example).

The problem comes from using activation functions directly from torch.nn.functional (which are not picklable) instead of their counterparts in torch.nn (which are torch.nn.Module subclasses and hence picklable).
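
For illustration, a minimal sketch of the kind of substitution involved (the dictionary and names here are hypothetical; the actual call sites in nequip differ):

import torch

# Before: bare functions from torch.nn.functional stored on the model,
# which (per the description above) break pickling of the whole model.
# nonlinearities = {"silu": torch.nn.functional.silu, "tanh": torch.tanh}

# After: the equivalent torch.nn.Module instances, which participate in
# the normal Module serialization machinery and keep the model picklable.
nonlinearities = {"silu": torch.nn.SiLU(), "tanh": torch.nn.Tanh()}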

from copy import deepcopy
from nequip.model import model_from_config
import yaml

conf = """
model:
  model_builders:
      - EnergyModel
      - ForceOutput
  BesselBasis_trainable: true
  PolynomialCutoff_p: 6
  chemical_symbol_to_type:
      H: 0
      O: 1
  num_types: 2
  type_names: 
      - H
      - O
  avg_num_neighbors: null
  chemical_embedding_irreps_out: 16x0e
  conv_to_output_hidden_irreps_out: 32x0e
  feature_irreps_hidden: 32x0o + 32x0e + 16x1o + 16x1e
  invariant_layers: 2
  invariant_neurons: 64
  irreps_edge_sh: 0e + 1o
  nonlinearity_gates:
    e: ssp
    o: abs
  nonlinearity_scalars:
    e: ssp
    o: tanh
  nonlinearity_type: gate
  num_basis: 8
  num_layers: 2
  r_max: 2.5
  resnet: true
  use_sc: true
"""
config = yaml.safe_load(conf)

model = model_from_config(config['model'], initialize=True)
deepcopy(model)  # fails on current master: the model is not picklable

Motivation and Context

A picklable model can be saved easily and used in multiprocessing contexts.

How Has This Been Tested?

The code above runs without errors, and the existing tests pass.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds or improves functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation improvement (updates to user guides, docstrings, or developer docs)

Checklist:

  • My code follows the code style of this project and has been formatted using black.
  • All new and existing tests passed.
  • I have added tests that cover my changes (if relevant).
  • The option documentation (docs/options) has been updated with new or changed options.
  • I have updated CHANGELOG.md.
  • I have updated the documentation (if relevant).

@Linux-cpp-lisp changed the base branch from main to develop on December 10, 2021 at 20:25
@simonbatzner
Collaborator

Thanks @felixmusil for the PR; that all looks good. Have you run explicit checks that torch.nn.functional.x and torch.nn.X give identical answers?

Since it seems like you're dealing with pretty low-level stuff in the code, I just wanted to say: if you have any other questions, please feel free to reach out to us any time.

Best,
Simon

@Linux-cpp-lisp
Collaborator

Hi @felixmusil,

Thanks for the PR. Beyond Simon's point: while I'm happy to merge this, I want to be clear that we dropped the use of pickling for models, and any commitment to support it, as of 0.4.0.

The reason for this is that it carries a lot of internal noise, unnecessarily constrains development, and introduces a strong dependence of saved models on the internals of library code even when nothing fundamental has changed. (This is particularly relevant to e3nn, whose internals often change even when the interfaces are stable.)

The new supported approach is to rebuild the model from the configuration wherever you need it, save the state_dict() of the model you want to send or save, and reload that into the rebuilt model. This is how our trainer now saves model state; it has the benefit of being unaffected by internal library details, and it is easy to open, inspect, and modify.
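
For concreteness, a minimal sketch of that workflow, reusing the config dictionary from the example above (the file name is hypothetical):

import torch
from nequip.model import model_from_config

# Build the model from its configuration and save only the weights;
# the saved file does not depend on library internals.
model = model_from_config(config["model"], initialize=True)
torch.save(model.state_dict(), "model_weights.pth")

# Elsewhere (another process or machine): rebuild the model from the
# same config and load the saved weights into the fresh instance.
model2 = model_from_config(config["model"], initialize=True)
model2.load_state_dict(torch.load("model_weights.pth"))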

@Linux-cpp-lisp
Collaborator

Actually, also, have you tested that this actually works once this fix is included? I remember there being other things broken with pickling as well...

(Another warning: the ability to pickle e3nn models is something I implemented in that library with a lot of hacks that are somewhat sensitive to what device you built the object on and what devices are available where you load it...)

@felixmusil
Author

Thanks for the quick reply.

> Have you run explicit checks that torch.nn.functional.x and torch.nn.X give identical answers?

Running this snippet shows that the results are the same within machine precision.

import torch

x = torch.linspace(-10, 10, 100000)
for func1, func2 in [(torch.nn.functional.silu, torch.nn.SiLU()),
                     (torch.nn.functional.softplus, torch.nn.Softplus()),
                     (torch.tanh, torch.nn.Tanh())]:
    y1 = func1(x)
    y2 = func2(x)
    print(torch.allclose(y1, y2))  # True for every pair

> Actually, also, have you tested that this actually works once this fix is included? I remember there being other things broken with pickling as well...

Yes: with this fix, the failing example above runs without error.

> I want to be clear that we dropped the use of pickling for models, and any commitment to support it, as of 0.4.0.

OK, I see. I understand your point about saving, and your new saving strategy is the way to go. However, pickling is essential for multiprocessing applications, e.g. multi-GPU training.
An alternative would be to override __setstate__ and __getstate__ at some strategic points and use the preferred serialization strategy there (see https://stackoverflow.com/questions/1939058/simple-example-of-use-of-setstate-and-getstate for an example), as in the sketch below.
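
A hypothetical sketch of that idea (illustrative only, not part of this PR): the pickle payload carries the config plus a state_dict, and the model is rebuilt on load.

import io
import torch
from nequip.model import model_from_config

class PicklableModel(torch.nn.Module):
    """Wrapper whose pickle payload is the config plus the weights,
    never the live (possibly unpicklable) submodule objects."""

    def __init__(self, model_config: dict):
        super().__init__()
        self._config = model_config
        self.model = model_from_config(model_config, initialize=True)

    def forward(self, data):
        return self.model(data)

    def __getstate__(self):
        # Serialize the weights to bytes via the supported state_dict path.
        buf = io.BytesIO()
        torch.save(self.model.state_dict(), buf)
        return {"config": self._config, "weights": buf.getvalue()}

    def __setstate__(self, state):
        # Rebuild the model from its config, then restore the weights.
        self.__init__(state["config"])
        self.model.load_state_dict(torch.load(io.BytesIO(state["weights"])))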

@Linux-cpp-lisp
Collaborator

Linux-cpp-lisp commented Dec 22, 2021

> Running this snippet shows that the results are the same within machine precision.

> Yes: with this fix, the failing example above runs without error.

Great, thanks for checking.

> However, pickling is essential for multiprocessing applications, e.g. multi-GPU training.

I'm not actually sure that this is true; can you elaborate on your setup? We are beginning to look into multi-GPU training, and so far it seems to me that the most natural approach is for each process to build its own copy of the model and then distribute rank 0's model.state_dict() (which is a simple Dict[str, torch.Tensor]) using the appropriate mechanism for your parallelism library. The other ranks can then load it with model.load_state_dict(...); see the sketch below.
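
A minimal sketch of that distribution step, assuming torch.distributed as the parallelism library, an already-initialized process group, and the same config available on every rank:

import torch.distributed as dist
from nequip.model import model_from_config

# Every rank builds its own copy of the model from the shared config...
model = model_from_config(config["model"], initialize=True)

# ...then rank 0's weights are broadcast tensor-by-tensor. The broadcast
# works in place because state_dict() tensors share storage with the
# model's parameters and buffers.
for tensor in model.state_dict().values():
    dist.broadcast(tensor, src=0)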

This should also involve less IPC and IO than sending fully pickled models, because you don't have to serialize/deserialize TorchScript blobs for e3nn's generated code. (The only way I was able to make e3nn's dynamically codegen'd modules pickle-compatible was to have them serialize their compiled, optimized TorchScript cores using a __getstate__/__setstate__ approach like the one you mentioned; see here.)

@Linux-cpp-lisp
Collaborator

Hi @felixmusil, just curious whether the above worked for you or whether this is still an issue 🙂

@simonbatzner
Collaborator

Hi @felixmusil, it looks like you're making edits on this branch, but the PR is still open.

We do not wrap the atoms in NequIP, since our neighborlists respect unwrapped positions, and wrapping them would prevent simulations in which we want to keep track of unwrapped positions, e.g. when computing the diffusivity.

Shall we close this PR?

@Linux-cpp-lisp
Collaborator

Hi @felixmusil ,

I'm seeing a few more commits to this PR; is this still something you are trying to do, or can we close it? (Was rebuilding the model in child processes sufficient for you?)

I'm also seeing that you made a "remove empty dim" commit; what is its purpose, and is it related to some issue with shapes you may have found?

Thanks!

@felixmusil
Author

felixmusil commented May 5, 2022

Hi @Linux-cpp-lisp,

Sorry for disappearing on you. I reverted the unnecessary commits and updated the PR branch to the current develop, so it's finally ready to be merged, unless you have additional comments or requirements.

> I'm not actually sure that this is true; can you elaborate on your setup?

You are right; it's really tied to my use of pytorch_lightning for training. It provides many nice features, but also imposes some requirements, such as picklability for saving the state of the hyperparameters (a sketch of this is below).
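
For context, a hypothetical sketch of where this requirement shows up, assuming the model is wrapped in a LightningModule (none of this is part of nequip):

import pytorch_lightning as pl
from nequip.model import model_from_config

class NequIPLightningModule(pl.LightningModule):
    def __init__(self, model_config: dict):
        super().__init__()
        # Lightning pickles the saved hyperparameters into checkpoints,
        # so model_config must be picklable; the model itself is rebuilt
        # from the config rather than pickled.
        self.save_hyperparameters()
        self.model = model_from_config(model_config, initialize=True)

    def forward(self, data):
        return self.model(data)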

> We do not wrap the atoms in NequIP, since our neighborlists respect unwrapped positions, and wrapping them would prevent simulations in which we want to keep track of unwrapped positions, e.g. when computing the diffusivity.

I understand; I have removed that commit from the PR.

> I'm also seeing that you made a "remove empty dim" commit; what is its purpose, and is it related to some issue with shapes you may have found?

For some reason, I hit a corner case where a spurious extra dimension of size 1 was present, which was not compatible with the rest of the computations.

@Linux-cpp-lisp
Collaborator

@felixmusil sounds good, I'll take a look at this again soon here for merging.

Re multi-GPU support, take a look at this recent issue: #210. Basically, unless you are really married to PyTorch Lightning, I would recommend looking into our horovod branch, which will be the "official" solution in the future.

> For some reason, I hit a corner case where a spurious extra dimension of size 1 was present, which was not compatible with the rest of the computations.

Hm, yeah, this sounds like it could be a bug on our part... are you still able to reproduce it on the latest NequIP develop, and if so, are you willing to share a reproducing example? Not a big deal; I just want to be sure we aren't missing a bug.

@felixmusil
Author

> Re multi-GPU support, take a look at this recent issue: #210. Basically, unless you are really married to PyTorch Lightning, I would recommend looking into our horovod branch, which will be the "official" solution in the future.

Thanks for the suggestion; I will have a look at how you handled the problem. I have a few other models working with PyTorch Lightning, though, so I would like to stick with it for a while longer.

> Hm, yeah, this sounds like it could be a bug on our part... are you still able to reproduce it on the latest NequIP develop, and if so, are you willing to share a reproducing example? Not a big deal; I just want to be sure we aren't missing a bug.

I will raise an issue/PR if I manage to reproduce it.

Linux-cpp-lisp added a commit that referenced this pull request Dec 20, 2022
Co-authored-by: Félix <[email protected]>
@Linux-cpp-lisp
Collaborator

Hi @felixmusil ,

I came back to this after a while and thought I would commit it: I've added a commit off of the current develop with you as a co-author (see the branch: https://github.com/mir-group/nequip/tree/pickle).

However, for reasons I'm not sure of (call stack / module stack depth?), this has a non-negligible negative effect on speed, so I will not be merging it. I'm closing this PR for that reason, but the pickle branch is there as a version of this change off the latest develop for those who absolutely need it.

(See above for discussion of our supported state_dict() approach for saving models.)
