CPU Parallelization #197
-
Hi, I don't have access to GPUs, so I have been trying to train a model on CPUs with OpenMP parallelization. I have tried different values for the `OMP_NUM_THREADS` and `MKL_NUM_THREADS` environment variables, as recommended in some PyTorch OpenMP discussion threads, but every combination still seems to take over an hour (sometimes even two) per epoch. Thanks.
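For reference, a minimal sketch of how those environment variables are typically set from inside a Python script (the value `6` is just a placeholder; tune it to your physical core count):

```python
import os

# OMP_NUM_THREADS / MKL_NUM_THREADS are read by the OpenMP and MKL
# runtimes when they initialize, so they must be set before importing
# torch (or numpy). The value "6" is a placeholder, not a recommendation.
os.environ["OMP_NUM_THREADS"] = "6"
os.environ["MKL_NUM_THREADS"] = "6"

# import torch  # only import after the variables are in place
```

Exporting the same variables in the shell before launching training is equivalent.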
Replies: 5 comments 3 replies
-
Hi @atulcthakur,

(A side note: we recommend starting from ….)

Are you also using …?

One thing you can do to get a bit of a sense of timing is to run …. When you run with ….

I previously tried this at one point on a pretty powerful AMD CPU, and I think I had to set some environment variable to convince MKL to run efficiently (see https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AMD-Ryzen-and-Threadripper-CPU-s-Effectively-for-Python-Numpy-And-Other-Applications-1637/); this could be relevant if you are on AMD.

Overall, CPU performance for training is not something we've really looked at, because it is a fairly unusual use case, both for us and for PyTorch itself.
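For anyone on AMD who follows that link: the workaround it describes is the undocumented `MKL_DEBUG_CPU_TYPE` variable (removed in MKL 2020.1 and later), which forces MKL onto its fast AVX2 code path on non-Intel CPUs. A sketch:

```python
import os

# Workaround from the Puget Systems article linked above: older MKL
# builds (before 2020.1) pick a slow code path on AMD CPUs, and this
# undocumented variable forces the fast AVX2 path instead.
# It must be set before MKL is loaded (i.e. before importing torch/numpy).
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"
```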
-
Hi @Linux-cpp-lisp, thanks for your response. I am training with ….

When I run with ….

As far as the AMD SIMD throttling thing goes, I'm working on an Intel platform, so it is unlikely to help me, though I am inclined to give it a shot anyway.

As a follow-up, I wanted to ask whether 8000 frames is too much training data for NequIP. My training data is pretty diverse (sampled from AIMD simulations over a broad range of temperatures), so I can scale it down to 2000 frames or even fewer.
-
Re num threads, I believe that past a certain point MKL will just ignore you if you are asking for more threads than makes sense. I'm guessing your system has 6 physical cores/12 logical cores? If you are maxing out CPU usage there is nothing to gain by going higher... and I suspect that MKL is also just ignoring you when you go lower too 😄
No gain expected then...
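To check the physical/logical split being guessed at here, a quick stdlib sketch (`os.cpu_count()` reports logical CPUs; the affinity call, where available, shows how many CPUs the process may actually use):

```python
import os

logical = os.cpu_count()  # logical (hyperthreaded) CPU count
print(f"logical CPUs: {logical}")

# On Linux, the scheduler affinity mask shows how many CPUs this
# process is actually allowed to run on (can be fewer under cgroups).
if hasattr(os, "sched_getaffinity"):
    print(f"usable CPUs: {len(os.sched_getaffinity(0))}")
```

Physical core count is not exposed by the stdlib; `psutil.cpu_count(logical=False)` or `lscpu` can report it.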
cc @simonbatzner, but can you also provide a little more detail? This will depend strongly on the chemical, reactive, etc. complexity of your system. The amount of data needed for a monometallic bulk will be very different than for some chemically complex diffusive system, for example.
-
@atulcthakur re num threads, I completely forgot, but there is a PyTorch option as well: …. As well as ….
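The option names above were lost; my guess (an assumption, not confirmed by the thread) is that the PyTorch knobs meant are `torch.set_num_threads` and `torch.set_num_interop_threads`:

```python
import torch

# Assumed to be the options meant above (the exact names were lost in
# the thread): intra-op threads parallelize within a single operator,
# inter-op threads run independent operators concurrently.
# set_num_interop_threads must be called before any parallel work starts,
# so do this at the very top of the training script.
torch.set_num_interop_threads(2)
torch.set_num_threads(6)

print(torch.get_num_threads())          # 6
print(torch.get_num_interop_threads())  # 2
```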
-
Latest discussion on CPUs can be found here: #303 (comment)