-
Hello, I am wondering about the same thing.
-
So building PyTorch from source does produce significantly improved performance with OpenMP parallelization compared to the general-release CPU-only wheel. However, MPI parallelization does not appear to be present. Is nequip training MPI-parallelized at all? Thank you.
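For reference, a minimal sketch (assuming a reasonably recent PyTorch; the exact output format may vary) of how one can inspect which threading backends a given PyTorch build ships with before deciding whether a from-source rebuild is needed:

```python
import torch

# Inspect how this PyTorch build handles intra-op parallelism
# (reports the OpenMP / MKL / thread-pool configuration it was compiled with).
print(torch.__config__.parallel_info())

# Number of threads currently used for intra-op parallelism on CPU.
print("intra-op threads:", torch.get_num_threads())
```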
-
I can confirm that there is no MPI parallelization on CPUs; training runs as a single instance only. The best performance comes from properly setting the intra-op and inter-op parallelism, the OMP thread count, etc., as sketched below.
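A minimal sketch of the kind of tuning meant here; the thread counts are placeholders and should be matched to your core count and job layout:

```python
import os

# OMP_NUM_THREADS must be set before PyTorch initializes its thread pools,
# typically in the job script; shown here only for illustration.
os.environ.setdefault("OMP_NUM_THREADS", "16")

import torch

# Intra-op parallelism: threads used within a single operator (OpenMP/MKL).
torch.set_num_threads(16)

# Inter-op parallelism: threads used to run independent operators concurrently.
# Must be called before any inter-op parallel work has started.
torch.set_num_interop_threads(2)
```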
-
Multi-CPU training with MPI is not available in our code; we primarily target GPUs and strongly recommend that general users use them. As mentioned above, we inherit thread-based CPU parallelism from PyTorch's use of OpenMP / MKL, which seems to require a from-source build to enable. MPI multi-node parallelization of model training is not available. Model inference for Allegro can be parallelized over CPU MPI ranks in LAMMPS, exactly the same way as over multiple GPUs. The closest thing to MPI we have is the prototype of multi-GPU training available on a separate development branch.
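To see concretely what a given PyTorch install offers for distributed training (independent of nequip itself), you can query the available `torch.distributed` backends. Stock wheels generally include Gloo and NCCL where applicable, while the MPI backend is only compiled in when PyTorch is built from source against an MPI library. A small sketch:

```python
import torch.distributed as dist

# Whether this PyTorch build has torch.distributed support at all.
print("distributed available:", dist.is_available())

if dist.is_available():
    # The MPI backend requires building PyTorch from source against MPI.
    print("MPI backend:", dist.is_mpi_available())
    print("Gloo backend:", dist.is_gloo_available())
    print("NCCL backend:", dist.is_nccl_available())
```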
-
Hello,
I hope you are doing well. Apologies if this is a bit of a simple question, but I was trying to use mpirun when training with nequip. My hope is to reduce the training time by using multiple MPI ranks. Does this require building torch from source and building it against MPI, etc.?
Thank you very much.