NEP training not running on multiple GPU nodes #731
-
Hi GPUMD team, I was trying to obtain a scaling plot comparing training speed with an increasing number of GPUs. I got good scaling within a single GPU node (i.e., across 1, 2, and 4 GPUs), but when I submitted a job on 2 GPU nodes (i.e., 8 GPUs), the training still used only 1 node (4 GPUs). Is there a way to train a NEP model across 2 different nodes? Thank you so much!
-
Thanks for the question. There is currently only support for single-node parallelization. Extending to multiple nodes would require MPI, which we have not implemented yet.
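In practice, this means the best approach for now is to request all the GPUs of a single node for the `nep` executable. A minimal Slurm sketch is below; the job name, GPU count, and walltime are placeholders for illustration, and the exact resource flags depend on your cluster's configuration:

```bash
#!/bin/bash
# Hypothetical Slurm script (values are placeholders, adapt to your cluster).
#SBATCH --job-name=nep-train
#SBATCH --nodes=1              # NEP currently parallelizes within one node only
#SBATCH --gres=gpu:4           # request all GPUs on that node
#SBATCH --time=24:00:00

# The nep executable reads nep.in and train.xyz from the working directory
# and (per the scaling described in the question) uses the GPUs visible on
# the node; no MPI launcher such as srun/mpirun is needed.
./nep
```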