DDP Performance #37
Hi @ClashLuke, unfortunately we're not sure what causes the slowdown when increasing the number of GPUs to 8, since we only tried 2-GPU machines, but we would guess it is likely a data-loading bottleneck. You could try increasing the number of workers used by the data loader in the trainer config to see whether that reduces the bottleneck with more GPUs. In any case, this is more of a general issue with DDP and depends on your hardware. You could check out some of the PyTorch forum discussions on this topic (e.g. here) to see if any of the advice there is relevant to your setup. Hope that helps.
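For concreteness, here is a minimal sketch of what raising data-loading parallelism could look like. The exact trainer config keys in this repo aren't shown here, so the arguments below are the standard `torch.utils.data.DataLoader` options rather than keys taken from the codebase:

```python
# Sketch only: standard DataLoader knobs for hiding data-loading latency in a DDP run.
# Assumes torch.distributed has already been initialized (DistributedSampler reads
# the rank/world size from the default process group).
import torch
from torch.utils.data import DataLoader, DistributedSampler

def build_loader(dataset, batch_size):
    sampler = DistributedSampler(dataset)  # one shard of the dataset per DDP rank
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        num_workers=8,            # more worker processes per GPU
        pin_memory=True,          # faster host-to-device copies
        persistent_workers=True,  # avoid re-spawning workers every epoch
        prefetch_factor=4,        # batches prefetched per worker
    )
```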
Diamond is an interesting codebase for many experiments, so I'd love to get this fixed with you guys. I've run some timings below, which may help you identify where the bottleneck is:
Considering that the initial dataset collection is faster with multiple GPUs and that SXM4 H100s have high interconnect bandwidth, I doubt it's a PyTorch issue. This intuition is reinforced because adding …
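Timings of this kind can be gathered with a small per-rank instrumentation of the training loop that separates time spent waiting on the data loader from time spent in the forward/backward pass (compute plus the gradient all-reduce). A sketch, using generic `model`/`loader`/`optimizer` placeholders rather than the repo's actual objects, and assuming each batch is a single tensor:

```python
# Sketch: split wall-clock time per step into "waiting on the loader" vs. total step
# time on each rank. If data_time dominates, the bottleneck is data loading, not DDP.
import time
import torch

def timed_epoch(model, loader, optimizer, device):
    data_time, total_time = 0.0, 0.0
    end = time.perf_counter()
    for batch in loader:
        data_time += time.perf_counter() - end   # time blocked on the data loader

        batch = batch.to(device, non_blocking=True)
        loss = model(batch).mean()               # placeholder loss
        optimizer.zero_grad(set_to_none=True)
        loss.backward()                          # DDP overlaps the all-reduce here
        optimizer.step()
        torch.cuda.synchronize(device)           # make GPU time visible to the host clock

        total_time += time.perf_counter() - end  # data + compute + communication
        end = time.perf_counter()
    print(f"data {data_time:.1f}s / total {total_time:.1f}s this epoch")
```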
I've tried reproducing your results, which works well when running on one GPU. It just takes a long time to train. So, your recent DDP addition was an exciting new feature I had to try out.
Unfortunately, scaling from 1x H100 to 8x H100 (within the same DGX H100 node) decreases throughput from 8.71 it/s to 0.61 it/s. Even assuming each step now does 8x as much work, that is still slower than the single-GPU baseline.
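A quick back-of-the-envelope check of those numbers, assuming the world size is 8 and each DDP step processes 8x the samples of a single-GPU step:

```python
# 8 ranks at 0.61 it/s is ~4.88 single-GPU-equivalent it/s, i.e. ~0.56x the baseline.
single_gpu = 8.71    # it/s on 1x H100
ddp_per_rank = 0.61  # it/s on 8x H100
world_size = 8
effective = ddp_per_rank * world_size
print(f"effective {effective:.2f} it/s vs {single_gpu} it/s -> {effective / single_gpu:.2f}x")
```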
Did I miss some config options?