DDP Performance #37

Open
ClashLuke opened this issue Dec 16, 2024 · 2 comments

@ClashLuke

I've tried reproducing your results, which works well on a single GPU; it just takes a long time to train. So your recent DDP addition was an exciting new feature I had to try out.
Unfortunately, scaling from 1x H100 to 8x H100 (within the same DGX-H100 node) drops throughput from 8.71 it/s to 0.61 it/s. Even assuming each step now does 8x as much work, that is an effective 4.88 it/s, which is still slower than the single-GPU baseline.

Did I miss some config options?

@AdamJelley
Collaborator

Hi @ClashLuke, unfortunately we're not sure what causes the slowdown when increasing the number of GPUs to 8, since we only tested on 2-GPU machines, but our guess is that it's a data-loading bottleneck. You could try increasing the number of workers used by the data loader in the trainer config to see whether that relieves the bottleneck with more GPUs.
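
For example, assuming the trainer builds a standard PyTorch DataLoader (the dataset, batch size, and exact values below are placeholders, not the actual names in this repo), the relevant knobs would look roughly like this:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the trainer's replay dataset (illustrative only).
dataset = TensorDataset(torch.randn(1024, 3, 64, 64))

# The main knob is num_workers: with 8 GPUs, each rank's loader may need more
# worker processes to keep the GPUs fed.
loader = DataLoader(
    dataset,
    batch_size=32,            # per-GPU batch size (illustrative value)
    shuffle=True,
    num_workers=8,            # try scaling this up with the number of GPUs
    pin_memory=True,          # faster host-to-device copies
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=4,        # batches each worker prefetches ahead of time
)
```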

In any case, this is more of a general issue with using DDP and depends on your hardware. You could check out some of the PyTorch forum discussions on this topic (e.g. here) to see if some of the advice there is relevant to your setup. Hope that helps.
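
If you want to confirm whether the loader or the compute is the limiting factor, a rough check is to time the two separately for a few steps. This is only a sketch: loader and step_fn stand in for whatever the trainer actually uses.

```python
import time
import torch

def profile_steps(loader, step_fn, num_steps=50):
    """Rough split of wall time between data loading and compute.

    loader and step_fn are placeholders: any iterable of batches and any
    callable that runs forward/backward/optimizer on one batch.
    """
    data_time = compute_time = 0.0
    batches = iter(loader)
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(batches)
        data_time += time.perf_counter() - t0

        t0 = time.perf_counter()
        step_fn(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # make async GPU work visible to the timer
        compute_time += time.perf_counter() - t0

    print(f"data loading: {data_time:.2f}s, compute: {compute_time:.2f}s")
```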

@ClashLuke
Author

Diamond is an interesting codebase for many experiments, so I'd love to work with you to get this fixed. I've run some timings, listed below, which may help identify where the bottleneck is:

100 environments, 1 H100, mean of 4 epochs
Denoiser: 8.67 it/s
Reward End Model: 14.70 it/s
Actor Critic: 2.73 it/s
Initial Dataset Collection: 1314.99 it/s

100 environments, 8 H100s, first epoch
Denoiser: 1.79 s/it (≈0.56 it/s)
Initial Dataset Collection: 1933.13 it/s

10 environments, 1 H100, mean of 4 epochs
Denoiser: 9.15 it/s
Reward End Model: 14.79 it/s
Actor Critic: 2.74 it/s
Initial Dataset Collection: 586.55 it/s

10 environments, 8 H100s, first epoch
Denoiser: 0.625 it/s
Initial Dataset Collection: 670.88 it/s

1 environment, 1 H100, mean of 4 epochs
Denoiser: 8.89 it/s
Reward End Model: 14.65 it/s
Actor Critic: 0.286 it/s
Initial Dataset Collection: 82.09 it/s

1 environment, 8 H100s, first epoch
Denoiser: 0.58 it/s
Initial Dataset Collection: 117.21 it/s

Considering that the initial dataset collection is faster with multiple GPUs and that the SXM H100s in a DGX node have high interconnect bandwidth, I doubt it's a PyTorch issue. This intuition is reinforced by the fact that adding gradient_as_bucket_view=True and static_graph=True to the DDP wrapper, and/or using fused AdamW, does not improve the speed here, even though those changes help in nearly every other codebase.
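
For reference, this is roughly what those variants look like; the model, learning rate, and launch setup below are placeholders, not the actual diamond modules or config:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the job is launched via torchrun, which sets the env vars that
# init_process_group reads.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Placeholder module standing in for the denoiser / actor-critic networks.
model = torch.nn.Linear(1024, 1024).cuda()

# The two DDP options mentioned above.
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    gradient_as_bucket_view=True,  # reuse gradient buckets as .grad storage
    static_graph=True,             # assume the same graph every iteration
)

# Fused AdamW: runs the optimizer step as fused CUDA kernels.
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4, fused=True)
```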
