DDP Performance #37

Open
ClashLuke opened this issue Dec 16, 2024 · 2 comments

@ClashLuke

I've tried reproducing your results, which works well on a single GPU; it just takes a long time to train. So your recent DDP addition was an exciting new feature I had to try out.
Unfortunately, scaling from 1x H100 to 8x H100 (within the same DGX-H100 node) drops throughput from 8.71 it/s to 0.61 it/s. Even assuming each step now does 8x as much work, that is an effective 4.88 it/s, which is still slower than the single-GPU baseline.

Did I miss some config options?

@AdamJelley
Collaborator

Hi @ClashLuke, unfortunately we're not sure what causes the slowdown when increasing the number of GPUs to 8, since we only tested on 2-GPU machines, but our guess is that it's a data-loading bottleneck. You could try increasing the number of workers used by the data loader in the trainer config to see whether that relieves the bottleneck with more GPUs.
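
For example, assuming the trainer builds a standard PyTorch DataLoader (the dataset, batch size, and exact values below are placeholders, not the actual names in this repo), the relevant knobs would look roughly like this:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the trainer's replay dataset (illustrative only).
dataset = TensorDataset(torch.randn(1024, 3, 64, 64))

# The main knob is num_workers: with 8 GPUs, each rank's loader may need more
# worker processes to keep the GPUs fed.
loader = DataLoader(
    dataset,
    batch_size=32,            # per-GPU batch size (illustrative value)
    shuffle=True,
    num_workers=8,            # try scaling this up with the number of GPUs
    pin_memory=True,          # faster host-to-device copies
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=4,        # batches each worker prefetches ahead of time
)
```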

In any case, this is more of a general issue with using DDP and depends on your hardware. You could check out some of the PyTorch forum discussions on this topic (e.g. here) to see if some of the advice there is relevant to your setup. Hope that helps.
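
If you want to confirm whether the loader or the compute is the limiting factor, a rough check is to time the two separately for a few steps. This is only a sketch: loader and step_fn stand in for whatever the trainer actually uses.

```python
import time
import torch

def profile_steps(loader, step_fn, num_steps=50):
    """Rough split of wall time between data loading and compute.

    loader and step_fn are placeholders: any iterable of batches and any
    callable that runs forward/backward/optimizer on one batch.
    """
    data_time = compute_time = 0.0
    batches = iter(loader)
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(batches)
        data_time += time.perf_counter() - t0

        t0 = time.perf_counter()
        step_fn(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # make async GPU work visible to the timer
        compute_time += time.perf_counter() - t0

    print(f"data loading: {data_time:.2f}s, compute: {compute_time:.2f}s")
```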

@ClashLuke
Author

Diamond is an interesting codebase for many experiments, so I'd love to work with you to get this fixed. I've run some timings, listed below, which may help identify where the bottleneck is:

100 environments, 1 H100, mean of 4 epochs
Denoiser: 8.67 it/s
Reward End Model: 14.70 it/s
Actor Critic: 2.73 it/s
Initial Dataset Collection: 1314.99 it/s

100 environments, 8 H100s, first epoch
Denoiser: 1.79 s/it (≈0.56 it/s)
Initial Dataset Collection: 1933.13 it/s

10 environments, 1 H100, mean of 4 epochs
Denoiser: 9.15 it/s
Reward End Model: 14.79 it/s
Actor Critic: 2.74 it/s
Initial Dataset Collection: 586.55 it/s

10 environments, 8 H100s, first epoch
Denoiser: 0.625 it/s
Initial Dataset Collection: 670.88 it/s

1 environment, 1 H100, mean of 4 epochs
Denoiser: 8.89 it/s
Reward End Model: 14.65 it/s
Actor Critic: 0.286 it/s
Initial Dataset Collection: 82.09 it/s

1 environment, 8 H100s, first epoch
Denoiser: 0.58 it/s
Initial Dataset Collection: 117.21 it/s

Considering that the initial dataset collection is faster with multiple GPUs and that the SXM H100s in a DGX node have high interconnect bandwidth, I doubt it's a PyTorch issue. This intuition is reinforced by the fact that adding gradient_as_bucket_view=True and static_graph=True to the DDP wrapper, and/or using fused AdamW, does not improve the speed here, even though those changes help in nearly every other codebase.
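
For reference, this is roughly what those variants look like; the model, learning rate, and launch setup below are placeholders, not the actual diamond modules or config:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the job is launched via torchrun, which sets the env vars that
# init_process_group reads.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Placeholder module standing in for the denoiser / actor-critic networks.
model = torch.nn.Linear(1024, 1024).cuda()

# The two DDP options mentioned above.
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    gradient_as_bucket_view=True,  # reuse gradient buckets as .grad storage
    static_graph=True,             # assume the same graph every iteration
)

# Fused AdamW: runs the optimizer step as fused CUDA kernels.
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4, fused=True)
```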
