On the optimizer step, I get the following output.
Time taken by single-stage pipeline: 2.0447990000000003
Time per stage in pipeline: 0.5839989999999989
Throughput increase (compared to single machine): 3.5013741461886134
[Note that single-machine and (4)-machine DP might not fit given memory constraints]
Throughput increase of (4)-machine DP compared to single machine: 3.62130052703585
Throughput increase (compared to (4)-machine DP): 0.966883063155932
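As a sanity check, the ratios in this output can be reproduced from the two reported times. The snippet below is only a sketch of that arithmetic. The variable names are mine, not the optimizer's, and it assumes the "throughput increase" figures are plain ratios of the printed values.

```python
# Values copied from the optimizer output above.
single_stage_time = 2.0447990000000003   # "Time taken by single-stage pipeline"
time_per_stage = 0.5839989999999989      # "Time per stage in pipeline"
dp_speedup = 3.62130052703585            # 4-machine DP vs single machine

# Assumption: pipeline speedup is single-stage time divided by per-stage time.
pipeline_speedup = single_stage_time / time_per_stage

# Assumption: the last line of the output is the ratio of the two speedups.
pipeline_vs_dp = pipeline_speedup / dp_speedup

print(f"pipeline vs single machine: {pipeline_speedup:.4f}")
print(f"pipeline vs 4-machine DP:   {pipeline_vs_dp:.4f}")
```

Both results agree with the printed values (about 3.50 and 0.97), so the optimizer predicts the straight pipeline to be roughly 3% slower than 4-machine DP.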
So, my expectation was that the straight pipeline timings would be roughly similar to the DP timings.
However, the experimental results were drastically different from the prediction for the pipeline, while they matched the prediction for data parallel almost perfectly.
The main reason is that the Gloo peer-to-peer communication primitives are not well optimized. I am hopeful that this problem will at least partially go away when the PyTorch folks upstream the NCCL send and recv primitives, and we can potentially switch to using NCCL throughout.
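To illustrate why slow peer-to-peer communication hurts a pipeline this much, here is a toy model, not a measurement and not how the optimizer itself accounts for communication. It assumes four perfectly balanced stages that each take the reported per-stage time, and that a stage cannot overlap its send/recv with compute. The communication stall values are hypothetical.

```python
def pipeline_speedup(single_time, stage_compute, stage_comm):
    """Steady-state speedup over a single machine in a simplified model.

    Assumes each stage serializes its compute and its communication stall,
    so the slowest stage bounds the throughput of the whole pipeline.
    """
    bottleneck = max(c + m for c, m in zip(stage_compute, stage_comm))
    return single_time / bottleneck


single_time = 2.0447990000000003          # from the optimizer output
stage_compute = [0.5839989999999989] * 4  # assumption: 4 balanced stages

# Hypothetical per-stage communication stalls (same time units as above).
for comm in (0.0, 0.2, 0.6):
    speedup = pipeline_speedup(single_time, stage_compute, [comm] * 4)
    print(f"comm stall {comm:.1f} -> speedup {speedup:.2f}x")
```

With no stall the model reproduces the predicted ~3.5x. Once the stall is comparable to the per-stage compute time, the speedup drops to about half of that, which is the kind of gap between predicted and measured pipeline timings described in this issue.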
Hi,
I have been running ResNet-101 with batch size 64 on a straight pipeline with 4 GPUs.
I ran the following commands for the profiler and the optimizer.
I'd be very grateful if you could help me figure out this discrepancy.
I have drawn Gantt charts for:
pipeline https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing
data parallel https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing
Each horizontal bar represents the time period from the start to the end of a fwd (or bwd) pass, annotated with + (or -).
It looks to me like each stage is stalled on communication for a considerable period of time.