
Actual results did not match the optimizer expectation #50

Open
nirandaperera opened this issue Jun 8, 2020 · 1 comment

@nirandaperera

Hi,

I have been running resnet101 with a batch size of 64 on a straight pipeline with 4 GPUs.

I ran the following commands for the profiler and the optimizer.

CUDA_VISIBLE_DEVICES=4 python main.py -a "resnet101"  -b 64 --data_dir "$HOME/data/imagenet-mini/" --profile_directory "profiles1/64"

python optimizer_graph_hierarchical.py -f "../profiler/image_classification/profiles1/64/resnet101/graph.txt" -n 4 -s 11000000000 --straight_pipeline -o "./optim/64/resnet101/gpus=4_straight" -b 2500000000 --use_memory_constraint
python convert_graph_to_model.py -f "./optim/64/resnet101/gpus=4_straight/gpus=4.txt" -n resnet101 -a resnet101 -o "./optim/64/resnet101/gpus=4_straight/"

At the optimizer step, I get the following output:

Time taken by single-stage pipeline: 2.0447990000000003
Time per stage in pipeline: 0.5839989999999989
Throughput increase (compared to single machine): 3.5013741461886134
[Note that single-machine and (4)-machine DP might not fit given memory constraints]
Throughput increase of (4)-machine DP compared to single machine: 3.62130052703585
Throughput increase (compared to (4)-machine DP): 0.966883063155932
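
If I read the output right, the reported speedups follow directly from the times above; here is the arithmetic as I understand it (the variable names below are mine, not the optimizer's):

```python
# Arithmetic behind the optimizer output above (variable names are mine, not the optimizer's).
single_stage_time = 2.0447990000000003  # time of the single-stage pipeline
time_per_stage    = 0.5839989999999989  # time of the slowest stage in the 4-stage pipeline
dp_speedup        = 3.62130052703585    # reported (4)-machine DP speedup

pipeline_speedup = single_stage_time / time_per_stage
print(pipeline_speedup)               # ~3.5013741461886134 (vs. single machine)
print(pipeline_speedup / dp_speedup)  # ~0.966883063155932  (vs. (4)-machine DP)
```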

So my expectation was that the straight pipeline would perform roughly on par with DP.

But the experimental results were drastically different for the pipeline, while they matched the data-parallel prediction almost perfectly.

        model  batch     conf         mean  speed_up
21  resnet101     64   1_conf  1098.136000  1.000000
22  resnet101     64  mp_conf   770.499250  1.425227
23  resnet101     64  dp_conf   304.383375  3.607740
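
For clarity, speed_up here is just the 1_conf mean time divided by each configuration's mean time, so the unit of the mean column cancels out; a quick check:

```python
# Sanity check: speed_up = mean(1_conf) / mean(conf); the unit of `mean` cancels out.
means = {"1_conf": 1098.136000, "mp_conf": 770.499250, "dp_conf": 304.383375}
for conf, mean in means.items():
    print(conf, means["1_conf"] / mean)
# 1_conf 1.0, mp_conf ~1.425227, dp_conf ~3.607740 (matching the table)
```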

I'd be very grateful if you could help me figure out this discrepancy.

I have drawn Gantt charts for:
pipeline: https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing
data parallel: https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing
Each horizontal bar represents the time period from the start to the end of a fwd (or bwd) pass, annotated with + (or -).

It looks to me like each stage is stalled on communication for a considerable period of time.
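
For what it's worth, the charts were generated roughly like the sketch below (a minimal matplotlib sketch; the per-stage (start, duration, kind) data here is hypothetical and stands in for the real timestamps):

```python
# Minimal Gantt-chart sketch with matplotlib; the `stages` data below is hypothetical.
import matplotlib.pyplot as plt

# stage -> list of (start_time, duration, kind), kind is "+" for fwd and "-" for bwd
stages = {
    0: [(0.00, 0.10, "+"), (0.35, 0.20, "-")],
    1: [(0.10, 0.10, "+"), (0.25, 0.20, "-")],
}

fig, ax = plt.subplots()
for stage, spans in stages.items():
    for start, dur, kind in spans:
        ax.broken_barh([(start, dur)], (stage - 0.4, 0.8),
                       facecolors="tab:blue" if kind == "+" else "tab:orange")
        ax.text(start + dur / 2, stage, kind, ha="center", va="center")
ax.set_xlabel("time (s)")
ax.set_ylabel("stage")
ax.set_yticks(list(stages))
plt.show()
```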

@deepakn94
Collaborator

The main reason is that the Gloo peer-to-peer communication primitives are not well optimized. I am hopeful that this problem will at least partially go away when the PyTorch folks upstream the NCCL send and recv primitives, and we can potentially switch to using NCCL throughout.
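
As an illustration of the point-to-point path in question (not PipeDream's actual communication code), torch.distributed send/recv look roughly like the sketch below; the backend passed to init_process_group decides whether Gloo or (once upstreamed) NCCL carries them:

```python
# Illustrative torch.distributed point-to-point transfer (not PipeDream's actual code).
# Launch with two processes, e.g.: torchrun --nproc_per_node=2 p2p_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # today's p2p path; "nccl" once send/recv land
rank = dist.get_rank()

tensor = torch.zeros(4)
if rank == 0:
    tensor += 1.0
    dist.send(tensor, dst=1)   # blocking send (activations/gradients in a pipeline)
elif rank == 1:
    dist.recv(tensor, src=0)   # blocking receive from the previous stage
print(f"rank {rank}: {tensor}")

dist.destroy_process_group()
```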
