
Actual results did not match the optimizer expectation #50

Open
nirandaperera opened this issue Jun 8, 2020 · 1 comment

@nirandaperera

Hi,

I have been running resnet101 with a batch size of 64 on a straight pipeline with 4 GPUs.

I ran the following commands for the profiler and the optimizer.

CUDA_VISIBLE_DEVICES=4 python main.py -a "resnet101"  -b 64 --data_dir "$HOME/data/imagenet-mini/" --profile_directory "profiles1/64"

python optimizer_graph_hierarchical.py -f "../profiler/image_classification/profiles1/64/resnet101/graph.txt" -n 4 -s 11000000000 --straight_pipeline -o "./optim/64/resnet101/gpus=4_straight" -b 2500000000 --use_memory_constraint
python convert_graph_to_model.py -f "./optim/64/resnet101/gpus=4_straight/gpus=4.txt" -n resnet101 -a resnet101 -o "./optim/64/resnet101/gpus=4_straight/"

At the optimizer step, I get the following output:

Time taken by single-stage pipeline: 2.0447990000000003
Time per stage in pipeline: 0.5839989999999989
Throughput increase (compared to single machine): 3.5013741461886134
[Note that single-machine and (4)-machine DP might not fit given memory constraints]
Throughput increase of (4)-machine DP compared to single machine: 3.62130052703585
Throughput increase (compared to (4)-machine DP): 0.966883063155932
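
If I read the output right, the reported speedups follow directly from the times above; here is the arithmetic as I understand it (the variable names below are mine, not the optimizer's):

```python
# Arithmetic behind the optimizer output above (variable names are mine, not the optimizer's).
single_stage_time = 2.0447990000000003  # time of the single-stage pipeline
time_per_stage    = 0.5839989999999989  # time of the slowest stage in the 4-stage pipeline
dp_speedup        = 3.62130052703585    # reported (4)-machine DP speedup

pipeline_speedup = single_stage_time / time_per_stage
print(pipeline_speedup)               # ~3.5013741461886134 (vs. single machine)
print(pipeline_speedup / dp_speedup)  # ~0.966883063155932  (vs. (4)-machine DP)
```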

So my expectation was that the straight pipeline would perform roughly on par with DP.

But the experimental results were drastically different for the pipeline, while they matched the data-parallel prediction almost perfectly.

        model  batch     conf         mean  speed_up
21  resnet101     64   1_conf  1098.136000  1.000000
22  resnet101     64  mp_conf   770.499250  1.425227
23  resnet101     64  dp_conf   304.383375  3.607740
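
For clarity, speed_up here is just the 1_conf mean time divided by each configuration's mean time, so the unit of the mean column cancels out; a quick check:

```python
# Sanity check: speed_up = mean(1_conf) / mean(conf); the unit of `mean` cancels out.
means = {"1_conf": 1098.136000, "mp_conf": 770.499250, "dp_conf": 304.383375}
for conf, mean in means.items():
    print(conf, means["1_conf"] / mean)
# 1_conf 1.0, mp_conf ~1.425227, dp_conf ~3.607740 (matching the table)
```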

I'd be very grateful if you could help me figure out this discrepancy.

I have drawn Gantt charts for:
pipeline: https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing
data parallel: https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing
Each horizontal bar represents the time period from the start to the end of a fwd (or bwd) pass, annotated with + (or -).

It looks to me like each stage is stalled on communication for a considerable period of time.
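
For what it's worth, the charts were generated roughly like the sketch below (a minimal matplotlib sketch; the per-stage (start, duration, kind) data here is hypothetical and stands in for the real timestamps):

```python
# Minimal Gantt-chart sketch with matplotlib; the `stages` data below is hypothetical.
import matplotlib.pyplot as plt

# stage -> list of (start_time, duration, kind), kind is "+" for fwd and "-" for bwd
stages = {
    0: [(0.00, 0.10, "+"), (0.35, 0.20, "-")],
    1: [(0.10, 0.10, "+"), (0.25, 0.20, "-")],
}

fig, ax = plt.subplots()
for stage, spans in stages.items():
    for start, dur, kind in spans:
        ax.broken_barh([(start, dur)], (stage - 0.4, 0.8),
                       facecolors="tab:blue" if kind == "+" else "tab:orange")
        ax.text(start + dur / 2, stage, kind, ha="center", va="center")
ax.set_xlabel("time (s)")
ax.set_ylabel("stage")
ax.set_yticks(list(stages))
plt.show()
```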

@deepakn94
Collaborator

The main reason is that the Gloo peer-to-peer communication primitives are not well optimized. I am hopeful that this problem will at least partially go away when the PyTorch folks upstream the NCCL send and recv primitives, and we can potentially switch to using NCCL throughout.
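
As an illustration of the point-to-point path in question (not PipeDream's actual communication code), torch.distributed send/recv look roughly like the sketch below; the backend passed to init_process_group decides whether Gloo or (once upstreamed) NCCL carries them:

```python
# Illustrative torch.distributed point-to-point transfer (not PipeDream's actual code).
# Launch with two processes, e.g.: torchrun --nproc_per_node=2 p2p_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # today's p2p path; "nccl" once send/recv land
rank = dist.get_rank()

tensor = torch.zeros(4)
if rank == 0:
    tensor += 1.0
    dist.send(tensor, dst=1)   # blocking send (activations/gradients in a pipeline)
elif rank == 1:
    dist.recv(tensor, src=0)   # blocking receive from the previous stage
print(f"rank {rank}: {tensor}")

dist.destroy_process_group()
```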
