Different DDP training results of PytorchJob and Bare Metal #354
Interesting. It should be the same since you are using the same dataset and the same random seed. Can you please post the code here? /cc @zw0610 @kubeflow/wg-training-leads
Sure @gaocegege. I've posted the code here: https://github.com/Shuai-Xie/mnist-pytorchjob-example. Thanks for your kind reply.
Hi @gaocegege, I've got more results and proved that the random state isn't the problem.

Check the random values generated on BM and PJ

The function below is used to validate that BM and PJ generate the same random values when using the same seed. Each process launched by the DDP training invokes this function and prints the random values.

```python
import random

import numpy as np
import torch


def print_random_vals(rank, num=1):
    for n in range(num):
        print('rank {} group {}, random: {}, np: {}, torch: {}'.format(
            rank, n, random.random(), np.random.rand(1), torch.rand(1)))
```

Here are the results.

(1) Different seeds on BM (A reference)

# seed = 1
# 48
rank 0 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 2 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 3 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 1 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
# 49
rank 5 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 7 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 4 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 6 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
# seed = 10
# 48
rank 2 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 3 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 1 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 0 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
# 49
rank 7 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 5 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 6 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 4 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
# seed = 100
# 48
rank 2 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 3 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 1 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 0 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
# 49
rank 6 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 7 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 4 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 5 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])

(2) Different seeds on PJ (A target)

Again, we turn on the host network and print the random values generated by each process of the DDP training with different numbers of Pods (2/4/8, aligned with the experiments in 2. PJ DDP training). Clearly, they are the same as on BM.

# seed = 1
# 8 Pods
rank 0 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 1 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
...
# 4 Pods
rank 0 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 1 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
...
# 2 Pods
rank 0 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 1 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
...
# seed = 10
# 4 Pods
rank 0 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 1 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
...
# seed = 100
# 8 Pods
rank 0 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 1 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
...

More PJ training results with different seeds

The host network is still turned on.

(1) seed = 10

# launch.py
# 8 Pod * 1 card
training seconds: 19.675018310546875
best_acc: 55.5
# 4 Pod * 2 card
training seconds: 19.006036043167114
best_acc: 55.5
# 2 Pod * 4 card
training seconds: 19.31661581993103 # same as BM
best_acc: 65.69
# mp tcp
# 8 Pod * 1 card
training seconds: 22.436177015304565
best_acc: 55.5
# 4 Pod * 2 card
training seconds: 21.974145889282227
best_acc: 59.18
# 2 Pod * 4 card
training seconds: 22.730929851531982 # same as BM
best_acc: 65.69

(2) seed = 100

# launch.py
# 8 Pod * 1 card
training seconds: 19.475943565368652
best_acc: 61.46
# 4 Pod * 2 card
training seconds: 19.309614658355713
best_acc: 59.18
# 2 Pod * 4 card
training seconds: 20.99683380126953 # same as BM
best_acc: 69.3
# mp tcp
# 8 Pod * 1 card
training seconds: 22.409621953964233
best_acc: 59.18
# 4 Pod * 2 card
training seconds: 22.37164807319641
best_acc: 57.45
# 2 Pod * 4 card
training seconds: 22.644285917282104 # same as BM
best_acc: 69.3

As before, only the 2 Pod * 4 card setting reproduces the BM result; the other settings give quite random results.

What confuses me most is that I still get different results even when I use the officially recommended way to launch 8 Pods with the same seed. For example, when seed = 1, PJ gives a different result from BM no matter whether the host network is true or false.
# host network = false
Train Epoch: 0 [20/30] loss=0.6989
training seconds: 20.179935455322266
best_acc: 67.97
# host network = true
Train Epoch: 0 [20/30] loss=0.6989
training seconds: 18.878486394882202
best_acc: 67.97

For now, I can only doubt that the PJ

Thanks a lot.
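For reference, the seeding that these runs rely on would look roughly like the minimal sketch below, assuming Python's random, NumPy, and PyTorch are all seeded at startup (the exact helper in the linked repo may differ):

```python
import random

import numpy as np
import torch


def set_seed(seed):
    # Assumed helper: seed every RNG the training script touches so that each
    # DDP process starts from the same random state, on BM or in a PJ Pod.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)            # seeds the CPU and default CUDA generators
    torch.cuda.manual_seed_all(seed)
    # Optional: trade cuDNN speed for deterministic convolution algorithms.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Calling a helper like this at the start of each process would explain why the printed random/np/torch values above are identical across ranks.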
Things are getting weirder @gaocegege. PytorchJob version may have an effect on the training reproduction #355 (comment). Please let me know if I wrote the wrong code. Thanks a lot.
Thanks for your detailed reply!
Do you mean PyTorch version?
😂 Yes @gaocegege. I'm sorry for making this mistake. I'll change it right away. By the way, are there any clues now? Many thanks.
Dear developers, I got a new problem.
I've compared the DDP training process of PytorchJob (PJ) and Bare Metal (BM) and got different training results.
Experiment settings
Experiment results
1. BM DDP training
I record the training process of three ways to launch DDP training (see the sketch after this list):

- torch.distributed.launch with the default init_method env://
- mp.spawn() with tcp://
- mp.spawn() with file://
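A minimal sketch of how these three init methods are typically wired up is below; train_worker, the address, and the process counts are placeholders rather than the exact code from my script:

```python
import torch.distributed as dist
import torch.multiprocessing as mp


def train_worker(rank, world_size, init_method):
    # Placeholder worker: each process joins the same process group and would
    # then build the model, wrap it in DistributedDataParallel, and train.
    dist.init_process_group(
        backend='gloo',            # the GPU experiments above would use 'nccl'
        init_method=init_method,   # 'env://', 'tcp://<addr>:<port>', or 'file:///shared/rendezvous'
        world_size=world_size,
        rank=rank,                 # across 2 nodes: node_rank * nproc_per_node + local_rank
    )
    print(f'rank {dist.get_rank()} of {dist.get_world_size()} initialized')
    dist.destroy_process_group()


if __name__ == '__main__':
    # mp.spawn() variant: spawn one process per local worker with an explicit
    # tcp:// (or file://) rendezvous. A 2-node, 8-process run would pass
    # world_size=8 and the master node's address instead of localhost.
    mp.spawn(train_worker, args=(2, 'tcp://127.0.0.1:23456'), nprocs=2)

    # torch.distributed.launch variant: the launcher exports the env:// variables
    # (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, ...) and spawns the workers, e.g.
    #   python -m torch.distributed.launch --nnodes=2 --node_rank=0 \
    #       --nproc_per_node=4 --master_addr=<master_ip> train.py
```

With env:// the rendezvous settings come from environment variables set by the launcher, while tcp:// and file:// are configured explicitly in the script.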
And the results are below.
(1.1) 2 machines, nproc_per_node=4, nnodes=2
(1.2) 2 machines, nproc_per_node=2, nnodes=4
(1.3) 2 machines, nproc_per_node=1, nnodes=8
The 3 experiments above show that, as long as the total number of processes (nproc_per_node * nnodes) is the same (e.g. 8 in this setting), the training process has no relation to the number of distributed nodes nnodes, because the training loss is reproduced and the test accuracies are equal.

2. PJ DDP training
When using PJ DDP training, I also expect to see the same results as BM.
However, the experiment results confuse me.
Before doing the same group of experiments as on BM, I use the recommended way to launch DDP training.
The YAML file is below.
It launches 8 Pods, which is similar to experiment (1.3).
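Roughly, the manifest follows the standard PyTorchJob shape sketched below; the image, names, and command here are placeholders rather than my exact values:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: mnist-ddp                    # placeholder name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          hostNetwork: true          # tried with both true and false later on
          containers:
            - name: pytorch          # the operator expects this container name
              image: <training-image>                               # placeholder image
              command: ["python", "mnist_ddp.py", "--seed", "1"]    # placeholder command
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 7                    # 1 Master + 7 Workers = 8 Pods, like experiment (1.3)
      restartPolicy: OnFailure
      template:
        spec:
          hostNetwork: true
          containers:
            - name: pytorch
              image: <training-image>
              command: ["python", "mnist_ddp.py", "--seed", "1"]
              resources:
                limits:
                  nvidia.com/gpu: 1
```

With such a manifest, the operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into every Pod, so the script can use init_method='env://'.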
However, I get the results below, which are quite different from the BM results.
At first, I suspected that the BM OS and the PytorchJob Pod OS generate different random states.
However, the following experiments show that this is not the cause.
We set hostNetwork=true in all the experiments below.

(2.1) 2 Pod * 4 cards
(2.2) 4 Pod * 2 cards
(2.3) 8 Pod * 1 card
Only exp (2.1) gets the same results as BM. It really confuses me.
Dear developers, please let me know if I made some mistakes.
Thanks a lot.
Happy Mid-Autumn Festival!