PytorchJob DDP training will stop if I delete a worker pod #364

Shuai-Xie · 2021-11-20T15:37:05Z

Hi, everyone.

I want to test the fault tolerance of PytorchJob.

I started a PytorchJob with 1 master and 3 workers.

$ kubectl get pods -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP           NODE
mnist-ddp-master-0   1/1     Running   0          2m55s   11.80.0.36   11.71.1.160
mnist-ddp-worker-0   1/1     Running   0          2m55s   11.80.0.37   11.71.1.160
mnist-ddp-worker-1   1/1     Running   0          2m55s   11.80.0.38   11.71.1.160
mnist-ddp-worker-2   1/1     Running   0          89s     11.80.0.46   11.71.1.160

It trains fine.

Then I deleted a worker.

$ kubectl delete pod mnist-ddp-worker-1

Because I set restartPolicy: OnFailure, the pod is recreated quickly with the same name, mnist-ddp-worker-1.

However, I don't see the recreated worker rejoin the DDP training.
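
To narrow this down, it may help to log the rendezvous settings the recreated pod sees before it tries to join. Below is a minimal sketch, assuming the training script uses environment-variable initialization and that the operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK into each pod. The helper name describe_rendezvous is hypothetical and not part of any library.

```python
import os

# Variables the pod is assumed to receive from the operator (an assumption,
# not confirmed by the logs in this issue).
RENDEZVOUS_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def describe_rendezvous(env=None):
    """Return a one-line summary of the rendezvous settings found in env."""
    env = os.environ if env is None else env
    # Report missing variables instead of raising, so the log line is always printed.
    missing = [name for name in RENDEZVOUS_VARS if name not in env]
    if missing:
        return "missing: " + ", ".join(missing)
    return "rank {}/{} -> {}:{}".format(
        env["RANK"], env["WORLD_SIZE"], env["MASTER_ADDR"], env["MASTER_PORT"]
    )


if __name__ == "__main__":
    # Print before the process group is initialized so the line
    # shows up in `kubectl logs` even if the join hangs afterwards.
    print(describe_rendezvous(), flush=True)
```

Comparing this line in `kubectl logs mnist-ddp-worker-1` before and after the deletion would show whether the recreated pod received the same rank and master address as the original one.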

Thanks.

gaocegege · 2021-11-21T01:17:52Z

This repository will be deprecated soon. Please open an issue at github.com/kubeflow/training-operator instead.

Shuai-Xie · 2021-11-22T08:21:10Z

Okay, gege @gaocegege
