This repository has been archived by the owner on Sep 19, 2022. It is now read-only.
Hi, everyone.
I want to test the fault tolerance of PyTorchJob.
I started a PyTorchJob with 1 master and 3 workers.
It trains fine.
Then I deleted a worker.
As I set `restartPolicy: OnFailure`, this pod will restart quickly with the same name, `mnist-ddp-worker-1`. But sadly, I can't see this newborn worker join the DDP training.
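For context, here is a minimal sketch of how a worker in a setup like this typically joins the process group. It assumes the training script uses the `env://` rendezvous and reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` from the pod's environment. The helper names `rendezvous_from_env` and `join_process_group` are hypothetical, the `gloo` backend is an arbitrary choice, and the actual training script may differ.

```python
import os


def rendezvous_from_env(env=None):
    """Collect the env:// rendezvous settings for this pod (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        "init_method": "env://",
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "rank": int(env["RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
    }


def join_process_group(backend="gloo"):
    """Join the DDP process group once at startup, using the pod's environment."""
    cfg = rendezvous_from_env()
    # Imported lazily so rendezvous_from_env can be used without torch installed.
    import torch.distributed as dist

    # With init_method="env://", MASTER_ADDR and MASTER_PORT are read from os.environ.
    dist.init_process_group(
        backend=backend,
        init_method=cfg["init_method"],
        rank=cfg["rank"],
        world_size=cfg["world_size"],
    )
    return cfg


# Illustrative values only (assumed, not taken from the real job):
# 1 master + 3 workers gives a world size of 4.
example = rendezvous_from_env({
    "MASTER_ADDR": "mnist-ddp-master-0",
    "MASTER_PORT": "23456",
    "RANK": "2",
    "WORLD_SIZE": "4",
})
```

In this sketch the group is formed once, with a fixed `world_size`, when each process starts. A restarted pod would run the same call again with the same rank, so whether it can rejoin depends on what the surviving processes are doing at that moment.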
Thanks.