Hello!

I'm setting up training with PyTorchJobs and have run into a problem: if one of the pods (master or worker, it doesn't matter) restarts, the whole training process hangs. The restarts can have different causes; usually it's Google Cloud Engine rescheduling the node. I also tried killing pods myself, and the behavior was the same.

Can I avoid this behavior and make training tolerant to pod restarts?
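As a general illustration (not from the original issue), one common way to make this kind of setup tolerant to pod restarts is to checkpoint periodically to storage visible to all replicas and resume from the latest checkpoint when a pod comes back up, so the job can be restarted without losing progress. Below is a minimal sketch assuming a DDP training loop launched by a PyTorchJob (which sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE); the checkpoint path, model, and data are hypothetical placeholders.

```python
# Minimal sketch of restart-tolerant training: resume from the latest checkpoint
# if one exists, and periodically save a new one. CKPT_PATH is assumed to be on
# a volume shared by all pods; the model and data are toy placeholders.
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT_PATH = "/mnt/shared/checkpoint.pt"  # hypothetical shared-volume path

def main():
    # PyTorchJob injects MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE into each pod,
    # so env:// initialization works out of the box.
    dist.init_process_group(backend="gloo")

    model = DDP(torch.nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Resume from the last checkpoint if this pod (or the whole job) was restarted.
    start_epoch = 0
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_epoch = ckpt["epoch"] + 1

    for epoch in range(start_epoch, 100):
        for _ in range(10):  # stand-in for a real DataLoader
            x, y = torch.randn(32, 10), torch.randn(32, 1)
            loss = F.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Only rank 0 writes the checkpoint to avoid concurrent writes.
        if dist.get_rank() == 0:
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "epoch": epoch}, CKPT_PATH)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Note that checkpointing alone doesn't unblock a collective call whose peer has disappeared; in practice it's combined with restarting the remaining pods (or the whole job) so all replicas rejoin and resume together, or with an elastic launcher such as torch.distributed.elastic / torchrun that handles membership changes. Writing the checkpoint to a temporary file and renaming it afterwards also guards against a pod dying mid-write.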