
Can I freeze pytorchjob training pods and migrate them to other nodes? #356

Open

Shuai-Xie opened this issue Sep 22, 2021 · 9 comments

@Shuai-Xie

No description provided.

@gaocegege
Member

You can do it with checkpoints.

@Shuai-Xie
Author

Yes, @gaocegege. Checkpoints can do this job.

In this way, we have to define what and when to save.

  • what: users have to tell us what they want to record, e.g. epoch, model_state_dict, optimizer_state_dict, and so on (see the sketch below).
  • when: this affects when we resume training and the total training cost of the task inside the pytorchjob.
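
For example, here is a minimal sketch of the usual PyTorch save/resume pattern; the function names, checkpoint keys, and path are just placeholders, not anything provided by pytorchjob:

import torch

def save_ckpt(path, epoch, model, optimizer):
    # save everything needed to continue training on another node
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, path)

def load_ckpt(path, model, optimizer):
    ckpt = torch.load(path, map_location='cpu')
    model.load_state_dict(ckpt['model_state_dict'])
    optimizer.load_state_dict(ckpt['optimizer_state_dict'])
    return ckpt['epoch'] + 1  # resume from the next epoch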

Are there any ways to make this migration smoother and more seamless, like a stateless service?

I mean,

  • we don't need users to tell us what they want to record.
  • the training process is identical to the training without migration.

Currently, I launch a thread to save the checkpoint when the container lifecycle preStop hook sends a signal. But in this way, users still have to change their code to tell us what they want to record.
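
Roughly like this simplified sketch (the signal choice and the save_ckpt helper from above are just for illustration, and it assumes epoch, model, and optimizer are accessible in the training script):

import signal
import threading

def _on_prestop(signum, frame):
    # the preStop hook delivers a signal (SIGTERM here as an example);
    # save in a background thread so the handler returns quickly
    threading.Thread(
        target=save_ckpt,
        args=('/shared/migrate.pth', epoch, model, optimizer),
    ).start()

signal.signal(signal.SIGTERM, _on_prestop)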

Thanks a lot.

@gaocegege
Member

we don't need users to tell us what they want to record.

I do not think it is easy. As you know, live container migration is not mature yet. Tools like CRIU do not work well. Do you have any idea about it?

@Shuai-Xie
Author

I have no idea either, and I agree with you that this is not easy.

Saving a checkpoint seems to be a workaround for this problem for now.

Also, this paper has discussed the problem: Gandiva: Introspective Cluster Scheduling for Deep Learning.

[screenshot from the Gandiva paper]

@gaocegege
Member

It is invasive to the user code; personally, I do not think it is practical.

@Shuai-Xie
Author

Yes. When we provide a service, we don't want users to change their habits.

This problem seems unsolvable now.

However, if this requirement is necessary, we may have to design a more user-friendly Python library that uses decorators or wrapper functions to register values as migratable easily.

For example,

model = migratedVariable(model)
optimizer = migratedVariable(optimizer)
...
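
One possible shape for such a library, just to sketch the idea (the names and the registry here are hypothetical, not an existing API):

import torch

# a global registry remembers every wrapped object so the library can
# checkpoint all of them when a migration signal arrives
_REGISTRY = []

def migratedVariable(obj):
    _REGISTRY.append(obj)
    return obj

def dump_all(path):
    # objects with a state_dict (model, optimizer) are saved via state_dict;
    # plain containers like dicts are saved directly
    states = [o.state_dict() if hasattr(o, 'state_dict') else o for o in _REGISTRY]
    torch.save(states, path)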

Thanks a lot.

@gaocegege
Member

gaocegege commented Sep 24, 2021

It is a complicated issue, I think. I am glad to review your design proposal if you are interested in building such a library.

@Shuai-Xie
Author

Shuai-Xie commented Sep 24, 2021

Thanks a lot.

  • For pass-by-reference types like model, optimizer, or dict, this may be easy.
  • But for pass-by-value types like int or float, I don't know yet how to track their values properly (see the toy example below).
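
A toy illustration of the difference, using the hypothetical migratedVariable wrapper sketched above:

# reference types: the wrapper keeps a handle, so later mutations stay visible
metrics = migratedVariable({'epoch': -1})
metrics['epoch'] = 5         # the registry sees this change

# value types: rebinding the name creates a new object,
# so the registered value goes stale
epoch = migratedVariable(0)
epoch = epoch + 1            # the registry still holds 0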

@Shuai-Xie
Author

Hi @gaocegege, I've designed two migration solutions here: https://github.com/Shuai-Xie/pytorchjob-migration.

Both solutions use the preStop container hook and record the signal in a shared file.

This repo has two branches.

  • master: implements MigratableVariable with a wrapper function and a singleton class, which is more user-friendly and can be used freely across multiple Python modules.
  • develop: implements a Migrator class, which is an older version and has the limitations noted in the README.

To use the migration feature:

  • master
# metrics to be recorded
metrics = {'epoch': -1, 'best_epoch': -1, 'best_acc': 0.}

# migration
from migration import MigratableVariable
model = MigratableVariable(model)
optimizer = MigratableVariable(optimizer)
metrics = MigratableVariable(metrics)
  • develop
# metrics to be recorded
metrics = {'epoch': -1, 'best_epoch': -1, 'best_acc': 0.}

# migration
from migration import migrator
migrator.register('model', model)
migrator.register('optimizer', optimizer)
migrator.register('metrics', metrics)
migrator.listening()
if migrator.resume:  # note: migrate_ckpt has higher priority than args.ckpt
    migrator.load_ckpt()  # load ckpt at all ranks

Could you please help me review the design?

Many thanks.
