
Can I freeze pytorchjob training pods and migrate them to other nodes? #356

Open

Shuai-Xie opened this issue Sep 22, 2021 · 9 comments

@Shuai-Xie

No description provided.

@gaocegege
Member

You can do it with checkpoints.

@Shuai-Xie
Author

Yes, @gaocegege. Checkpoints can do this job.

In this way, we have to define what and when to save.

  • what: users have to tell us what they want to record, e.g. epoch, model_state_dict, optimizer_state_dict, and so on (see the sketch below).
  • when: this affects when we resume training and the total training cost of the task inside the pytorchjob.
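
For example, here is a minimal sketch of the usual PyTorch save/resume pattern; the function names, checkpoint keys, and path are just placeholders, not anything provided by pytorchjob:

import torch

def save_ckpt(path, epoch, model, optimizer):
    # save everything needed to continue training on another node
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, path)

def load_ckpt(path, model, optimizer):
    ckpt = torch.load(path, map_location='cpu')
    model.load_state_dict(ckpt['model_state_dict'])
    optimizer.load_state_dict(ckpt['optimizer_state_dict'])
    return ckpt['epoch'] + 1  # resume from the next epoch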

Are there any ways to make this migration smoother and more seamless, like a stateless service?

I mean,

  • we don't need users to tell us what they want to record.
  • the training process is identical to the training without migration.

Currently, I launch a thread to save the checkpoint when the container lifecycle preStop hook sends a signal. But in this way, users still have to change their code to tell us what they want to record.
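
Roughly like this simplified sketch (the signal choice and the save_ckpt helper from above are just for illustration, and it assumes epoch, model, and optimizer are accessible in the training script):

import signal
import threading

def _on_prestop(signum, frame):
    # the preStop hook delivers a signal (SIGTERM here as an example);
    # save in a background thread so the handler returns quickly
    threading.Thread(
        target=save_ckpt,
        args=('/shared/migrate.pth', epoch, model, optimizer),
    ).start()

signal.signal(signal.SIGTERM, _on_prestop)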

Thanks a lot.

@gaocegege
Member

we don't need users to tell us what they want to record.

I do not think it is easy. As you know, live container migration is not mature yet. Tools like CRIU do not work well. Do you have any idea about it?

@Shuai-Xie
Author

I have no idea either, and I agree with you that this is not easy.

Saving a checkpoint seems to be a workaround for this problem for now.

Also, this paper has discussed the problem: Gandiva: Introspective Cluster Scheduling for Deep Learning.

[screenshot from the Gandiva paper]

@gaocegege
Member

It is invasive to the user code; personally, I do not think it is practical.

@Shuai-Xie
Author

Yes. When we provide a service, we don't want users to change their habits.

This problem seems unsolvable now.

However, if this requirement is necessary, we may have to design a more user-friendly Python library that uses decorators or wrapper functions to register values as migratable easily.

For example,

model = migratedVariable(model)
optimizer = migratedVariable(optimizer)
...
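
One possible shape for such a library, just to sketch the idea (the names and the registry here are hypothetical, not an existing API):

import torch

# a global registry remembers every wrapped object so the library can
# checkpoint all of them when a migration signal arrives
_REGISTRY = []

def migratedVariable(obj):
    _REGISTRY.append(obj)
    return obj

def dump_all(path):
    # objects with a state_dict (model, optimizer) are saved via state_dict;
    # plain containers like dicts are saved directly
    states = [o.state_dict() if hasattr(o, 'state_dict') else o for o in _REGISTRY]
    torch.save(states, path)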

Thanks a lot.

@gaocegege
Member

gaocegege commented Sep 24, 2021

It is a complicated issue, I think. I am glad to review your design proposal if you are interested in building such a library.

@Shuai-Xie
Author

Shuai-Xie commented Sep 24, 2021

Thanks a lot.

  • For pass-by-reference types like model, optimizer, or dict, this may be easy.
  • But for pass-by-value types like int or float, I don't know yet how to track their values properly (see the toy example below).
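
A toy illustration of the difference, using the hypothetical migratedVariable wrapper sketched above:

# reference types: the wrapper keeps a handle, so later mutations stay visible
metrics = migratedVariable({'epoch': -1})
metrics['epoch'] = 5         # the registry sees this change

# value types: rebinding the name creates a new object,
# so the registered value goes stale
epoch = migratedVariable(0)
epoch = epoch + 1            # the registry still holds 0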

@Shuai-Xie
Author

Hi @gaocegege, I've designed two migration solutions here: https://github.com/Shuai-Xie/pytorchjob-migration.

Both solutions use the preStop container hook and record the signal in a shared file.

This repo has two branches.

  • master: implements MigratableVariable with a wrapper function and a singleton class, which is more user-friendly and can be used freely across multiple Python modules.
  • develop: implements a Migrator class, which is an older version and has the limitations noted in the README.

To use the migration feature:

  • master
# metrics to be recorded
metrics = {'epoch': -1, 'best_epoch': -1, 'best_acc': 0.}

# migration
from migration import MigratableVariable
model = MigratableVariable(model)
optimizer = MigratableVariable(optimizer)
metrics = MigratableVariable(metrics)
  • develop
# metrics to be recorded
metrics = {'epoch': -1, 'best_epoch': -1, 'best_acc': 0.}

# migration
from migration import migrator
migrator.register('model', model)
migrator.register('optimizer', optimizer)
migrator.register('metrics', metrics)
migrator.listening()
if migrator.resume:  # note: migrate_ckpt has higher priority than args.ckpt
    migrator.load_ckpt()  # load ckpt at all ranks

Could you please help me review the design?

Many thanks.
