Can I freeze pytorchjob training pods and migrate them to other nodes? #356
You can do it with checkpointing.
Yes, @gaocegege. Checkpoints can do this job, but that way we have to define what to save and when to save it.
Is there any way to make this migration smoother and more seamless, like moving a stateless service? Currently, I launch a thread to save the checkpoint when the container lifecycle hook fires. Thanks a lot.
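As a concrete illustration of that lifecycle-triggered approach, here is a minimal sketch (the path and the tracked state are hypothetical, not the actual job code). Kubernetes sends SIGTERM to the container before stopping it, so a handler can snapshot the tracked state in a background thread:

```python
import pickle
import signal
import threading

# Hypothetical tracked state; a real job would store model/optimizer
# state_dicts here instead of plain values.
state = {"epoch": 0, "best_acc": 0.0}
CKPT_PATH = "/tmp/migrate_ckpt.pkl"  # assumed persistent/shared path

def save_ckpt(path=CKPT_PATH):
    # Snapshot the tracked state so a restarted pod can resume from it.
    with open(path, "wb") as f:
        pickle.dump(state, f)

def on_sigterm(signum, frame):
    # Kubernetes sends SIGTERM before killing the container; saving in a
    # background thread keeps the handler itself fast.
    threading.Thread(target=save_ckpt, daemon=True).start()

signal.signal(signal.SIGTERM, on_sigterm)
```

The save must finish within the pod's termination grace period, so large checkpoints may need a longer `terminationGracePeriodSeconds`.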
I do not think it is easy. As you know, container live migration is not mature yet; tools like CRIU do not work well. Do you have any ideas about it?
I have no ideas, and I agree with you that this is not easy. Saving a checkpoint seems to be the workaround for now. This paper also discusses the problem: Gandiva: Introspective Cluster Scheduling for Deep Learning.
It is invasive to the user code; personally, I do not think it is practical.
Yes. When we provide a service, we don't want users to change their habits, so this problem seems unsolvable for now. However, if this requirement is necessary, we may have to design a friendlier Python library that uses decorators or wrapper functions to register values as migratable, for example:

```python
model = migratedVariable(model)
optimizer = migratedVariable(optimizer)
...
```

Thanks a lot.
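One way such a wrapper could work is as a transparent proxy that records every wrapped object in a module-level registry. This is only a sketch of the idea above: `MigratableVariable`, `_REGISTRY`, and `save_all` are hypothetical names, and a real version would use `state_dict()` for models and optimizers rather than pickling them directly.

```python
import pickle

_REGISTRY = []  # every wrapped object, in wrap order

class MigratableVariable:
    """Transparent proxy that records its target for checkpointing."""

    def __init__(self, obj):
        self._obj = obj
        _REGISTRY.append(self)

    def __getattr__(self, name):
        # Called only when the attribute is not found on the wrapper,
        # so attribute access falls through to the wrapped object.
        return getattr(self._obj, name)

    def __getitem__(self, key):
        return self._obj[key]

    def __setitem__(self, key, value):
        self._obj[key] = value

def save_all(path):
    # Dump the underlying objects (not the wrappers) for later restore.
    with open(path, "wb") as f:
        pickle.dump([v._obj for v in _REGISTRY], f)
```

Because the wrapper delegates attribute and item access, user code can keep calling `model.train()` or reading `metrics['epoch']` unchanged after wrapping.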
It is a complicated issue, I think. I am glad to review your design proposal if you are interested in building such a library.
Thanks a lot.
Hi @gaocegege, I've designed two migration solutions here: https://github.com/Shuai-Xie/pytorchjob-migration. Both solutions save checkpoints; the repo has two branches, one per solution.
To use the migration feature:

Branch 1 wraps each value to be migrated:

```python
# metrics to be recorded
metrics = {'epoch': -1, 'best_epoch': -1, 'best_acc': 0.}

# migration
from migration import MigratableVariable
model = MigratableVariable(model)
optimizer = MigratableVariable(optimizer)
metrics = MigratableVariable(metrics)
```

Branch 2 registers values with a global migrator:

```python
# metrics to be recorded
metrics = {'epoch': -1, 'best_epoch': -1, 'best_acc': 0.}

# migration
from migration import migrator
migrator.register('model', model)
migrator.register('optimizer', optimizer)
migrator.register('metrics', metrics)
migrator.listening()

if migrator.resume:  # note: migrate_ckpt has higher priority than args.ckpt
    migrator.load_ckpt()  # load ckpt at all ranks
```

Could you please help me review the design? Many thanks.
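For reference, here is a minimal sketch of what the registry-style API above might look like internally. This is an assumption based only on the calls shown (`register`, `listening`, `resume`, `load_ckpt`), not the repo's actual code; it restores only plain dicts, whereas a real version would use `state_dict()`/`load_state_dict()` for the model and optimizer:

```python
import os
import pickle
import signal

class Migrator:
    """Sketch of a registry-style migrator (assumed semantics)."""

    def __init__(self, ckpt_path="/tmp/migrate_ckpt.pkl"):
        self.ckpt_path = ckpt_path
        self._registry = {}
        # resume is True when a migration checkpoint already exists,
        # i.e. this pod replaces one that was evicted mid-training.
        self.resume = os.path.exists(ckpt_path)

    def register(self, name, obj):
        self._registry[name] = obj

    def listening(self):
        # Save all registered objects when the pod receives SIGTERM.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.save_ckpt())

    def save_ckpt(self):
        with open(self.ckpt_path, "wb") as f:
            pickle.dump(self._registry, f)

    def load_ckpt(self):
        with open(self.ckpt_path, "rb") as f:
            saved = pickle.load(f)
        # Restore dicts in place; models/optimizers would need
        # load_state_dict() instead.
        for name, obj in self._registry.items():
            if isinstance(obj, dict):
                obj.update(saved.get(name, {}))
```

In a distributed job, each rank would call `load_ckpt()` after registration, which matches the "load ckpt at all ranks" comment in the snippet above.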