Failed to load universal checkpoint with DeepSpeed integration #33157
Here's my DeepSpeed config JSON:

```json
{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 1e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 16,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "no_pipeline_parallel": true,
  "load_universal_checkpoint": true
}
```
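The `"auto"` placeholders in this config are meant to be filled in from `TrainingArguments` at startup. As a rough illustration of that mechanism (a minimal sketch only, not the actual `HfTrainerDeepSpeedConfig` implementation), the replacement logic amounts to:

```python
import json

# A pared-down slice of the config above, kept small for illustration.
ds_config = json.loads("""
{
  "optimizer": {
    "type": "AdamW",
    "params": {"lr": "auto", "weight_decay": "auto"}
  },
  "train_batch_size": "auto"
}
""")

def fill_auto(config, dotted_key, value):
    """Replace an "auto" placeholder at a dotted key path with a concrete
    value, loosely mimicking how the Trainer resolves "auto" entries from
    TrainingArguments. Sketch only; not the transformers source."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        node = node[part]
    if node.get(leaf) == "auto":
        node[leaf] = value

fill_auto(ds_config, "optimizer.params.lr", 2e-5)
fill_auto(ds_config, "train_batch_size", 32)
```

Values that are never matched (here, `weight_decay`) stay as `"auto"`, which DeepSpeed would then reject at init time.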
Another related issue: microsoft/DeepSpeed#5405
Hello @ArthurZucker and @muellerzr. I can open a pull request to address this issue. I resolved it by deleting all the "rng_state" files, since they were saved with a different world size. Before starting the PR, I would like to confirm that NOT loading these "rng_state" files has no side effects.
We can skip these rng_state files and add a warning.
Sure, feel free to open a PR!
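The skip-with-a-warning approach agreed on above could look roughly like this (a hypothetical helper; the function name, file-name pattern `rng_state_*.pth`, and placement are illustrative assumptions, not existing transformers code):

```python
import glob
import logging
import os

logger = logging.getLogger(__name__)

def loadable_rng_states(checkpoint_dir: str, world_size: int) -> list:
    """Return the rng_state files to restore, but only when their count
    matches the current world size; otherwise skip them and warn, as
    proposed in the discussion above. Hypothetical sketch."""
    pattern = os.path.join(checkpoint_dir, "rng_state_*.pth")
    files = sorted(glob.glob(pattern))
    if files and len(files) != world_size:
        logger.warning(
            "Checkpoint has %d rng_state files but world size is %d; "
            "skipping RNG state restore (RNG seeding will not resume).",
            len(files), world_size,
        )
        return []
    return files
```

Skipping the RNG restore means dataloader shuffling and dropout streams will not be bit-identical to an uninterrupted run, which is generally the accepted trade-off when resuming at a different world size.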
System Info

transformers version: 4.44.2

Who can help?

@muellerzr

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction

The Universal Checkpointing feature allows loading with different world sizes. However, when using the Hugging Face `Trainer`, loading the converted universal checkpoint fails. The failure seems to be due to `HfTrainerDeepSpeedConfig` not correctly handling the `"load_universal_checkpoint": true` or `"universal_checkpoint": true` arguments in the DeepSpeed configuration. Consequently, the `load_universal_checkpoint` function returns `False`.

Related issues:

- `universal_checkpoint_info` in the Accelerate+DeepSpeed checkpoint: microsoft/DeepSpeed#5430

Expected behavior

The universal checkpoint should be loaded correctly.
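The check the reporter expects can be sketched as follows (the function name is illustrative and this is not the actual `HfTrainerDeepSpeedConfig` source; it only shows the flag lookup that appears to be missing):

```python
def wants_universal_checkpoint(ds_config: dict) -> bool:
    """Return True when either spelling of the universal-checkpoint flag
    is set in the user's DeepSpeed config. Illustrative sketch of the
    expected behavior, not transformers code."""
    return bool(
        ds_config.get("load_universal_checkpoint")
        or ds_config.get("universal_checkpoint")
    )
```

With the reporter's config above, this would return `True`, whereas the behavior described in the issue is that the flag is dropped and loading proceeds as if it were `False`.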