
[QUESTION] How to convert torch_dist format checkpoint to torch format? #1291

Open · zhangyilalala opened this issue Nov 19, 2024 · 0 comments


I am trying to convert a checkpoint produced by torch_dist asynchronous saving back to the original torch format, but running convert.py directly fails with an error. Am I using it incorrectly?

python tools/checkpoint/convert.py --model-type GPT --load-dir /mnt/self-define/output/output-Llama3_1-8B-pretrain/checkpoint/pretrain-mcore-llama3-1-8B-lr-3e-5-minlr-3e-6-bs-1-gbs-1024-seqlen-8192-pr-bf16-tp-2-pp-2-cp-1-ac-none-do-true-sp-false-cp-1-ts-140000 --save-dir /mnt/self-define/zhangyi/temp/output/output-Llama3_1-8B-pretrain/checkpoint
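(A possibly simpler route worth checking before debugging convert.py: recent Megatron-LM versions can convert checkpoint formats at load time through the normal training entry point. A sketch, assuming your core_r0.9.0 tree exposes the --ckpt-convert-format and --ckpt-convert-save arguments (verify in megatron/training/arguments.py); the paths and the args placeholder are illustrative, not from this issue:

python pretrain_gpt.py <usual model/data/parallelism args> --load /path/to/torch_dist/checkpoint --ckpt-convert-format torch --ckpt-convert-save /path/to/output

If those arguments behave as their names suggest, the job loads the torch_dist checkpoint under the regular distributed setup and re-saves it in the legacy torch format.)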

The error:

Traceback (most recent call last):
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/tools/checkpoint/convert.py", line 158, in <module>
    main()
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/tools/checkpoint/convert.py", line 151, in main
    loader.load_checkpoint(queue, args)
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/tools/checkpoint/loader_mcore.py", line 381, in load_checkpoint
    _load_checkpoint(queue, args)
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/tools/checkpoint/loader_mcore.py", line 243, in _load_checkpoint
    all_models = [get_models(tp_size, md.params_dtype)]
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/tools/checkpoint/loader_mcore.py", line 161, in get_models
    load_checkpoint(model_, None, None)
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/megatron/training/checkpointing.py", line 1090, in load_checkpoint
    state_dict, checkpoint_name, release, ckpt_type = _load_base_checkpoint(
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/megatron/training/checkpointing.py", line 851, in _load_base_checkpoint
    return _load_global_dist_base_checkpoint(
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/megatron/training/checkpointing.py", line 779, in _load_global_dist_base_checkpoint
    state_dict = dist_checkpointing.load(sharded_state_dict, checkpoint_name, load_strategy, strict=args.dist_ckpt_strictness)
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/megatron/core/dist_checkpointing/serialization.py", line 126, in load
    local_metadata, global_metadata = determine_global_metadata(sharded_state_dict)
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/megatron/core/dist_checkpointing/validation.py", line 497, in determine_global_metadata
    global_metadata = [None] * torch.distributed.get_world_size()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1831, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 863, in _get_group_size
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1024, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
Loader exited, exiting saver
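The last frame is the actual failure: megatron.core.dist_checkpointing.load() asks torch.distributed for the world size, but convert.py's loader process never calls init_process_group, so loading a torch_dist checkpoint dies before any tensor is read. A minimal workaround sketch, assuming a single reader process is sufficient and that this runs before load_checkpoint() is reached in tools/checkpoint/loader_mcore.py (the gloo backend, address, and port are my assumptions, not anything from this trace):

import os
import torch.distributed as dist

# dist_checkpointing.load() queries the default process group even when
# only one process reads the checkpoint, hence the ValueError above.
# Initializing a world-size-1 group satisfies that query (a workaround
# sketch, not an official fix).
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
if not dist.is_initialized():
    dist.init_process_group(backend="gloo", rank=0, world_size=1)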