I am trying to convert a checkpoint that was saved asynchronously in the torch_dist format back to the original torch format, but running convert.py directly results in the error below. Could there be an issue with my usage?

The command:

python tools/checkpoint/convert.py --model-type GPT --load-dir /mnt/self-define/output/output-Llama3_1-8B-pretrain/checkpoint/pretrain-mcore-llama3-1-8B-lr-3e-5-minlr-3e-6-bs-1-gbs-1024-seqlen-8192-pr-bf16-tp-2-pp-2-cp-1-ac-none-do-true-sp-false-cp-1-ts-140000 --save-dir /mnt/self-define/zhangyi/temp/output/output-Llama3_1-8B-pretrain/checkpoint

The error:
Traceback (most recent call last):
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/tools/checkpoint/convert.py", line 158, in <module>
    main()
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/tools/checkpoint/convert.py", line 151, in main
    loader.load_checkpoint(queue, args)
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/tools/checkpoint/loader_mcore.py", line 381, in load_checkpoint
    _load_checkpoint(queue, args)
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/tools/checkpoint/loader_mcore.py", line 243, in _load_checkpoint
    all_models = [get_models(tp_size, md.params_dtype)]
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/tools/checkpoint/loader_mcore.py", line 161, in get_models
    load_checkpoint(model_, None, None)
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/megatron/training/checkpointing.py", line 1090, in load_checkpoint
    state_dict, checkpoint_name, release, ckpt_type = _load_base_checkpoint(
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/megatron/training/checkpointing.py", line 851, in _load_base_checkpoint
    return _load_global_dist_base_checkpoint(
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/megatron/training/checkpointing.py", line 779, in _load_global_dist_base_checkpoint
    state_dict = dist_checkpointing.load(sharded_state_dict, checkpoint_name, load_strategy, strict=args.dist_ckpt_strictness)
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/megatron/core/dist_checkpointing/serialization.py", line 126, in load
    local_metadata, global_metadata = determine_global_metadata(sharded_state_dict)
  File "/mnt/self-define/zhangyi/home/zy/zjlab-megatron/Megatron-LM-core_r0.9.0/megatron/core/dist_checkpointing/validation.py", line 497, in determine_global_metadata
    global_metadata = [None] * torch.distributed.get_world_size()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1831, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 863, in _get_group_size
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1024, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
Loader exited, exiting saver
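For context, the traceback shows the failure happening inside dist_checkpointing.load(), which calls torch.distributed.get_world_size() while convert.py's loader has never initialized a default process group. Below is a minimal sketch of the kind of single-process initialization the torch_dist load path seems to expect; the gloo backend, the MASTER_ADDR/MASTER_PORT values, and the world size of 1 are my own assumptions for illustration, not something convert.py currently does.

# Minimal sketch, not part of convert.py: set up a one-process default
# group so torch.distributed.get_world_size() can succeed during loading.
# Backend, address, and port here are assumptions for illustration only.
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)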