
[rank4]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float] #16

Open · sunzhufeng12345 opened this issue Aug 26, 2024 · 13 comments
Labels: bug (Something isn't working)

@sunzhufeng12345

Using the officially provided scripts and dataset, I ran python pre_tokenize_glm4.py and then
python sort_and_group.py --group_size 8 --train_file /home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/datasets
and obtained attention_masks_pack.json, inputs_pack.npy, and the other output files.
When I then run the training script ./glm4_longwriter.sh, I hit a ValidationError related to the DeepSpeedZeroConfig configuration. The error is caused by an invalid input type for stage3_prefetch_bucket_size: an integer is expected, but a float is received.

Training log:
[2024-08-26 09:58:48,719] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-08-26 09:58:49,793] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 09:58:50,631] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:50,737] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:50,784] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:50,799] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:51,320] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 09:58:52,754] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:52,859] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:53,039] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:53,301] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:59:10,505] [INFO] [partition_parameters.py:345:exit] finished initializing model - num_params = 283, num_elems = 9.40B
Loading checkpoint shards: 100%|██████████| 10/10 [00:11<00:00, ~1.15–1.18s/it] (repeated once per rank, ×8)
loading data... (repeated ×8, once per rank)
finish loading data
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.158402919769287 seconds
[rank4]: Traceback (most recent call last):
[rank4]: File "/home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/main.py", line 130, in
[rank4]: train()
[rank4]: File "/home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/main.py", line 126, in train
[rank4]: trainer.train(resume_from_checkpoint=False)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
[rank4]: return inner_training_loop(
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/transformers/trainer.py", line 2095, in _inner_training_loop
[rank4]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/accelerate/accelerator.py", line 1303, in prepare
[rank4]: result = self._prepare_deepspeed(*args)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
[rank4]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/__init__.py", line 179, in initialize
[rank4]: config_class = DeepSpeedConfig(config, mpu, mesh_device=mesh_device)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 797, in __init__
[rank4]: self._initialize_params(copy.copy(self._param_dict))
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 817, in _initialize_params
[rank4]: self.zero_config = get_zero_config(param_dict)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/zero/config.py", line 71, in get_zero_config
[rank4]: return DeepSpeedZeroConfig(**zero_config_dict)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/config_utils.py", line 57, in __init__
[rank4]: super().__init__(**data)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/pydantic/main.py", line 193, in __init__
[rank4]: self.__pydantic_validator__.validate_python(data, self_instance=self)
[rank4]: pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
[rank4]: stage3_prefetch_bucket_size
[rank4]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
[rank4]: For further information visit https://errors.pydantic.dev/2.8/v/int_from_float
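
For context: 15099494.4 is exactly 0.9 × 4096², which matches how the Hugging Face DeepSpeed integration fills in "auto" for stage3_prefetch_bucket_size from the model's hidden size (4096 for GLM-4-9B). Newer DeepSpeed releases validate this field with pydantic v2 as a strict integer, so the fractional value is rejected. A minimal sketch of one possible workaround, assuming the value is rounded to an int before DeepSpeed sees the config (this is not the repo's code):

```python
# Sketch only: reproduce the "auto"-derived value and cast it to int so that
# pydantic's strict integer validation in DeepSpeedZeroConfig accepts it.
hidden_size = 4096                               # GLM-4-9B hidden_size
auto_value = 0.9 * hidden_size * hidden_size     # 15099494.4 -> rejected as a float

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": int(auto_value),  # 15099494 -> accepted
    },
}
print(ds_config["zero_optimization"]["stage3_prefetch_bucket_size"])
```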

@sunzhufeng12345
Author

Also, if I change "stage3_prefetch_bucket_size": "auto" in stage3.json to "stage3_prefetch_bucket_size": 15099494, running it produces the following error:
[2024-08-26 10:00:37,155] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,222] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,235] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,236] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,301] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,331] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,358] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,386] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:38,665] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,716] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,783] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,791] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,810] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,868] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,891] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,891] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-08-26 10:00:39,001] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:40,846] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:40,934] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:41,119] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:41,127] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:41,138] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:41,236] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:41,240] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:41,249] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:01:00,375] [INFO] [partition_parameters.py:345:exit] finished initializing model - num_params = 283, num_elems = 9.40B
Loading checkpoint shards: 100%|██████████| 10/10 [00:11<00:00, ~1.16–1.23s/it] (repeated once per rank, ×8)
loading data... (repeated ×8, once per rank)
finish loading data
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.942378044128418 seconds
finish loading data (repeated ×7 for the remaining ranks)
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.7372360229492188 seconds
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.803518056869507 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.814899444580078 seconds
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.760847568511963 seconds
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.765498161315918 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.718514919281006 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.747807502746582 seconds
Parameter Offload: Total persistent parameters: 516096 in 121 params
wandb: W&B API key is configured. Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.17.7
wandb: Run data is saved locally in /home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/wandb/run-20240826_100248-p37tgoc6
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run glm4_longwriter_szf
wandb: ⭐️ View project at https://wandb.ai/beijingdaxue/huggingface
wandb: 🚀 View run at https://wandb.ai/beijingdaxue/huggingface/runs/p37tgoc6
0%| | 0/2752 [00:00<?, ?it/s][rank7]: Traceback (most recent call last):
[rank7]: File "/home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/main.py", line 130, in
[rank7]: train()
[rank7]: File "/home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/main.py", line 126, in train
[rank7]: trainer.train(resume_from_checkpoint=False)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
[rank7]: return inner_training_loop(
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank7]: tr_loss_step = self.training_step(model, inputs)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/transformers/trainer.py", line 3318, in training_step
[rank7]: loss = self.compute_loss(model, inputs)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/transformers/trainer.py", line 3363, in compute_loss
[rank7]: outputs = model(**inputs)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]: return self._call_impl(*args, **kwargs)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank7]: return forward_call(*args, **kwargs)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank7]: ret_val = func(*args, **kwargs)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1899, in forward
[rank7]: loss = self.module(*inputs, **kwargs)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]: return self._call_impl(*args, **kwargs)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank7]: result = forward_call(*args, **kwargs)
[rank7]: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 994, in forward
[rank7]: transformer_outputs = self.transformer(
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]: return self._call_impl(*args, **kwargs)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank7]: result = forward_call(*args, **kwargs)
[rank7]: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 882, in forward
[rank7]: full_attention_mask = self.get_masks(input_ids, past_key_values, padding_mask=attention_mask)
[rank7]: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 784, in get_masks
[rank7]: full_attention_mask = full_attention_mask * padding_mask.unsqueeze(1)

@badarrrr

I also ran into this:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1855, in forward
loss = self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 994, in forward
transformer_outputs = self.transformer(
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 882, in forward
full_attention_mask = self.get_masks(input_ids, past_key_values, padding_mask=attention_mask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 784, in get_masks
full_attention_mask = full_attention_mask * padding_mask.unsqueeze(1)
~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (32768) must match the size of tensor b (6) at non-singleton dimension 1
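
This broadcast failure looks consistent with the packed data format: get_masks in the stock modeling_chatglm.py treats attention_mask as a per-token (batch, seq_len) padding mask, whereas the packed training files apparently carry a short per-example mask (length 6 here) describing the packed sub-sequences, which is presumably why the maintainers ask below to switch to the patched modeling_chatglm.py from patch/. A scaled-down illustration with hypothetical sizes (the real ones are 32768 and 6):

```python
import torch

# Hypothetical, scaled-down stand-ins for the sizes in the log above
# (seq_len = 32768, packed mask of length 6).
seq_len, packed_len = 8, 3
full_attention_mask = torch.ones(1, seq_len, seq_len)  # (batch, seq, seq), as built in get_masks
padding_mask = torch.ones(packed_len)                   # short packed-format mask, not per-token

# Same multiplication as in get_masks; raises
# "The size of tensor a (8) must match the size of tensor b (3) at non-singleton dimension 1"
full_attention_mask * padding_mask.unsqueeze(1)
```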

@sunzhufeng12345
Author

> I also ran into this: … RuntimeError: The size of tensor a (32768) must match the size of tensor b (6) at non-singleton dimension 1

Yes, I'm stuck at the same step now, and my error is the same as yours.

@badarrrr

(T_T)

@bys0318
Member

bys0318 commented Aug 28, 2024

The GLM-4-9B training code we currently provide requires transformers==4.33.0; a newer transformers version may cause errors. To support packing training, please replace the original model's modeling_chatglm.py with the modeling_chatglm.py provided under patch/.
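
A minimal sketch of the suggested setup (the model path is a placeholder, not taken from the repo):

```python
# Sketch only: pin transformers and overwrite the local model copy's
# modeling_chatglm.py with the patched file from patch/.
# Shell equivalent: pip install transformers==4.33.0
#                   cp patch/modeling_chatglm.py /path/to/glm-4-9b/
import shutil

model_dir = "/path/to/glm-4-9b"  # placeholder: local snapshot of the model
shutil.copy("patch/modeling_chatglm.py", f"{model_dir}/modeling_chatglm.py")
```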

@sunzhufeng12345
Author

> The GLM-4-9B training code we currently provide requires transformers==4.33.0; a newer transformers version may cause errors. To support packing training, please replace the original model's modeling_chatglm.py with the modeling_chatglm.py provided under patch/.

I have now switched to 4.33.0, and modeling_chatglm.py has been replaced as well, but I get the following error:
[rank7]: Traceback (most recent call last):
[rank7]: File "/home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/main.py", line 130, in
[rank7]: train()
[rank7]: File "/home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/main.py", line 110, in train
[rank7]: model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path,
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
[rank7]: return model_class.from_pretrained(
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2954, in from_pretrained
[rank7]: model = cls(config, *model_args, **model_kwargs)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
[rank7]: f(module, *args, **kwargs)
[rank7]: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 919, in __init__
[rank7]: self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
[rank7]: f(module, *args, **kwargs)
[rank7]: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 846, in __init__
[rank7]: self.encoder = init_method(GLMTransformer, config, **init_kwargs)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/torch/nn/utils/init.py", line 54, in skip_init
[rank7]: return module_cls(*args, **kwargs).to_empty(device=final_device)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
[rank7]: f(module, *args, **kwargs)
[rank7]: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 676, in __init__
[rank7]: self.layers = torch.nn.ModuleList([build_layer(i + 1) for i in range(self.num_layers)])
[rank7]: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 676, in
[rank7]: self.layers = torch.nn.ModuleList([build_layer(i + 1) for i in range(self.num_layers)])
[rank7]: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 674, in build_layer
[rank7]: return GLMBlock(config, layer_number, device=device)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
[rank7]: f(module, *args, **kwargs)
[rank7]: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 607, in __init__
[rank7]: self.self_attention = SelfAttention(config, layer_number, device=device)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
[rank7]: f(module, *args, **kwargs)
[rank7]: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 416, in __init__
[rank7]: self.core_attention = CORE_ATTENTION_CLASSES[config._attn_implementation](config, self.layer_number)
[rank7]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/transformers/configuration_utils.py", line 261, in __getattribute__
[rank7]: return super().__getattribute__(key)
[rank7]: AttributeError: 'ChatGLMConfig' object has no attribute '_attn_implementation'. Did you mean: 'attn_implementation'?

@bys0318
Member

bys0318 commented Aug 28, 2024

It looks like the replacement didn't actually take effect. The modeling_chatglm.py we use for training does not contain this line: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 416, in __init__
[rank7]: self.core_attention = CORE_ATTENTION_CLASSES[config._attn_implementation](config, self.layer_number). That line only exists in the original file from the HF hub.

@badarrrr

Does training support glm-4-9b-chat, as opposed to glm-4-9b?

@bys0318
Member

bys0318 commented Aug 28, 2024

We recommend starting from the glm-4-9b (base) model and doing mixed training (general SFT data + the LongWriter-6k data). Training directly from glm-4-9b-chat gives markedly worse results.

@bys0318 bys0318 self-assigned this Aug 28, 2024
@sunzhufeng12345
Author

> It looks like the replacement didn't actually take effect. The modeling_chatglm.py we use for training does not contain this line: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 416, in __init__ [rank7]: self.core_attention = CORE_ATTENTION_CLASSES[config._attn_implementation](config, self.layer_number). That line only exists in the original file from the HF hub.

I checked, and that is indeed the case: even after replacing the original file, running the training script still uses the original modeling_chatglm.py.

@bys0318
Member

bys0318 commented Aug 29, 2024

> It looks like the replacement didn't actually take effect. The modeling_chatglm.py we use for training does not contain this line: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 416, in __init__ [rank7]: self.core_attention = CORE_ATTENTION_CLASSES[config._attn_implementation](config, self.layer_number). That line only exists in the original file from the HF hub.

> I checked, and that is indeed the case: even after replacing the original file, running the training script still uses the original modeling_chatglm.py.

You need to pass trust_remote_code=True when loading the model.
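
For illustration, a hedged sketch of such a load (the path is a placeholder, not the repo's main.py). Note that the tracebacks above show the modeling file being executed from ~/.cache/huggingface/modules/transformers_modules/, so a stale cached copy there can shadow a freshly patched checkpoint directory:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/path/to/glm-4-9b"  # placeholder: local directory containing the patched modeling_chatglm.py

# trust_remote_code=True tells transformers to load the modeling_chatglm.py that
# ships with the checkpoint instead of refusing or using built-in model code.
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
```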

@badarrrr

Traceback (most recent call last):
File "/gemini/code/train/main.py", line 130, in
train()
File "/gemini/code/train/main.py", line 126, in train
trainer.train(resume_from_checkpoint=False)
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1948, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2289, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 3328, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 3373, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1855, in forward
loss = self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 803, in forward
transformer_outputs = self.transformer(
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 707, in forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 541, in forward
layer_ret = torch.utils.checkpoint.checkpoint(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 458, in checkpoint
ret = function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 454, in forward
attention_output, kv_cache = self.self_attention(
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 317, in forward
query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 145, in apply_rotary_pos_emb
rope_cache = rope_cache[:sq]
xshaped = x.reshape(sq, -1, np, rot_dim // 2, 2)
rope_cache = rope_cache.view(sq, -1, 1, xshaped.size(3), 2)
~~~~~~~~~~~~~~~ <--- HERE
x_out2 = torch.stack(
[
RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288

I switched to the glm-4-9b model and also replaced modeling_chatglm.py, but now a new error is raised.
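
Just checking the arithmetic in that error: the view asks for 32 × 2 = 64 values per position, while the rope cache only holds 524288 / 32768 = 16 values per position, a factor-of-4 mismatch:

```python
# Illustrative check of the sizes quoted in the RuntimeError above.
sq = 32768                  # sequence length in the failing reshape
cache_elems = 524288        # total elements in rope_cache per the error message
print(cache_elems // sq)    # 16 values available per position
print(32 * 2)               # 64 values per position requested by the view
```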

@bys0318 bys0318 added the bug Something isn't working label Aug 29, 2024
@bys0318
Member

bys0318 commented Sep 3, 2024

@sunzhufeng12345 @badarrrr Please check whether the FAQ in our README resolves the problems you ran into. Sorry for the long wait.
