[BUG]: Llama3.1 HybridParallelPlugin train failed when pp_size>1 #6110
Comments
Hi, could you share more details about your code? Did you use shardformer itself, or one of our examples?
AutoModelForSequenceClassification
My code is ColossalAI/applications/ColossalChat/examples/training_scripts/train_rm.py, but I use AutoModelForSequenceClassification.
@TongLi3701 Could you please reply to me?
Hi, we are trying to figure it out and will run a test on this. Based on my initial guess, it might be because of the following part. We will need to add
Are you sure?

if inputs_embeds is None:
    inputs_embeds = self.embed_tokens(input_ids)
hidden_states = inputs_embeds
device = hidden_states.device
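To make the failure mode concrete, here is a tiny stand-alone sketch of the stage-dependent branch being discussed, reconstructed from the two tracebacks in this thread (simplified, with assumed names; not the actual shardformer source):

import torch
import torch.nn as nn

# Only the first pipeline stage is supposed to hold the embedding and receive input_ids.
embed_tokens = nn.Embedding(128, 8)

def stage_forward(is_first_stage: bool, input_ids=None, hidden_states=None):
    if is_first_stage:
        # First stage: turn input_ids into hidden_states.
        if hidden_states is None:
            hidden_states = embed_tokens(input_ids)
    else:
        # Later stages expect hidden_states from the previous stage. If hidden_states is
        # None here, we get the AttributeError shown in the bug description; if a later
        # stage instead calls embed_tokens whose weight was never materialized, we get
        # the TypeError in the traceback further down this thread.
        input_shape = hidden_states.shape[:-1]  # cf. "input_shape = hidden_states.shape[:-1]" in the error below
    return hidden_states

stage_forward(True, input_ids=torch.randint(0, 128, (2, 4)))   # first stage: OK
stage_forward(False, hidden_states=torch.randn(2, 4, 8))       # later stage: OK
# stage_forward(False, hidden_states=None)                     # AttributeError: 'NoneType' object has no attribute 'shape'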
Firstly, it seems that:
[rank1]: Traceback (most recent call last):
[rank1]: File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/examples/training_scripts/train_rm.py", line 392, in <module>
[rank1]: train(args)
[rank1]: File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/examples/training_scripts/train_rm.py", line 320, in train
[rank1]: trainer.fit(
[rank1]: File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/trainer/base.py", line 67, in fit
[rank1]: self._train(epoch)
[rank1]: File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/trainer/rm.py", line 133, in _train
[rank1]: reward = self.model(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 220, in forward
[rank1]: return super().forward(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/colossalai/interface/model.py", line 25, in forward
[rank1]: return self.module(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1388, in forward
[rank1]: transformer_outputs = self.model(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/modeling/llama.py", line 99, in llama_model_forward
[rank1]: inputs_embeds = self.embed_tokens(input_ids)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 163, in forward
[rank1]: return F.embedding(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2267, in embedding
[rank1]: return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank1]: TypeError: embedding(): argument 'weight' (position 1) must be Tensor, not NoneType
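For what it's worth, the failing call is easy to reproduce in isolation. A minimal sketch, assuming the non-first pipeline stage ends up with an nn.Embedding whose weight parameter is None:

import torch
import torch.nn as nn

emb = nn.Embedding(128, 8)
emb.weight = None  # simulate a pipeline stage that does not actually hold the embedding weights
emb(torch.tensor([[1, 2, 3]]))
# TypeError: embedding(): argument 'weight' (position 1) must be Tensor, not NoneType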
Thank you, we'll fix it soon.
What’s the progress like?
Could anyone from the ColossalAI team help me? Thanks a lot.
What’s the progress like?
Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?
I substituted the RewardModel in this block:

with init_ctx:
    if args.use_flash_attn:
        model = RewardModel(
            args.pretrain,
            torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
            use_flash_attention_2=True,
        )
        coordinator.print_on_master(msg="Flash-attention enabled successfully")
    else:
        model = RewardModel(
            args.pretrain,
        )
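If I understand the substitution correctly, it amounts to loading a Hugging Face sequence-classification model in place of RewardModel, roughly like this (a sketch: init_ctx and args come from train_rm.py, while num_labels=1 and the dtype handling are assumptions):

import torch
from transformers import AutoModelForSequenceClassification

with init_ctx:
    model = AutoModelForSequenceClassification.from_pretrained(
        args.pretrain,
        num_labels=1,  # single scalar reward score per sequence
        torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
        use_flash_attention_2=args.use_flash_attn,
    )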
You can try to run https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py by setting
May I ask if you have run it?
Is there an existing issue for this bug?
🐛 Describe the bug
pp=2
tp=2
sp=1
zero_stage=0
[rank6]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/modeling/llama.py", line 93, in llama_model_forward
[rank6]: input_shape = hidden_states.shape[:-1]
[rank6]: AttributeError: 'NoneType' object has no attribute 'shape'
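For reference, a plugin setup matching the configuration above would look roughly like this (a sketch against ColossalAI's HybridParallelPlugin; argument names and defaults may differ slightly between versions):

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

colossalai.launch_from_torch()
plugin = HybridParallelPlugin(
    tp_size=2,          # tensor parallel degree
    pp_size=2,          # pipeline parallel degree; > 1 triggers the failure reported here
    sp_size=1,          # sequence parallelism disabled
    zero_stage=0,
    precision="bf16",
    microbatch_size=1,  # needed once pipeline parallelism is enabled
)
booster = Booster(plugin=plugin)
# model, optimizer, etc. would then be wrapped via booster.boost(...)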
Environment
transformers 4.39.3
torch 2.4.0a0+3bcc3cddb5.nv24.7
colossalai 0.4.5