
[BUG]: Llama3.1 HybridParallelPlugin train failed when pp_size>1 #6110

Open

cingtiye opened this issue Nov 2, 2024 · 17 comments
Labels: bug (Something isn't working)

Comments

@cingtiye

cingtiye commented Nov 2, 2024

Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

pp=2
tp=2
sp=1
zero_stage=0

[rank6]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/modeling/llama.py", line 93, in llama_model_forward
[rank6]: input_shape = hidden_states.shape[:-1]
[rank6]: AttributeError: 'NoneType' object has no attribute 'shape'
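
For reference, the parallel setup above corresponds to a HybridParallelPlugin configuration roughly like the following (a minimal sketch based on the ColossalAI booster API; the actual wiring in my training script may differ):

    import colossalai
    from colossalai.booster import Booster
    from colossalai.booster.plugin import HybridParallelPlugin

    # Minimal sketch of the reported parallel configuration: pp=2, tp=2, sp=1, zero_stage=0.
    # Assumes launch via `colossalai run` / torchrun; model, optimizer and dataloader come
    # from the training script and are omitted here.
    colossalai.launch_from_torch()

    plugin = HybridParallelPlugin(
        tp_size=2,
        pp_size=2,
        zero_stage=0,
        precision="bf16",
        microbatch_size=1,  # pp > 1 needs num_microbatches or microbatch_size; this value is an assumption
    )
    booster = Booster(plugin=plugin)
    # model, optimizer, _, dataloader, _ = booster.boost(model, optimizer, dataloader=dataloader)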

Environment

transformers 4.39.3
torch 2.4.0a0+3bcc3cddb5.nv24.7
colossalai 0.4.5

cingtiye added the bug label on Nov 2, 2024
@TongLi3701
Member

Hi, could you share more details about your code with us?

Did you use shardformer directly, or one of our examples?

@cingtiye
Author

cingtiye commented Nov 2, 2024 via email

@cingtiye
Author

cingtiye commented Nov 4, 2024

AutoModelForSequenceClassification

My code is ColossalAI/applications/examples/training_scripts/train_rm.py, but I use LlamaForSequenceClassification as a substitute for RewardModel.

@cingtiye
Author

cingtiye commented Nov 5, 2024

AutoModelForSequenceClassification

My code is ColossalAI/applications/examples/training_scripts/train_rm.py, but I use LlamaForSequenceClassification as a substitute for RewardModel.

@TongLi3701 Could you please reply to me?

@TongLi3701
Member

Hi, we are trying to figure it out.

We will run a test on this. Based on my initial guess, it might be because of the following part:

https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/modeling/llama.py#L90C13-L95

We will need to add hidden_states = inputs_embeds into the else part.
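
Roughly what I have in mind (a sketch only; the branch layout of llama_model_forward is paraphrased here and may not match the actual source):

    # Sketch of the suggested change; surrounding code is paraphrased, not copied.
    if stage_manager.is_first_stage():
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds
    else:
        # proposed addition: make sure hidden_states is set on non-first stages too
        if hidden_states is None:
            hidden_states = inputs_embeds
        input_shape = hidden_states.shape[:-1]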

@cingtiye
Author

cingtiye commented Nov 8, 2024

Hi, we are trying to figure it out.

We will run a test on this. Based on my initial guess, it might be because of the following part:

https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/modeling/llama.py#L90C13-L95

We will need to add hidden_states = inputs_embeds into the else part.

Are you sure?
Line 91 already has hidden_states = inputs_embeds.

  if inputs_embeds is None:
      inputs_embeds = self.embed_tokens(input_ids)
  hidden_states = inputs_embeds
  device = hidden_states.device

@Edenzzzz
Contributor

Edenzzzz commented Nov 10, 2024

Firstly, it seems that ColossalAI/applications/examples/training_scripts/train_rm.py is not found in the main branch. Your error is due to PP stage 2 not receiving input from stage 1. Your case (pp = 2, tp = 2, dp = 2) is indeed covered in unit tests, so you will need to share how your code differs.
To debug, you can use torch.distributed.breakpoint(rank=6) in the PP schedule to check in which case self.recv_forward returns None for input_obj. This will make it easier for us to help you.
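
For illustration, a rough sketch of such a probe (recv_forward and input_obj are the names mentioned above; the exact file and call site inside the pipeline schedule are assumptions and may differ between versions):

    # Hypothetical debugging patch inside the pipeline schedule's forward step,
    # near the point where the stage receives activations from the previous stage.
    import torch.distributed as dist

    input_obj = self.recv_forward()  # tensor(s) sent by the previous PP stage
    if input_obj is None and not self.stage_manager.is_first_stage():
        # Pause only rank 6 so its communication state can be inspected interactively.
        dist.breakpoint(rank=6)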

@cingtiye
Author

Firstly, it seems that ColossalAI/applications/examples/training_scripts/train_rm.py is not found in the main branch. Your error is due to PP stage 2 not receiving input from stage 1. Your case (pp = 2, tp = 2, dp = 2) is indeed covered in unit tests, so you will need to share how your code differs. To debug, you can use torch.distributed.breakpoint(rank=6) in the PP schedule to check in which case self.recv_forward returns None for input_obj. This will make it easier for us to help you.

https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py

@cingtiye
Author

Hi, we are trying to figure it out.

We will run a test on this. Based on my initial guess, it might be because of the following part:

https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/modeling/llama.py#L90C13-L95

We will need to add hidden_states = inputs_embeds into the else part.

[rank1]: Traceback (most recent call last):
[rank1]:   File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/examples/training_scripts/train_rm.py", line 392, in <module>
[rank1]:     train(args)
[rank1]:   File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/examples/training_scripts/train_rm.py", line 320, in train
[rank1]:     trainer.fit(
[rank1]:   File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/trainer/base.py", line 67, in fit
[rank1]:     self._train(epoch)
[rank1]:   File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/trainer/rm.py", line 133, in _train
[rank1]:     reward = self.model(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 220, in forward
[rank1]:     return super().forward(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/interface/model.py", line 25, in forward
[rank1]:     return self.module(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1388, in forward
[rank1]:     transformer_outputs = self.model(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/modeling/llama.py", line 99, in llama_model_forward
[rank1]:     inputs_embeds = self.embed_tokens(input_ids)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 163, in forward
[rank1]:     return F.embedding(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2267, in embedding
[rank1]:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank1]: TypeError: embedding(): argument 'weight' (position 1) must be Tensor, not NoneType

@flybird11111
Contributor

Thank you, we'll fix it soon.

@cingtiye
Author

Thank you, we'll fix it soon.

What’s the progress like?

@cingtiye
Author

Thank you, we'll fix it soon.

Could any of the ColossalAI team help me? Thanks a lot.

@cingtiye
Author

Thank you, we'll fix it soon.

What’s the progress like?

@flybird11111
Contributor

LlamaForSequenceClassification

Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?

@cingtiye
Author

LlamaForSequenceClassification

Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?

I substituted RewardModel with LlamaForSequenceClassification (a sketch of the substitution is shown after the snippet below). In fact, it doesn't run correctly even if I don't substitute RewardModel.

https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py

    with init_ctx:
        if args.use_flash_attn:
            model = RewardModel(
                args.pretrain,
                torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
                use_flash_attention_2=True,
            )
            coordinator.print_on_master(msg="Flash-attention enabled successfully")
        else:
            model = RewardModel(
                args.pretrain,
            )
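
For reference, this is roughly how I swap in LlamaForSequenceClassification (a sketch only; num_labels=1 for a scalar reward head and the attn_implementation switch are my assumptions, while the rest of train_rm.py is assumed unchanged):

    import torch
    from transformers import LlamaForSequenceClassification

    # Sketch of the substitution described above; only the model construction changes.
    with init_ctx:
        model = LlamaForSequenceClassification.from_pretrained(
            args.pretrain,
            num_labels=1,  # single scalar reward score (assumption)
            torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
            attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
        )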

@cingtiye
Author

LlamaForSequenceClassification

Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?

You can try to run https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py by setting pp > 1 on one node.

@cingtiye
Author

LlamaForSequenceClassification

Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?

May I ask whether you have run train_rm.py? Did you encounter the same issue as I did when pp > 1?
