
[BUG]: Llama3.1 HybridParallelPlugin train failed when pp_size>1 #6110

Open

cingtiye opened this issue Nov 2, 2024 · 17 comments
Labels: bug (Something isn't working)

Comments

@cingtiye

cingtiye commented Nov 2, 2024

Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

pp=2
tp=2
sp=1
zero_stage=0

[rank6]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/modeling/llama.py", line 93, in llama_model_forward
[rank6]: input_shape = hidden_states.shape[:-1]
[rank6]: AttributeError: 'NoneType' object has no attribute 'shape'
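
For reference, the parallel setup above corresponds to a HybridParallelPlugin configuration roughly like the following (a minimal sketch based on the ColossalAI booster API; the actual wiring in my training script may differ):

    import colossalai
    from colossalai.booster import Booster
    from colossalai.booster.plugin import HybridParallelPlugin

    # Minimal sketch of the reported parallel configuration: pp=2, tp=2, sp=1, zero_stage=0.
    # Assumes launch via `colossalai run` / torchrun; model, optimizer and dataloader come
    # from the training script and are omitted here.
    colossalai.launch_from_torch()

    plugin = HybridParallelPlugin(
        tp_size=2,
        pp_size=2,
        zero_stage=0,
        precision="bf16",
        microbatch_size=1,  # pp > 1 needs num_microbatches or microbatch_size; this value is an assumption
    )
    booster = Booster(plugin=plugin)
    # model, optimizer, _, dataloader, _ = booster.boost(model, optimizer, dataloader=dataloader)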

Environment

transformers 4.39.3
torch 2.4.0a0+3bcc3cddb5.nv24.7
colossalai 0.4.5

cingtiye added the bug label on Nov 2, 2024
@TongLi3701
Member

Hi, could you share more details about your code with us?

Did you use shardformer directly, or one of our examples?

@cingtiye
Author

cingtiye commented Nov 2, 2024 via email

@cingtiye
Author

cingtiye commented Nov 4, 2024

AutoModelForSequenceClassification

My code is ColossalAI/applications/examples/training_scripts/train_rm.py, but I use LlamaForSequenceClassification as a substitute for RewardModel.

@cingtiye
Author

cingtiye commented Nov 5, 2024

AutoModelForSequenceClassification

My code is ColossalAI/applications/examples/training_scripts/train_rm.py, but I use LlamaForSequenceClassification as a substitute for RewardModel.

@TongLi3701 Could you please reply to me?

@TongLi3701
Member

Hi, we are trying to figure it out.

We will run a test on this. Based on my initial guess, it might be because of the following part:

https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/modeling/llama.py#L90C13-L95

We will need to add hidden_states = inputs_embeds into the else part.
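
Roughly what I have in mind (a sketch only; the branch layout of llama_model_forward is paraphrased here and may not match the actual source):

    # Sketch of the suggested change; surrounding code is paraphrased, not copied.
    if stage_manager.is_first_stage():
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds
    else:
        # proposed addition: make sure hidden_states is set on non-first stages too
        if hidden_states is None:
            hidden_states = inputs_embeds
        input_shape = hidden_states.shape[:-1]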

@cingtiye
Author

cingtiye commented Nov 8, 2024

Hi, we are trying to figure it out.

We will run a test on this. Based on my initial guess, it might be because of the following part:

https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/modeling/llama.py#L90C13-L95

We will need to add hidden_states = inputs_embeds into the else part.

Are you sure?
Line 91 already has hidden_states = inputs_embeds.

  if inputs_embeds is None:
      inputs_embeds = self.embed_tokens(input_ids)
  hidden_states = inputs_embeds
  device = hidden_states.device

@Edenzzzz
Contributor

Edenzzzz commented Nov 10, 2024

Firstly, it seems that ColossalAI/applications/examples/training_scripts/train_rm.py is not found in the main branch. Your error is due to PP stage 2 not receiving input from stage 1. Your case (pp = 2, tp = 2, dp = 2) is indeed covered in unit tests, so you will need to share how your code differs.
To debug, you can use torch.distributed.breakpoint(rank=6) in the PP schedule to check in which case self.recv_forward returns None for input_obj. This will make it easier for us to help you.
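
For illustration, a rough sketch of such a probe (recv_forward and input_obj are the names mentioned above; the exact file and call site inside the pipeline schedule are assumptions and may differ between versions):

    # Hypothetical debugging patch inside the pipeline schedule's forward step,
    # near the point where the stage receives activations from the previous stage.
    import torch.distributed as dist

    input_obj = self.recv_forward()  # tensor(s) sent by the previous PP stage
    if input_obj is None and not self.stage_manager.is_first_stage():
        # Pause only rank 6 so its communication state can be inspected interactively.
        dist.breakpoint(rank=6)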

@cingtiye
Author

Firstly, it seems that ColossalAI/applications/examples/training_scripts/train_rm.py is not found in the main branch. Your error is due to PP stage 2 not receiving input from stage 1. Your case (pp = 2, tp = 2, dp = 2) is indeed covered in unit tests, so you will need to share how your code differs. To debug, you can use torch.distributed.breakpoint(rank=6) in the PP schedule to check in which case self.recv_forward returns None for input_obj. This will make it easier for us to help you.

https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py

@cingtiye
Author

Hi, we are trying to figure it out.

We will run a test on this. Based on my initial guess, it might be because of the following part:

https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/modeling/llama.py#L90C13-L95

We will need to add hidden_states = inputs_embeds into the else part.

[rank1]: Traceback (most recent call last):
[rank1]:   File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/examples/training_scripts/train_rm.py", line 392, in <module>
[rank1]:     train(args)
[rank1]:   File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/examples/training_scripts/train_rm.py", line 320, in train
[rank1]:     trainer.fit(
[rank1]:   File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/trainer/base.py", line 67, in fit
[rank1]:     self._train(epoch)
[rank1]:   File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/trainer/rm.py", line 133, in _train
[rank1]:     reward = self.model(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 220, in forward
[rank1]:     return super().forward(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/interface/model.py", line 25, in forward
[rank1]:     return self.module(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1388, in forward
[rank1]:     transformer_outputs = self.model(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/modeling/llama.py", line 99, in llama_model_forward
[rank1]:     inputs_embeds = self.embed_tokens(input_ids)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 163, in forward
[rank1]:     return F.embedding(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2267, in embedding
[rank1]:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank1]: TypeError: embedding(): argument 'weight' (position 1) must be Tensor, not NoneType

@flybird11111
Contributor

Thank you, we'll fix it soon.

@cingtiye
Author

Thank you, we'll fix it soon.

What’s the progress like?

@cingtiye
Author

Thank you, we'll fix it soon.

Could any of the ColossalAI team help me? Thanks a lot.

@cingtiye
Author

Thank you, we'll fix it soon.

What’s the progress like?

@flybird11111
Contributor

LlamaForSequenceClassification

Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?

@cingtiye
Author

LlamaForSequenceClassification

Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?

I substituted RewardModel with LlamaForSequenceClassification (a sketch of the substitution is shown after the snippet below). In fact, it doesn't run correctly even if I don't substitute RewardModel.

https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py

    with init_ctx:
        if args.use_flash_attn:
            model = RewardModel(
                args.pretrain,
                torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
                use_flash_attention_2=True,
            )
            coordinator.print_on_master(msg="Flash-attention enabled successfully")
        else:
            model = RewardModel(
                args.pretrain,
            )
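
For reference, this is roughly how I swap in LlamaForSequenceClassification (a sketch only; num_labels=1 for a scalar reward head and the attn_implementation switch are my assumptions, while the rest of train_rm.py is assumed unchanged):

    import torch
    from transformers import LlamaForSequenceClassification

    # Sketch of the substitution described above; only the model construction changes.
    with init_ctx:
        model = LlamaForSequenceClassification.from_pretrained(
            args.pretrain,
            num_labels=1,  # single scalar reward score (assumption)
            torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
            attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
        )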

@cingtiye
Author

LlamaForSequenceClassification

Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?

You can try to run https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py by setting pp > 1 on one node.

@cingtiye
Author

LlamaForSequenceClassification

Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?

May I ask whether you have run train_rm.py? Did you encounter the same issue as I did when pp > 1?
