Followed the tutorial step by step; PPO training gets stuck at CUDA error: device-side assert triggered #54

Open
karl-tao-zhang opened this issue Sep 9, 2023 · 3 comments

Comments

@karl-tao-zhang

Using pad_token, but it is not set yet.
Loading base model for ppo training...
Loading base
Loading LoRA
Loading PPO
WARNING:root:A <class 'peft.peft_model.PeftModelForCausalLM'> model is loaded from '/root/autodl-tmp/LLM/weights/sft_lora', and no v_head weight is found. This IS expected if you are not resuming PPO training.
Loading base model for reward model...
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Some weights of BaichuanForSequenceClassification were not initialized from the model checkpoint at baichuan-inc/baichuan-7B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Starting training
0it [00:00, ?it/s]---------------------
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

0
0it [00:10, ?it/s]
Traceback (most recent call last):
File "rl_training.py", line 331, in <module>
response_tensors = ppo_trainer.generate(
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/ppo_trainer.py", line 446, in generate
return self._generate_batched(
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/ppo_trainer.py", line 503, in _generate_batched
generations = self.accelerator.unwrap_model(self.model).generate(**padded_inputs, **generation_kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/trl/models/modeling_value_head.py", line 198, in generate
return self.pretrained_model.generate(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/peft/peft_model.py", line 975, in generate
outputs = self.base_model.generate(**kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py", line 1648, in generate
return self.sample(
File "/root/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py", line 2730, in sample
outputs = self(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
return module._hf_hook.post_forward(module, output)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 305, in post_forward
output = send_to_device(output, self.input_device, skip_keys=self.skip_keys)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 160, in send_to_device
{
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 161, in <dictcomp>
k: t if k in skip_keys else send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 151, in send_to_device
return honor_type(
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 83, in honor_type
return type(obj)(generator)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 152, in <genexpr>
tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 151, in send_to_device
return honor_type(
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 83, in honor_type
return type(obj)(generator)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 152, in <genexpr>
tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 167, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "rl_training.py", line 364, in <module>
print(question_tensors)
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 426, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 636, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 567, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 327, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 111, in __init__
value_str = "{}".format(value)
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 872, in __format__
return self.item().__format__(format_spec)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
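
A device-side assert raised during `generate` is often, though not necessarily here, caused by a token ID that falls outside the embedding table, for example a pad ID that the model's vocabulary does not contain. The sketch below is one way to check for that on the CPU before any CUDA call. The helper name `find_out_of_range_ids` is hypothetical, the vocabulary size of 64000 is purely illustrative, and it assumes `question_tensors` is a list of 1-D tensors that can be converted with `.tolist()`.

```python
def find_out_of_range_ids(batch, vocab_size):
    """Return (row, position, token_id) for every ID outside [0, vocab_size)."""
    bad = []
    for row, ids in enumerate(batch):
        for pos, token_id in enumerate(ids):
            if not 0 <= token_id < vocab_size:
                bad.append((row, pos, token_id))
    return bad


# Hypothetical toy batch. In the training script this could instead be
# [t.tolist() for t in question_tensors], built before calling generate().
toy_batch = [[1, 5, 9], [2, 64000, 3]]
print(find_out_of_range_ids(toy_batch, vocab_size=64000))
# prints [(1, 1, 64000)]
```

Running the check before `ppo_trainer.generate` matters: as the second traceback shows, once the assert has fired even `print(question_tensors)` fails, because the CUDA context is no longer usable.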

@karl-tao-zhang
Author

karl-tao-zhang commented Sep 9, 2023

CUDA_VISIBLE_DEVICES=0,1,2,3 python rl_training.py \
    --base_model_name baichuan-inc/baichuan-7B \
    --merged_sft_model_path /root/autodl-tmp/LLM/weights/sft_lora \
    --sft_model_lora_path /root/autodl-tmp/LLM/weights/sft_lora \
    --reward_model_lora_path /root/autodl-tmp/LLM/weights/rm_lora \
    --adafactor False \
    --save_freq 10 \
    --output_max_length 256 \
    --batch_size 2 \
    --gradient_accumulation_steps 2 \
    --batched_gen True \
    --ppo_epochs 4 \
    --seed 0 \
    --learning_rate 1e-5 \
    --early_stopping True \
    --output_dir /root/autodl-tmp/LLM/weights/ppo_lora

@karl-tao-zhang
Author

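Four RTX 3090s did not have enough GPU memory, so I switched to four A40s, and that is where the error above appeared.
After hitting the error, I searched the trl issues for related code. Is this the suggested fix?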
tokenizer.eos_token_id = model.config.eos_token_id
tokenizer.pad_token = tokenizer.eos_token
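
A minimal sketch of how those two lines might be applied, with an extra sanity check that the resulting pad ID actually exists in the model's vocabulary. The helper name `reuse_eos_as_pad` is hypothetical, and the code assumes the model config exposes `eos_token_id` and `vocab_size`. The objects in the demo are plain stand-ins with made-up values, not a real tokenizer or config.

```python
from types import SimpleNamespace


def reuse_eos_as_pad(tokenizer, model_config):
    """Apply the two quoted lines, then verify the pad ID is a valid index."""
    tokenizer.eos_token_id = model_config.eos_token_id
    tokenizer.pad_token = tokenizer.eos_token
    pad_id = tokenizer.eos_token_id
    if pad_id is None or not 0 <= pad_id < model_config.vocab_size:
        raise ValueError(f"pad/eos id {pad_id!r} is outside the vocabulary")
    # Extra keyword arguments that could be merged into generation_kwargs.
    return {"pad_token_id": pad_id, "eos_token_id": pad_id}


# Stand-in objects with illustrative values (assumptions, not Baichuan's).
tok = SimpleNamespace(eos_token="</s>", eos_token_id=2, pad_token=None)
cfg = SimpleNamespace(eos_token_id=2, vocab_size=64000)
extra_kwargs = reuse_eos_as_pad(tok, cfg)
print(tok.pad_token, extra_kwargs)
```

In the real script the returned dictionary could be merged into the `generation_kwargs` that `ppo_trainer.generate` forwards to the model. Whether this resolves the multi-GPU failure is not confirmed in this thread.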

@karl-tao-zhang
Author

It only works with a single GPU.
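
For anyone reproducing the single-GPU run, one possible approach is to restrict the visible devices and turn on synchronous kernel launches, as the error message suggests, so the traceback points at the failing call. This is a sketch only: the helper name `configure_cuda_debug_env` is hypothetical, and the variables must be set before PyTorch initializes CUDA, for example at the very top of `rl_training.py`.

```python
import os


def configure_cuda_debug_env(gpu_ids="0", blocking=True, env=None):
    """Set CUDA-related environment variables on `env` (default: os.environ)."""
    env = os.environ if env is None else env
    env["CUDA_VISIBLE_DEVICES"] = gpu_ids    # expose only these GPUs
    if blocking:
        env["CUDA_LAUNCH_BLOCKING"] = "1"    # report kernel errors synchronously
    return env


# Demonstrated on a plain dict so the current process is left untouched.
print(configure_cuda_debug_env("0", env={}))
# prints {'CUDA_VISIBLE_DEVICES': '0', 'CUDA_LAUNCH_BLOCKING': '1'}
```

Calling `configure_cuda_debug_env("0")` with no `env` argument writes to `os.environ` directly. Synchronous launches slow training noticeably, so this is meant for debugging only.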
