
bigscience / bloomz-7b1 finetune error #1426

Open
11989890 opened this issue Oct 16, 2024 · 3 comments
Labels
bug Something isn't working


@11989890

System Info

optimum           1.21.4
optimum-habana    1.14.0.dev0
transformers      4.45.2

+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.18.0-fw-53.1.1.1          |
| Driver Version:                                     1.18.0-ee698fb          |
|-------------------------------+----------------------+----------------------+

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

1. Download the bigscience/bloomz-7b1 weights from https://huggingface.co/bigscience/bloomz-7b1

2. Install the example requirements:

cd optimum-habana/examples/language-modeling
pip install -r requirements.txt

3. Launch fine-tuning:

PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 \
python ../gaudi_spawn.py \
--use_deepspeed --world_size 8 run_clm.py \
--model_name_or_path  /ai_workdir/models/bloomz-7b1 \
--dataset_name  tatsu-lab/alpaca \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 1 \
--do_train \
--do_eval \
--output_dir /ai_workdir/models/bloomz-7b1-clm \
--use_habana \
--use_lazy_mode \
--gradient_checkpointing \
--throughput_warmup_steps 3 \
--deepspeed ./llama2_ds_zero3_config.json \
--gaudi_config_name gaudi_config.json \
--trust_remote_code True \
--overwrite_output_dir \
--block_size 4096 \
--save_strategy epoch
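The exact contents of the llama2_ds_zero3_config.json referenced above are not attached to this issue. For context, a typical ZeRO stage-3 config for these language-modeling examples looks roughly like the following sketch (all values here are illustrative assumptions, not the reporter's actual file):

```python
# Illustrative sketch only: writes an assumed ZeRO-3 DeepSpeed config,
# not the reporter's actual llama2_ds_zero3_config.json.
import json

ds_config = {
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "zero_optimization": {
        # Stage 3 partitions parameters, gradients, and optimizer states
        # across ranks; parameters are all-gathered on demand per module,
        # which is the code path that fails in the traceback below.
        "stage": 3,
        "overlap_comm": False,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("llama2_ds_zero3_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```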

4. The run fails with the following error log:

[rank4]: Traceback (most recent call last):
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank4]:     return func(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3208, in all_gather_into_tensor
[rank4]:     work = group._allgather_base(output_tensor, input_tensor, opts)
[rank4]: RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure].

[rank4]: During handling of the above exception, another exception occurred:

[rank4]: Traceback (most recent call last):
[rank4]:   File "/ai_workdir/optimum-habana/examples/language-modeling/run_clm.py", line 695, in <module>
[rank4]:     main()
[rank4]:   File "/ai_workdir/optimum-habana/examples/language-modeling/run_clm.py", line 662, in main
[rank4]:     metrics = trainer.evaluate()
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1748, in evaluate
[rank4]:     output = eval_loop(
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1904, in evaluation_loop
[rank4]:     losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 2110, in prediction_step
[rank4]:     raise error
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 2087, in prediction_step
[rank4]:     loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3532, in compute_loss
[rank4]:     outputs = model(**inputs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
[rank4]:     return self._call_impl(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
[rank4]:     result = forward_call(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/bloom/modeling_bloom.py", line 559, in forward
[rank4]:     transformer_outputs = self.transformer(
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
[rank4]:     return self._call_impl(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
[rank4]:     result = forward_call(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/bloom/modeling_bloom.py", line 438, in gaudi_bloom_model_forward
[rank4]:     outputs = block(
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
[rank4]:     return self._call_impl(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
[rank4]:     result = forward_call(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/bloom/modeling_bloom.py", line 246, in gaudi_bloom_block_forward
[rank4]:     attn_outputs = self.self_attention(
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
[rank4]:     return self._call_impl(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
[rank4]:     result = forward_call(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/bloom/modeling_bloom.py", line 205, in gaudi_bloom_attention_forward
[rank4]:     output_tensor = self.dense(context_layer)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
[rank4]:     return self._call_impl(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1595, in _call_impl
[rank4]:     args_result = hook(self, args)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank4]:     ret_val = func(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 278, in _pre_forward_module_hook
[rank4]:     self.pre_sub_module_forward_function(module)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank4]:     return func(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 452, in pre_sub_module_forward_function
[rank4]:     param_coordinator.fetch_sub_module(sub_module, forward=True)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank4]:     return fn(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank4]:     ret_val = func(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank4]:     return func(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 291, in fetch_sub_module
[rank4]:     self.__all_gather_params(params_to_fetch, forward)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank4]:     ret_val = func(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 435, in __all_gather_params
[rank4]:     self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 464, in __all_gather_params_
[rank4]:     handle = param_group[0].all_gather_coalesced(param_group, quantize=quantize)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank4]:     ret_val = func(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1242, in all_gather_coalesced
[rank4]:     handles = _dist_allgather_fn(
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 95, in _dist_allgather_fn
[rank4]:     return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank4]:     ret_val = func(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 320, in allgather_fn
[rank4]:     return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank4]:     return func(*args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 305, in all_gather_into_tensor
[rank4]:     return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/torch.py", line 205, in all_gather_into_tensor
[rank4]:     return self.all_gather_function(output_tensor=output_tensor,
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank4]:     msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 54, in _get_msg_dict
[rank4]:     "args": f"{args}, {kwargs}",
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 473, in __repr__
[rank4]:     return torch._tensor_str._str(self, tensor_contents=tensor_contents)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 698, in _str
[rank4]:     return _str_intern(self, tensor_contents=tensor_contents)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 618, in _str_intern
[rank4]:     tensor_str = _tensor_str(self, indent)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 350, in _tensor_str
[rank4]:     formatter = _Formatter(get_summarized_data(self) if summarize else self)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 138, in __init__
[rank4]:     nonzero_finite_vals = torch.masked_select(
[rank4]: RuntimeError: [Rank:4] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread...
[rank4]: Check $HABANA_LOGS/ for detailsGraph compile failed. synStatus=synStatus 26 [Generic failure].
[rank4]: [Rank:4] Habana exception raised from compile at graph.cpp:599
(Ranks 5, 6, 2, etc. fail with identical tracebacks and the same "Graph compile failed. synStatus=synStatus 26 [Generic failure]" error.)
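The "During handling of the above exception, another exception occurred" layering in the log comes from torch's c10d logging wrapper: the all-gather raises the graph-compile error first, and then the wrapper's attempt to format the tensor arguments touches the device again and raises the PT_BRIDGE error, masking the original failure. A minimal, hardware-free Python sketch of that chaining pattern (all names here are hypothetical stand-ins, not the real torch/DeepSpeed APIs):

```python
# Hypothetical stand-ins that mimic the exception chaining seen in the log.

class DeviceTensor:
    """Stand-in for an HPU tensor whose repr() re-touches a broken device."""
    def __repr__(self):
        # Mirrors torch._tensor_str calling masked_select, which re-triggers
        # the failed graph compile on the device.
        raise RuntimeError("FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread...")

def all_gather(tensor):
    # Stand-in for the collective that fails during graph compilation.
    raise RuntimeError("Graph compile failed. synStatus=synStatus 26 [Generic failure].")

def logged_all_gather(tensor):
    # Stand-in for the c10d logging wrapper: on failure it builds a message
    # dict with f"{args}", whose repr() raises a second, chained exception.
    try:
        return all_gather(tensor)
    except Exception:
        msg = f"args: {tensor}"  # repr() raises here, chaining onto the first error
        raise

caught = None
try:
    logged_all_gather(DeviceTensor())
except RuntimeError as e:
    caught = e  # the PT_BRIDGE error; the compile failure is its __context__
```

So when triaging, the root cause to look at is the first error (the graph compile failure during the ZeRO-3 all-gather), not the tensor-formatting crash that follows it.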

Expected behavior

Fine-tuning of bigscience/bloomz-7b1 should run successfully.

@11989890 11989890 added the bug Something isn't working label Oct 16, 2024
@regisss
Collaborator

regisss commented Oct 16, 2024

Is this on Gaudi2? Did it use to work with SynapseAI 1.17?

@11989890
Author

On Gaudi2D.

apt list --installed habanalabs*

Listing... Done
habanalabs-container-runtime/jammy,now 1.18.0-524 amd64 [installed]
habanalabs-dkms/jammy,now 1.18.0-524 all [installed]
habanalabs-firmware-odm/jammy,now 1.18.0-524 amd64 [installed]
habanalabs-firmware-tools/jammy,now 1.18.0-524 amd64 [installed]
habanalabs-firmware/jammy,now 1.18.0-524 amd64 [installed]
habanalabs-graph/jammy,now 1.18.0-524 amd64 [installed]
habanalabs-qual/jammy,now 1.18.0-524 amd64 [installed]
habanalabs-rdma-core/jammy,now 1.18.0-524 all [installed]
habanalabs-thunk/jammy,now 1.18.0-524 all [installed]

hl-smi -L | grep SPI

    Firmware [SPI] Version          : Preboot version hl-gaudi2-1.18.0-fw-53.1.1-sec-9 (Oct 02 2024 - 11:52:39)
    Firmware [SPI] Version          : Preboot version hl-gaudi2-1.18.0-fw-53.1.1-sec-9 (Oct 02 2024 - 11:52:39)
    Firmware [SPI] Version          : Preboot version hl-gaudi2-1.18.0-fw-53.1.1-sec-9 (Oct 02 2024 - 11:52:39)
    Firmware [SPI] Version          : Preboot version hl-gaudi2-1.18.0-fw-53.1.1-sec-9 (Oct 02 2024 - 11:52:39)
    Firmware [SPI] Version          : Preboot version hl-gaudi2-1.18.0-fw-53.1.1-sec-9 (Oct 02 2024 - 11:52:39)
    Firmware [SPI] Version          : Preboot version hl-gaudi2-1.18.0-fw-53.1.1-sec-9 (Oct 02 2024 - 11:52:39)
    Firmware [SPI] Version          : Preboot version hl-gaudi2-1.18.0-fw-53.1.1-sec-9 (Oct 02 2024 - 11:52:39)
    Firmware [SPI] Version          : Preboot version hl-gaudi2-1.18.0-fw-53.1.1-sec-9 (Oct 02 2024 - 11:52:39)

@regisss
Collaborator

regisss commented Dec 16, 2024

@11989890 Is it still an issue? And when does the error happen, during training or during evaluation?
