### System Info

```
optimum          1.21.4
optimum-habana   1.14.0.dev0
transformers     4.45.2

HL-SMI Version:  hl-1.18.0-fw-53.1.1.1
Driver Version:  1.18.0-ee698fb
```
### Reproduction (using the official examples folder)
1. Download the bigscience/bloomz-7b1 weights from https://huggingface.co/bigscience/bloomz-7b1
2. Install the example requirements:

```shell
cd optimum-habana/examples/language-modeling
pip install -r requirements.txt
```
3. Launch fine-tuning with DeepSpeed ZeRO-3 on 8 HPUs:

```shell
PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 \
python ../gaudi_spawn.py \
    --use_deepspeed --world_size 8 run_clm.py \
    --model_name_or_path /ai_workdir/models/bloomz-7b1 \
    --dataset_name tatsu-lab/alpaca \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --do_train \
    --do_eval \
    --output_dir /ai_workdir/models/bloomz-7b1-clm \
    --use_habana \
    --use_lazy_mode \
    --gradient_checkpointing \
    --throughput_warmup_steps 3 \
    --deepspeed ./llama2_ds_zero3_config.json \
    --gaudi_config_name gaudi_config.json \
    --trust_remote_code True \
    --overwrite_output_dir \
    --block_size 4096 \
    --save_strategy epoch
```
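The contents of `./llama2_ds_zero3_config.json` are not included in this report. For reference, a minimal ZeRO stage-3 DeepSpeed config of the kind shipped with the optimum-habana language-modeling examples might look like the sketch below; the exact values here are illustrative assumptions, not the reporter's actual file:

```json
{
  "steps_per_print": 64,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
```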
4. The run fails during evaluation with the following error log:
Traceback from rank 4 (`[rank4]:` prefixes stripped for readability):

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3208, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor, opts)
RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ai_workdir/optimum-habana/examples/language-modeling/run_clm.py", line 695, in <module>
    main()
  File "/ai_workdir/optimum-habana/examples/language-modeling/run_clm.py", line 662, in main
    metrics = trainer.evaluate()
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1748, in evaluate
    output = eval_loop(
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1904, in evaluation_loop
    losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 2110, in prediction_step
    raise error
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 2087, in prediction_step
    loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3532, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/bloom/modeling_bloom.py", line 559, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/bloom/modeling_bloom.py", line 438, in gaudi_bloom_model_forward
    outputs = block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/bloom/modeling_bloom.py", line 246, in gaudi_bloom_block_forward
    attn_outputs = self.self_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/bloom/modeling_bloom.py", line 205, in gaudi_bloom_attention_forward
    output_tensor = self.dense(context_layer)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1595, in _call_impl
    args_result = hook(self, args)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 278, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 452, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 291, in fetch_sub_module
    self.__all_gather_params(params_to_fetch, forward)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 435, in __all_gather_params
    self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 464, in __all_gather_params_
    handle = param_group[0].all_gather_coalesced(param_group, quantize=quantize)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1242, in all_gather_coalesced
    handles = _dist_allgather_fn(
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 95, in _dist_allgather_fn
    return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 320, in allgather_fn
    return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 305, in all_gather_into_tensor
    return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/torch.py", line 205, in all_gather_into_tensor
    return self.all_gather_function(output_tensor=output_tensor,
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 54, in _get_msg_dict
    "args": f"{args}, {kwargs}",
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 473, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 698, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 618, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 350, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 138, in __init__
    nonzero_finite_vals = torch.masked_select(
RuntimeError: [Rank:4] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread...
Check $HABANA_LOGS/ for details. Graph compile failed. synStatus=synStatus 26 [Generic failure].
[Rank:4] Habana exception raised from compile at graph.cpp:599
```

Ranks 2, 5, and 6 (and the remaining ranks) report the identical traceback.
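For context on where the failure sits: under ZeRO stage 3, each rank holds only a 1/world_size shard of every parameter, and DeepSpeed's pre-forward hook all-gathers the shards into the full tensor just before each layer runs; the `all_gather_into_tensor` at the top of the trace is that fetch, and on Gaudi it is the point where the HCCL collective's graph compile fails. A minimal pure-Python sketch of the mechanism (the collective is simulated with lists; function names mirror the trace but are illustrative, not DeepSpeed's actual API):

```python
from typing import List

WORLD_SIZE = 8

def partition(param: List[float], world_size: int) -> List[List[float]]:
    """Split a flat parameter into equal shards, padding the tail with zeros
    (ZeRO-3 partitions each parameter across all ranks this way)."""
    shard_len = -(-len(param) // world_size)  # ceiling division
    padded = param + [0.0] * (shard_len * world_size - len(param))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]

def all_gather_into_tensor(shards: List[List[float]], numel: int) -> List[float]:
    """Simulated collective: concatenate every rank's shard and drop the
    padding. In the failing run this is the HCCL all-gather on HPU."""
    flat = [x for shard in shards for x in shard]
    return flat[:numel]

def pre_forward_fetch(shards: List[List[float]], numel: int) -> List[float]:
    """What DeepSpeed's _pre_forward_module_hook does conceptually:
    materialize the full parameter right before the module's forward."""
    return all_gather_into_tensor(shards, numel)

# A toy 10-element "parameter" sharded across 8 ranks, then re-gathered.
param = [float(i) for i in range(10)]
shards = partition(param, WORLD_SIZE)
full = pre_forward_fetch(shards, numel=len(param))
assert full == param
```

This also explains why the error surfaces inside `self.dense(context_layer)`: the hook fires on every submodule, so the crash point in the model code is wherever the next parameter fetch happens to be, not necessarily where the underlying compile problem originates.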
Expected behavior: fine-tuning of bigscience/bloomz-7b1 completes successfully.
Is this on Gaudi2? Did it previously work with SynapseAI 1.17?
On Gaudi2D.
Installed Habana packages:

```
habanalabs-container-runtime/jammy,now 1.18.0-524 amd64 [installed]
habanalabs-dkms/jammy,now 1.18.0-524 all [installed]
habanalabs-firmware-odm/jammy,now 1.18.0-524 amd64 [installed]
habanalabs-firmware-tools/jammy,now 1.18.0-524 amd64 [installed]
habanalabs-firmware/jammy,now 1.18.0-524 amd64 [installed]
habanalabs-graph/jammy,now 1.18.0-524 amd64 [installed]
habanalabs-qual/jammy,now 1.18.0-524 amd64 [installed]
habanalabs-rdma-core/jammy,now 1.18.0-524 all [installed]
habanalabs-thunk/jammy,now 1.18.0-524 all [installed]
```
```
Firmware [SPI] Version : Preboot version hl-gaudi2-1.18.0-fw-53.1.1-sec-9 (Oct 02 2024 - 11:52:39)
```

(identical on all 8 devices)
@11989890 Is this still an issue? And when does the error happen: during training or during evaluation?