RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::234881024 (224)MB #2

James-Lu-none · 2024-11-02T10:03:38Z

cd /root/workspace/github/optimum-habana/examples/text-generation/
python run_generation.py \
--model_name_or_path /root/workspace/model/meta-llama/Llama-3.1-8B/ \
--use_hpu_graphs \
--use_kv_cache \
--max_new_tokens 100 \
--do_sample \
--prompt "Here is my prompt"

/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
11/02/2024 10:02:30 - INFO - __main__ - Single-device run.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00,  9.44it/s]
Some weights of GaudiLlamaForCausalLM were not initialized from the model checkpoint at /root/workspace/model/meta-llama/Llama-3.1-8B/ and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 96
CPU RAM       : 527938484 KB
------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/workspace/github/optimum-habana/examples/text-generation/run_generation.py", line 692, in <module>
    main()
  File "/root/workspace/github/optimum-habana/examples/text-generation/run_generation.py", line 337, in main
    model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/workspace/github/optimum-habana/examples/text-generation/utils.py", line 633, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/workspace/github/optimum-habana/examples/text-generation/utils.py", line 267, in setup_model
    model = model.eval().to(args.device)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2871, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1176, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1162, in convert
    return t.to(
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::234881024 (224)MB

The text was updated successfully, but these errors were encountered:

James-Lu-none · 2024-11-03T02:27:11Z

huggingface/optimum-habana#1469
In short: add --bf16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::234881024 (224)MB #2

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::234881024 (224)MB #2

James-Lu-none commented Nov 2, 2024

James-Lu-none commented Nov 3, 2024

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::234881024 (224)MB #2

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::234881024 (224)MB #2

Comments

James-Lu-none commented Nov 2, 2024

James-Lu-none commented Nov 3, 2024