RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::234881024 (224)MB #2 #1469

James-Lu-none · 2024-11-02T10:11:44Z

System Info

Image: vault.habana.ai/gaudi-docker/1.17.1/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
Hardware: Habana Labs Gaudi HL205 Mezzanine Card with HL-2000 AI Training Accelerator [Gaudi] (rev 01) x8

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

python run_generation.py \
--model_name_or_path /root/workspace/model/meta-llama/Llama-3.1-8B/ \
--use_hpu_graphs \
--use_kv_cache \
--max_new_tokens 100 \
--do_sample \
--prompt "Here is my prompt"

/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
11/02/2024 10:02:30 - INFO - __main__ - Single-device run.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00,  9.44it/s]
Some weights of GaudiLlamaForCausalLM were not initialized from the model checkpoint at /root/workspace/model/meta-llama/Llama-3.1-8B/ and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 96
CPU RAM       : 527938484 KB
------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/workspace/github/optimum-habana/examples/text-generation/run_generation.py", line 692, in <module>
    main()
  File "/root/workspace/github/optimum-habana/examples/text-generation/run_generation.py", line 337, in main
    model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/workspace/github/optimum-habana/examples/text-generation/utils.py", line 633, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/workspace/github/optimum-habana/examples/text-generation/utils.py", line 267, in setup_model
    model = model.eval().to(args.device)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2871, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1176, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1162, in convert
    return t.to(
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::234881024 (224)MB

Expected behavior

The HL-2000 AI Training Accelerator has 32 GB of HBM, so I assumed it should be able to run meta-llama/Llama-3.1-8B (8B parameters * 16 bits / 8 bits per byte => ~16 GB required).
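
The estimate above depends on the dtype the weights are actually held in. The following is a minimal sketch, assuming a round 8e9 parameters and counting weights only (no KV cache, HPU graphs, or runtime overhead). The function name `weight_gb` is illustrative and not part of optimum-habana.

```python
# Rough weight-only memory estimate for an "8B" model.
# Assumptions: 8e9 parameters (round number), 32 GB of HBM as quoted in the issue.
PARAMS = 8e9
HBM_GB = 32

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_gb(num_params, dtype):
    """Return the weight footprint in GB (10^9 bytes) for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    gb = weight_gb(PARAMS, dtype)
    print(f"{dtype}: {gb:.0f} GB of weights, {HBM_GB - gb:.0f} GB of HBM left")
```

Under these assumptions the weights need about 16 GB in bfloat16 but about 32 GB in float32, which would leave no headroom on a 32 GB card.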

regisss · 2024-11-02T10:51:46Z

The command you ran only uses one device and doesn't cast all the model parameters to bf16.
You should add --bf16 to make sure all the model parameters are cast to bf16.
You could also use DeepSpeed to take advantage of all your devices, but it will be much slower, so I don't recommend it for a model that can fit on one device.

James-Lu-none · 2024-11-03T02:20:48Z

Thank you for the help and advice! Adding --bf16 works! But I still have some questions:

Why does not casting all the model parameters to bf16 lead to this error?
Why do I have to explicitly cast them to bf16? It seems like all the parameters are already bf16 in Llama-3.1-8B.
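
One hint about the dtype in use comes from the failed allocation size itself. The sketch below assumes the Llama-3.1-8B dimensions from its published config (hidden size 4096, MLP intermediate size 14336). These values are not stated in this thread, and `tensor_bytes` is an illustrative helper. It checks which element width makes one MLP projection weight equal to the 234881024 bytes in the error message.

```python
FAILED_BYTES = 234_881_024  # size reported in the PT_DEVMEM error

# Assumed Llama-3.1-8B dimensions (from the model config, not from this thread).
HIDDEN_SIZE = 4096
INTERMEDIATE_SIZE = 14336


def tensor_bytes(shape, bytes_per_element):
    """Return the storage size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


mlp_shape = (INTERMEDIATE_SIZE, HIDDEN_SIZE)  # one MLP projection weight
for dtype, width in {"float32": 4, "bfloat16": 2}.items():
    size = tensor_bytes(mlp_shape, width)
    note = "matches the failed allocation" if size == FAILED_BYTES else "no match"
    print(f"{dtype}: {size} bytes ({size // 2**20} MiB), {note}")
```

Under these assumptions only the float32 size matches. That is consistent with the weights being moved to the device in float32, though a matching size alone does not prove it.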

James-Lu-none added the bug Something isn't working label Nov 2, 2024

James-Lu-none mentioned this issue Nov 3, 2024

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::234881024 (224)MB NTUT-intel-Gaudi/records#2

Open
