[H3C] quantization step 4 failed with Llama 3.1 70B on TP2 config. #74

Open
tinafengfun opened this issue Jan 13, 2025 · 2 comments

tinafengfun commented Jan 13, 2025

During calibration of Llama 3.1 70B with tensor parallel = 2 (Gaudi 2), step 4 (quantize scales) failed with an out-of-memory error.
Adding a line to the script to enable offloading the model to the CPU mitigates this issue.
After reducing the batch size to 16 and enabling CPU offloading, calibration completed successfully.
Steps to reproduce:

root@linseernode:/home/1.19/vllm-hpu-extension/calibration# cat quantilization.sh
#!/bin/bash
model_path=/home/H3C/Llama-3-70B-Instruct
data_path=/home/quantization/open_orca/open_orca_gpt4_tokenized_llama.calibration_1000.pkl.gz
model_output_path=/home/Llama-3-70B-Instruct-FP8_tp2

./calibrate_model.sh -m $model_path  -d $data_path  -o $model_output_path -b 128 -t 2 -l 4096

A fix that addresses it:

Batch size adjustment:
./calibrate_model.sh -m $model_path -d $data_path -o $model_output_path -b 16 -t 2 -l 4096

CPU offloading: in step-4-quantize-scales.py, add weights_load_device="cpu" to the vllm.LLM(...) call (marked with a comment below):

###############################################################################
# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company
###############################################################################
import vllm
import torch
import argparse
import os
os.environ["EXPERIMENTAL_WEIGHT_SHARING"] = "0"
os.environ["VLLM_SKIP_WARMUP"] = "true"


if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, required=True)
    parser.add_argument("--tensor-parallel-size", type=int, default=1)

    args = parser.parse_args()

    llm = vllm.LLM(
        model=args.model,
        tensor_parallel_size=args.tensor_parallel_size,
        enforce_eager=True,
        dtype=torch.bfloat16,
        quantization='inc',
        kv_cache_dtype="fp8_inc",
        weights_load_device="cpu")  # added: load weights on CPU to reduce HPU memory pressure

    llm.llm_engine.model_executor.shutdown()
@michalkuligowski (Contributor) commented:

Hi @tinafengfun, batch_size=128 might be too big for running llama-3.1-70B in a tensor_parallel=2 environment; decreasing batch_size or increasing tensor_parallel (as in the Tip from https://github.com/HabanaAI/vllm-hpu-extension/tree/main/calibration) is needed. We will look into this shortly.
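
For illustration only (the -b 64 and -t 4 values below are assumptions, not values confirmed in this thread), the two mitigations map onto the reported calibrate_model.sh invocation like this:

./calibrate_model.sh -m $model_path -d $data_path -o $model_output_path -b 64 -t 2 -l 4096    # smaller batch
./calibrate_model.sh -m $model_path -d $data_path -o $model_output_path -b 128 -t 4 -l 4096   # higher tensor parallelism (requires 4 cards)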

@afierka-intel (Contributor) commented:

Hello @tinafengfun.

Thank you for your input! :) We agree that the proposed batch size is too high for this model with TP=2. In the vLLM logs we can even find a log entry which states:

INFO 01-20 18:42:43 executor_base.py:101] Maximum concurrency for 2048 tokens per request: 72.31x

so the maximum batch size in this case is 72.

Your proposal to load the model weights on the CPU is also a great idea. It helps reduce memory pressure on the HPU by loading the model's BF16 weights into RAM and storing only the FP8 weights in HPU memory.
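
As a sketch combining both observations (-b 72 is chosen here only to illustrate the concurrency limit above; the exact headroom on a given setup may differ), calibration would then be launched with the weights_load_device="cpu" change from the issue description plus a batch size within that limit:

./calibrate_model.sh -m $model_path -d $data_path -o $model_output_path -b 72 -t 2 -l 4096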

Can you prepare a PR with the proposed changes?

Thank you in advance!

Best regards,
Artur
