[H3C] quantization step 4 failed with Llama 3.1 70B on TP2 config. #74

Open
tinafengfun opened this issue Jan 13, 2025 · 2 comments

tinafengfun commented Jan 13, 2025

During calibration of Llama 3.1 70B with tensor parallel = 2 (Gaudi 2), step 4 (quantize scales) failed with an out-of-memory error.
Adding a line to the script to enable offloading the model to the CPU mitigates this issue.
After reducing the batch size to 16 and enabling CPU offloading, calibration completed successfully.
Steps to reproduce:

root@linseernode:/home/1.19/vllm-hpu-extension/calibration# cat quantilization.sh
#!/bin/bash
model_path=/home/H3C/Llama-3-70B-Instruct
data_path=/home/quantization/open_orca/open_orca_gpt4_tokenized_llama.calibration_1000.pkl.gz
model_output_path=/home/Llama-3-70B-Instruct-FP8_tp2

./calibrate_model.sh -m $model_path  -d $data_path  -o $model_output_path -b 128 -t 2 -l 4096

A fix that addresses it:

Batch size adjustment:
./calibrate_model.sh -m $model_path -d $data_path -o $model_output_path -b 16 -t 2 -l 4096

CPU offloading: in step-4-quantize-scales.py, add weights_load_device="cpu" to the vllm.LLM(...) call (marked with a comment below):

###############################################################################
# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company
###############################################################################
import vllm
import torch
import argparse
import os
os.environ["EXPERIMENTAL_WEIGHT_SHARING"] = "0"
os.environ["VLLM_SKIP_WARMUP"] = "true"


if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, required=True)
    parser.add_argument("--tensor-parallel-size", type=int, default=1)

    args = parser.parse_args()

    llm = vllm.LLM(
        model=args.model,
        tensor_parallel_size=args.tensor_parallel_size,
        enforce_eager=True,
        dtype=torch.bfloat16,
        quantization='inc',
        kv_cache_dtype="fp8_inc",
        weights_load_device="cpu")  # added: load weights on CPU to reduce HPU memory pressure

    llm.llm_engine.model_executor.shutdown()
@michalkuligowski (Contributor) commented:

Hi @tinafengfun, batch_size=128 might be too big for running llama-3.1-70B in a tensor_parallel=2 environment; decreasing batch_size or increasing tensor_parallel (as in the Tip from https://github.com/HabanaAI/vllm-hpu-extension/tree/main/calibration) is needed. We will look into this shortly.
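
For illustration only (the -b 64 and -t 4 values below are assumptions, not values confirmed in this thread), the two mitigations map onto the reported calibrate_model.sh invocation like this:

./calibrate_model.sh -m $model_path -d $data_path -o $model_output_path -b 64 -t 2 -l 4096    # smaller batch
./calibrate_model.sh -m $model_path -d $data_path -o $model_output_path -b 128 -t 4 -l 4096   # higher tensor parallelism (requires 4 cards)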

@afierka-intel (Contributor) commented:

Hello @tinafengfun.

Thank you for your input! :) We agree that the proposed batch size is too high for this model with TP=2. In the vLLM logs we can even find a log entry which states:

INFO 01-20 18:42:43 executor_base.py:101] Maximum concurrency for 2048 tokens per request: 72.31x

so the maximum batch size in this case is 72.

Your proposal to load the model weights on the CPU is also a great idea. It helps reduce memory pressure on the HPU by loading the model's BF16 weights into RAM and storing only the FP8 weights in HPU memory.
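
As a sketch combining both observations (-b 72 is chosen here only to illustrate the concurrency limit above; the exact headroom on a given setup may differ), calibration would then be launched with the weights_load_device="cpu" change from the issue description plus a batch size within that limit:

./calibrate_model.sh -m $model_path -d $data_path -o $model_output_path -b 72 -t 2 -l 4096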

Can you prepare a PR with the proposed changes?

Thank you in advance!

Best regards,
Artur
