Using bf16 for inference on a CPU is slower than using float32. #12472

Open
fousdfrf opened this issue Dec 2, 2024 · 9 comments
@fousdfrf

fousdfrf commented Dec 2, 2024

On a system with an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz, inference with LLaMA-2-7B in bf16 is no faster than in float32. However, inference with sym_int8 weights is faster than float32. Why does using bf16 result in slower inference?
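
Roughly, the comparison looks like the sketch below. This is a simplified outline rather than my exact script; the model path, prompt, warm-up, and token counts are placeholders.

# Minimal sketch: time next-token generation on CPU in fp32 vs. bf16 via ipex-llm.
# Model path, prompt, and token counts are placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model

MODEL_PATH = "meta-llama/Llama-2-7b-chat-hf"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
inputs = tokenizer("Once upon a time", return_tensors="pt")

def ms_per_token(model, n_tokens=32):
    # Warm up once, then time a fixed number of newly generated tokens.
    with torch.inference_mode():
        model.generate(**inputs, max_new_tokens=4)          # warm-up
        start = time.time()
        model.generate(**inputs, max_new_tokens=n_tokens)   # timed run
    return (time.time() - start) / n_tokens * 1000

# fp32 baseline without ipex-llm.
model_fp32 = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float32)
print("fp32 ms/token:", ms_per_token(model_fp32))
del model_fp32  # free memory before loading the second copy

# Same model converted through ipex-llm with low_bit="bf16".
model_bf16 = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float32)
model_bf16 = optimize_model(model_bf16, low_bit="bf16")
print("bf16 ms/token:", ms_per_token(model_bf16))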

@hzjane
Contributor

hzjane commented Dec 3, 2024

How did you test and reach this conclusion? I can't reproduce it.
I set up the conda environment like this:

conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install omegaconf pandas

Then I followed this benchmark_util to test on an Intel(R) Xeon(R) Platinum 8468.

# config.yaml
low_bit: 'bf16'  # 'sym_int4' or 'sym_int8' or 'bf16'
in_out_pairs:
  - '1024-128'
test_api:
  - "optimize_model"            # on Intel CPU, to test low_bit like 'sym_int4', 'sym_int8' or 'bf16'
  - "pytorch_autocast_bf16"     # on Intel CPU, to test 'fp32'

bash run-spr.sh

And I got these first/next token latency (ms) results:

Llama-2-7b-chat-hf   first_token (ms)   next_token (ms)
sym_int8             1073.4             45.77
bf16                 906.13             89.45
fp32                 895.54             105.82

@fousdfrf
Author

fousdfrf commented Dec 3, 2024

Here is my approach:

conda create -n llm python=3.9
conda activate llm 
git clone https://github.com/SafeAILab/EAGLE.git
cd EAGLE
pip install -r requirements.txt
pip install --pre --upgrade ipex-llm[all]

I installed the default GPU version of PyTorch that EAGLE-2 provides, but I forced it to load the model into memory and run it on the CPU. After the model is loaded, I added two lines of code to ea_model.py:

from ipex_llm import optimize_model
base_model = optimize_model(base_model, low_bit="bf16", optimize_llm=False)

I found that:

  • When low_bit="sym_int8", the speed was faster than directly using float32 computation without ipex_llm.
  • However, when low_bit="bf16", the speed was slower compared to directly using float32 computation without ipex_llm.
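
For context, here is a rough sketch of how the base-model load and those two added lines fit together when forcing CPU execution. This is a generic outline, not EAGLE's actual ea_model.py code; the model path and loading arguments are placeholders.

# Generic sketch of loading the base model on CPU and then applying ipex-llm.
import torch
from transformers import AutoModelForCausalLM
from ipex_llm import optimize_model

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",   # placeholder path
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
).to("cpu")                             # force CPU even with a CUDA build of PyTorch installed

# The two added lines:
base_model = optimize_model(base_model, low_bit="bf16", optimize_llm=False)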

@hzjane
Contributor

hzjane commented Dec 4, 2024

Hi, @fousdfrf. Could you share the exact code you ran? What specific changes did you make to ea_model.py, and how did you compare the performance? The information you provided is too limited; when I try to run it my way, I get a Segmentation fault error.

@fousdfrf
Author

fousdfrf commented Dec 4, 2024

If I install the GPU version of PyTorch but force its backend to use the CPU for inference, and I want to run inference in BF16, how should I install ipex-llm?

@hzjane
Contributor

hzjane commented Dec 4, 2024

> If I install the GPU version of PyTorch but force its backend to use the CPU for inference, and I want to run inference in BF16, how should I install ipex-llm?

pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

@fousdfrf
Author

fousdfrf commented Dec 4, 2024

Installing collected packages: py-cpuinfo, ipex-llm, intel-cmplr-lib-ur, tabulate, intel-openmp, torch, accelerate, transformers
Attempting uninstall: torch
Found existing installation: torch 2.0.1
Uninstalling torch-2.0.1:
Successfully uninstalled torch-2.0.1
Attempting uninstall: accelerate
Found existing installation: accelerate 0.21.0
Uninstalling accelerate-0.21.0:
Successfully uninstalled accelerate-0.21.0
Attempting uninstall: transformers
Found existing installation: transformers 4.36.2
Uninstalling transformers-4.36.2:
Successfully uninstalled transformers-4.36.2
Successfully installed accelerate-0.23.0 intel-cmplr-lib-ur-2024.2.1 intel-openmp-2024.2.1 ipex-llm-2.2.0b20241203 py-cpuinfo-9.0.0 tabulate-0.9.0 torch-2.1.2+cpu transformers-4.37.0

I found that the GPU version of PyTorch was replaced with the CPU version, but some of my subsequent code still requires inference on the GPU.
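
A quick way (a generic sketch) to confirm which PyTorch build ended up active after a reinstall like the one above:

import torch
print(torch.__version__)          # e.g. "2.1.2+cpu" after the CPU wheel is installed
print(torch.cuda.is_available())  # False with a CPU-only build, so CUDA calls will fail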

@hzjane
Contributor

hzjane commented Dec 4, 2024

I think we currently do not support running on both a CUDA GPU and an Intel CPU at the same time.

@fousdfrf
Author

fousdfrf commented Dec 4, 2024

So this is just a coincidence, then. I only installed ipex-llm and the GPU version of PyTorch, forced its backend to use the CPU for inference, and ran inference with both BF16 and int8.
I found that:

  • When low_bit="sym_int8", the speed was faster than directly using float32 computation without ipex_llm.
  • However, when low_bit="bf16", the speed was slower than directly using float32 computation without ipex_llm.

@hzjane
Contributor

hzjane commented Dec 4, 2024

Maybe you didn't install the latest ipex-llm? I get normal performance with the latest ipex-llm. With your installation method, the latest ipex-llm may not get installed.
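
One way (a generic sketch) to check which ipex-llm build is actually present in the environment being benchmarked:

from importlib.metadata import version
print(version("ipex-llm"))   # e.g. 2.2.0b20241203 in the pip log above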
