Using bf16 for inference on a CPU is slower than using float32. #12472

Open
fousdfrf opened this issue Dec 2, 2024 · 9 comments
@fousdfrf

fousdfrf commented Dec 2, 2024

On a system with an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz, inference with LLaMA-2-7B in bf16 is no faster than in float32. However, inference with sym_int8 weights is faster than float32. Why does using bf16 result in slower inference?
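
Roughly, the comparison looks like the sketch below. This is a simplified outline rather than my exact script; the model path, prompt, warm-up, and token counts are placeholders.

# Minimal sketch: time next-token generation on CPU in fp32 vs. bf16 via ipex-llm.
# Model path, prompt, and token counts are placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model

MODEL_PATH = "meta-llama/Llama-2-7b-chat-hf"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
inputs = tokenizer("Once upon a time", return_tensors="pt")

def ms_per_token(model, n_tokens=32):
    # Warm up once, then time a fixed number of newly generated tokens.
    with torch.inference_mode():
        model.generate(**inputs, max_new_tokens=4)          # warm-up
        start = time.time()
        model.generate(**inputs, max_new_tokens=n_tokens)   # timed run
    return (time.time() - start) / n_tokens * 1000

# fp32 baseline without ipex-llm.
model_fp32 = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float32)
print("fp32 ms/token:", ms_per_token(model_fp32))
del model_fp32  # free memory before loading the second copy

# Same model converted through ipex-llm with low_bit="bf16".
model_bf16 = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float32)
model_bf16 = optimize_model(model_bf16, low_bit="bf16")
print("bf16 ms/token:", ms_per_token(model_bf16))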

@hzjane
Contributor

hzjane commented Dec 3, 2024

How did you test and reach this conclusion? I can't reproduce it.
I set up the conda environment like this:

conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install omegaconf pandas

Then I followed this benchmark_util to test on an Intel(R) Xeon(R) Platinum 8468.

# config.yaml
low_bit: 'bf16'  # 'sym_int4' or 'sym_int8' or 'bf16'
in_out_pairs:
  - '1024-128'
test_api:
  - "optimize_model"            # on Intel CPU, to test low_bit like 'sym_int4', 'sym_int8' or 'bf16'
  - "pytorch_autocast_bf16"     # on Intel CPU, to test 'fp32'

bash run-spr.sh

And I got these first/next token latency (ms) results:

Llama-2-7b-chat-hf   first_token (ms)   next_token (ms)
sym_int8             1073.4             45.77
bf16                 906.13             89.45
fp32                 895.54             105.82

@fousdfrf
Author

fousdfrf commented Dec 3, 2024

Here is my approach:

conda create -n llm python=3.9
conda activate llm 
git clone https://github.com/SafeAILab/EAGLE.git
cd EAGLE
pip install -r requirements.txt
pip install --pre --upgrade ipex-llm[all]

I installed the default GPU version of PyTorch that EAGLE-2 provides, but I forced it to load the model into memory and run it on the CPU. After the model is loaded, I added two lines of code to ea_model.py:

from ipex_llm import optimize_model
base_model = optimize_model(base_model, low_bit="bf16", optimize_llm=False)

I found that:

  • When low_bit="sym_int8", the speed was faster than directly using float32 computation without ipex_llm.
  • However, when low_bit="bf16", the speed was slower compared to directly using float32 computation without ipex_llm.
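
For context, here is a rough sketch of how the base-model load and those two added lines fit together when forcing CPU execution. This is a generic outline, not EAGLE's actual ea_model.py code; the model path and loading arguments are placeholders.

# Generic sketch of loading the base model on CPU and then applying ipex-llm.
import torch
from transformers import AutoModelForCausalLM
from ipex_llm import optimize_model

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",   # placeholder path
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
).to("cpu")                             # force CPU even with a CUDA build of PyTorch installed

# The two added lines:
base_model = optimize_model(base_model, low_bit="bf16", optimize_llm=False)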

@hzjane
Contributor

hzjane commented Dec 4, 2024

Hi, @fousdfrf. Could you share the exact code you ran? What specific changes did you make to ea_model.py, and how did you compare the performance? The information you provided is too limited; when I try to run it my way, I get a Segmentation fault error.

@fousdfrf
Author

fousdfrf commented Dec 4, 2024

If I install the GPU version of PyTorch but force its backend to use the CPU for inference, and I want to run inference in BF16, how should I install ipex-llm?

@hzjane
Contributor

hzjane commented Dec 4, 2024

> If I install the GPU version of PyTorch but force its backend to use the CPU for inference, and I want to run inference in BF16, how should I install ipex-llm?

pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

@fousdfrf
Author

fousdfrf commented Dec 4, 2024

Installing collected packages: py-cpuinfo, ipex-llm, intel-cmplr-lib-ur, tabulate, intel-openmp, torch, accelerate, transformers
Attempting uninstall: torch
Found existing installation: torch 2.0.1
Uninstalling torch-2.0.1:
Successfully uninstalled torch-2.0.1
Attempting uninstall: accelerate
Found existing installation: accelerate 0.21.0
Uninstalling accelerate-0.21.0:
Successfully uninstalled accelerate-0.21.0
Attempting uninstall: transformers
Found existing installation: transformers 4.36.2
Uninstalling transformers-4.36.2:
Successfully uninstalled transformers-4.36.2
Successfully installed accelerate-0.23.0 intel-cmplr-lib-ur-2024.2.1 intel-openmp-2024.2.1 ipex-llm-2.2.0b20241203 py-cpuinfo-9.0.0 tabulate-0.9.0 torch-2.1.2+cpu transformers-4.37.0

I found that the GPU version of PyTorch was replaced with the CPU version, but some of my subsequent code still requires inference on the GPU.
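
A quick way (a generic sketch) to confirm which PyTorch build ended up active after a reinstall like the one above:

import torch
print(torch.__version__)          # e.g. "2.1.2+cpu" after the CPU wheel is installed
print(torch.cuda.is_available())  # False with a CPU-only build, so CUDA calls will fail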

@hzjane
Contributor

hzjane commented Dec 4, 2024

I think we currently do not support running on both a CUDA GPU and an Intel CPU at the same time.

@fousdfrf
Author

fousdfrf commented Dec 4, 2024

So this is just a coincidence, then. I only installed ipex-llm and the GPU version of PyTorch, forced its backend to use the CPU for inference, and ran inference with both BF16 and int8.
I found that:

  • When low_bit="sym_int8", the speed was faster than directly using float32 computation without ipex_llm.
  • However, when low_bit="bf16", the speed was slower than directly using float32 computation without ipex_llm.

@hzjane
Contributor

hzjane commented Dec 4, 2024

Maybe you didn't install the latest ipex-llm? I get normal performance with the latest ipex-llm. With your installation method, the latest ipex-llm may not get installed.
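
One way (a generic sketch) to check which ipex-llm build is actually present in the environment being benchmarked:

from importlib.metadata import version
print(version("ipex-llm"))   # e.g. 2.2.0b20241203 in the pip log above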
