Using bf16 for inference on a CPU is slower than using float32. #12472
Comments
How did you test and come to this conclusion? I can't reproduce it.
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install omegaconf pandas
And I follow this benchmark_util to test, with:
# config.yaml
low_bit: 'bf16' # 'sym_int4' or 'sym_int8' or 'bf16'
in_out_pairs:
- '1024-128'
test_api:
- "optimize_model" # on Intel CPU, to test low_bit like 'sym_int4' or 'sym_int8' or 'bf16'.
- "pytorch_autocast_bf16" # on Intel CPU, to test 'fp32'.
bash run-spr.sh
And I get this first/next token latency (ms) result (table omitted here).
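For readers who want a rough reproduction of the first/next token latency numbers without the full benchmark harness, here is a minimal sketch under stated assumptions: the checkpoint path, prompt, and token counts are placeholders, and the real benchmark_util adds warm-up runs and multiple trials. Only the optimize_model(..., low_bit="bf16") call is taken from this thread.

# Minimal latency sketch (not benchmark_util itself); checkpoint path, prompt,
# and token counts are assumptions.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model

model_path = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)
model = optimize_model(model, low_bit="bf16")  # use "sym_int8", or skip this call for an fp32 baseline
model.eval()

inputs = tokenizer("Once upon a time " * 64, return_tensors="pt")  # rough long prompt

with torch.inference_mode():
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)    # prefill + 1 token
    first_ms = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=128)  # prefill + 128 tokens
    total_ms = (time.perf_counter() - t0) * 1000
    next_ms = (total_ms - first_ms) / 127         # amortized decode latency

print(f"first token: {first_ms:.1f} ms, next token: {next_ms:.1f} ms")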
Here is my approach:
conda create -n llm python=3.9
conda activate llm
git clone https://github.com/SafeAILab/EAGLE.git
cd EAGLE
pip install -r requirements.txt
pip install --pre --upgrade ipex-llm[all]
I installed the default GPU version of PyTorch provided in the requirements, then applied:
from ipex_llm import optimize_model
base_model = optimize_model(base_model, low_bit="bf16", optimize_llm=False)
I found that bf16 was still slower than float32 (detailed results omitted).
Hi, @fousdfrf. Can I know the exact code of the program you ran? What specific changes did you make?
If I install the GPU version of PyTorch but force its backend to use the CPU for inference, and I want to perform inference using BF16, how should I install ipex-llm?
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
Installing collected packages: py-cpuinfo, ipex-llm, intel-cmplr-lib-ur, tabulate, intel-openmp, torch, accelerate, transformers
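One hedged way to confirm which PyTorch build is actually active after that install (the --extra-index-url above points pip at CPU-only wheels, so it may replace a previously installed CUDA build):

# Quick check of the torch build that is now importable (assumption: a
# CPU-only wheel reports a "+cpu" version suffix and no CUDA runtime).
import torch
print(torch.__version__)          # e.g. ends with "+cpu" for CPU-only wheels
print(torch.version.cuda)         # None for CPU-only builds
print(torch.cuda.is_available())  # False when no usable CUDA backend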
I think we currently do not support running on both the CUDA GPU and the Intel CPU.
So, this is a coincidence. I only installed ipex-llm and the GPU version of PyTorch, but forced its backend to use the CPU for inference, wanting to perform inference using BF16 and int8.
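For clarity, a minimal sketch of that setup under assumptions (placeholder checkpoint and prompt; only the optimize_model call with low_bit="bf16" and optimize_llm=False comes from the earlier comment): everything is kept on the CPU even though the installed torch build may have CUDA support.

# Sketch: bf16 (or sym_int8) inference kept entirely on the CPU, ignoring any
# CUDA devices the installed PyTorch build could see. Path and prompt are
# placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model

model_path = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
base_model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)

# low_bit="sym_int8" is the int8 variant mentioned above.
base_model = optimize_model(base_model, low_bit="bf16", optimize_llm=False)
base_model = base_model.to("cpu").eval()          # never call .to("cuda")

inputs = tokenizer("Hello", return_tensors="pt")  # CPU tensors by default
with torch.inference_mode():
    out = base_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))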
Maybe you didn't install the latest ipex-llm? I can get normal performance using the latest ipex-llm. With your installation method, the latest ipex-llm may not get installed.
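A simple way to check which ipex-llm version actually ended up in the environment (nothing ipex-llm specific here beyond the package names):

# Print the installed ipex-llm and torch versions from package metadata.
from importlib.metadata import version
print("ipex-llm:", version("ipex-llm"))
print("torch:", version("torch"))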
On a system with Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz, when using bf16 for inference with LLaMA-2-7B, the speed is not faster than using float32. However, when using sym_int8 weights for inference, the speed is faster than float32. Why does using bf16 result in slower inference?
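One factor worth checking here, beyond the ipex-llm version: CPU BF16 only tends to pay off when the processor exposes native BF16 instructions (AVX-512 BF16 or AMX); without them, BF16 compute typically goes through conversions and can end up no faster, or slower, than FP32, whereas sym_int8 still benefits from reduced memory traffic. A hedged sketch to see what the running CPU reports (py-cpuinfo already appears in the install output above):

# Sketch: report whether this CPU advertises native BF16 support.
# Flag names are as exposed by /proc/cpuinfo on Linux.
import cpuinfo
flags = set(cpuinfo.get_cpu_info().get("flags", []))
print("avx512_bf16:", "avx512_bf16" in flags)  # AVX-512 BF16 (Cooper Lake and newer)
print("amx_bf16:", "amx_bf16" in flags)        # AMX BF16 (Sapphire Rapids and newer)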