Important
`bigdl-llm` has now become `ipex-llm` (see the migration guide here); you may find the original `BigDL` project here.
`IPEX-LLM` is an LLM acceleration library for Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max), NPU and CPU [1].
Note
- It is built on top of the excellent work of `llama.cpp`, `transformers`, `bitsandbytes`, `vLLM`, `qlora`, `AutoGPTQ`, `AutoAWQ`, etc.
- It provides seamless integration with llama.cpp, Ollama, HuggingFace transformers, LangChain, LlamaIndex, vLLM, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, etc.
- 70+ models have been optimized/verified on `ipex-llm` (e.g., Llama, Phi, Mistral, Mixtral, Whisper, Qwen, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art LLM optimizations, XPU acceleration and low-bit (FP8/FP6/FP4/INT4) support; see the complete list here.
Project updates
- [2024/07] We added support for running Microsoft's GraphRAG using local LLM on Intel GPU; see the quickstart guide here.
- [2024/07] We added extensive support for Large Multimodal Models, including StableDiffusion, Phi-3-Vision, Qwen-VL, and more.
- [2024/07] We added FP6 support on Intel GPU.
- [2024/06] We added experimental NPU support for Intel Core Ultra processors; see the examples here.
- [2024/06] We added extensive support for pipeline parallel inference, which makes it easy to run large LLMs using 2 or more Intel GPUs (such as Arc).
- [2024/06] We added support for running RAGFlow with `ipex-llm` on Intel GPU.
- [2024/05] `ipex-llm` now supports Axolotl for LLM finetuning on Intel GPU; see the quickstart here.
- [2024/05] You can now easily run `ipex-llm` inference, serving and finetuning using the Docker images.
- [2024/05] You can now install `ipex-llm` on Windows using just "one command".
- [2024/04] You can now run Open WebUI on Intel GPU using `ipex-llm`; see the quickstart here.
- [2024/04] You can now run Llama 3 on Intel GPU using `llama.cpp` and `ollama` with `ipex-llm`; see the quickstart here.
- [2024/04] `ipex-llm` now supports Llama 3 on both Intel GPU and CPU.
- [2024/04] `ipex-llm` now provides a C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
- [2024/03] `bigdl-llm` has now become `ipex-llm` (see the migration guide here); you may find the original `BigDL` project here.
- [2024/02] `ipex-llm` now supports directly loading models from ModelScope (魔搭).
- [2024/02] `ipex-llm` added initial INT2 support (based on llama.cpp IQ2 mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
- [2024/02] Users can now use `ipex-llm` through the Text-Generation-WebUI GUI.
- [2024/02] `ipex-llm` now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
- [2024/02] `ipex-llm` now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
- [2024/01] Using `ipex-llm` QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
- [2023/12] `ipex-llm` now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
- [2023/12] `ipex-llm` now supports Mixtral-8x7B on both Intel GPU and CPU.
- [2023/12] `ipex-llm` now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
- [2023/12] `ipex-llm` now supports FP8 and FP4 inference on Intel GPU.
- [2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into `ipex-llm` is available.
- [2023/11] `ipex-llm` now supports vLLM continuous batching on both Intel GPU and CPU.
- [2023/10] `ipex-llm` now supports QLoRA finetuning on both Intel GPU and CPU.
- [2023/10] `ipex-llm` now supports FastChat serving on both Intel CPU and GPU.
- [2023/09] `ipex-llm` now supports Intel GPU (including iGPU, Arc, Flex and MAX).
- [2023/09] `ipex-llm` tutorial is released.
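The INT2 entry above is easier to appreciate with some back-of-the-envelope arithmetic. The sketch below estimates weight storage at several bit widths. The parameter count used for Mixtral-8x7B (about 46.7 billion) is an assumption taken from public model descriptions, not from this README, and the hypothetical `weight_gib` helper ignores quantization scales, activations and the KV cache, so real memory use will be higher.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring scales and other metadata."""
    return n_params * bits_per_weight / 8 / 2**30

# Assumed parameter count for Mixtral-8x7B (not stated in this README).
MIXTRAL_8X7B_PARAMS = 46.7e9

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name:>5}: {weight_gib(MIXTRAL_8X7B_PARAMS, bits):6.1f} GiB")
```

Under these assumptions the INT4 weights come to roughly 22 GiB and the INT2 weights to roughly 11 GiB, which is consistent with INT2 being the step that brings such a model within reach of a 16GB card.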
See demos of running local LLMs on Intel Core Ultra iGPU, Intel Core Ultra NPU, single-card Arc GPU, or multi-card Arc GPUs using `ipex-llm` below.
Intel Core Ultra (Series 1) iGPU | Intel Core Ultra (Series 2) NPU | Intel Arc dGPU | 2-Card Intel Arc dGPUs |
---|---|---|---|
Ollama (Mistral-7B Q4_K) | HuggingFace (Llama3.2-3B SYM_INT4) | TextGeneration-WebUI (Llama3-8B FP8) | FastChat (QWen1.5-32B FP6) |
See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below [1] (and refer to [2][3][4] for more details).
You may follow the Benchmarking Guide to run the `ipex-llm` performance benchmark yourself.
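Token generation speed is commonly summarized as the latency to the first token plus the average latency of each subsequent token. The sketch below shows one way to derive those numbers from timestamps. The `summarize_generation` helper and the sample timestamps are hypothetical and are not taken from the Benchmarking Guide, which remains the reference for how `ipex-llm` results are actually measured.

```python
from statistics import mean

def summarize_generation(start: float, token_times: list[float]) -> dict[str, float]:
    """Summarize one generation run from wall-clock timestamps (seconds).

    start        -- time the request was issued
    token_times  -- time each output token became available, in order
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    first_token_latency = token_times[0] - start
    # Gaps between consecutive tokens; empty when only one token was produced.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    next_token_latency = mean(gaps) if gaps else float("nan")
    return {
        "first_token_s": first_token_latency,
        "next_token_s": next_token_latency,
        "tokens_per_s": 1.0 / next_token_latency if gaps else float("nan"),
    }

# Hypothetical timestamps: 0.5 s to the first token, then one token every 50 ms.
stats = summarize_generation(0.0, [0.5, 0.55, 0.60, 0.65])
print(stats)
```

Reporting the first token separately matters because it includes prompt processing, which scales with input length, while the next-token figure reflects steady-state decoding.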
Please see the Perplexity result below (tested on Wikitext dataset using the script here).
Perplexity | sym_int4 | q4_k | fp6 | fp8_e5m2 | fp8_e4m3 | fp16 |
---|---|---|---|---|---|---|
Llama-2-7B-chat-hf | 6.364 | 6.218 | 6.092 | 6.180 | 6.098 | 6.096 |
Mistral-7B-Instruct-v0.2 | 5.365 | 5.320 | 5.270 | 5.273 | 5.246 | 5.244 |
Baichuan2-7B-chat | 6.734 | 6.727 | 6.527 | 6.539 | 6.488 | 6.508 |
Qwen1.5-7B-chat | 8.865 | 8.816 | 8.557 | 8.846 | 8.530 | 8.607 |
Llama-3.1-8B-Instruct | 6.705 | 6.566 | 6.338 | 6.383 | 6.325 | 6.267 |
gemma-2-9b-it | 7.541 | 7.412 | 7.269 | 7.380 | 7.268 | 7.270 |
Baichuan2-13B-Chat | 6.313 | 6.160 | 6.070 | 6.145 | 6.086 | 6.031 |
Llama-2-13b-chat-hf | 5.449 | 5.422 | 5.341 | 5.384 | 5.332 | 5.329 |
Qwen1.5-14B-Chat | 7.529 | 7.520 | 7.367 | 7.504 | 7.297 | 7.334 |
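For readers unfamiliar with the metric: perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below states that definition in code and computes how far each low-bit format in the Llama-2-7B-chat-hf row sits from the fp16 baseline. It illustrates the metric only; it is not the evaluation script linked above, and the helper names are hypothetical.

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_increase(ppl: float, baseline: float) -> float:
    """Fractional perplexity increase over a baseline (0.01 == 1%)."""
    return (ppl - baseline) / baseline

# Sanity check: a model that gives every token probability 1/4 has perplexity 4.
print(perplexity([math.log(0.25)] * 10))

# Llama-2-7B-chat-hf row from the table above.
row = {"sym_int4": 6.364, "q4_k": 6.218, "fp6": 6.092,
       "fp8_e5m2": 6.180, "fp8_e4m3": 6.098, "fp16": 6.096}
for fmt, ppl in row.items():
    print(f"{fmt:>9}: {relative_increase(ppl, row['fp16']):+.2%}")
```

For this row, sym_int4 lands about 4.4% above fp16, while fp6 and fp8_e4m3 are within a small fraction of a percent of it.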
- GPU Inference in C++: running `llama.cpp`, `ollama`, etc., with `ipex-llm` on Intel GPU
- GPU Inference in Python: running HuggingFace `transformers`, `LangChain`, `LlamaIndex`, `ModelScope`, etc. with `ipex-llm` on Intel GPU
- vLLM on GPU: running `vLLM` serving with `ipex-llm` on Intel GPU
- vLLM on CPU: running `vLLM` serving with `ipex-llm` on Intel CPU
- FastChat on GPU: running `FastChat` serving with `ipex-llm` on Intel GPU
- VSCode on GPU: running and developing `ipex-llm` applications in Python using VSCode on Intel GPU
- NPU: running `ipex-llm` on Intel NPU in both Python and C++
- llama.cpp: running llama.cpp (using C++ interface of `ipex-llm`) on Intel GPU
- Ollama: running ollama (using C++ interface of `ipex-llm`) on Intel GPU
- PyTorch/HuggingFace: running PyTorch, HuggingFace, LangChain, LlamaIndex, etc. (using Python interface of `ipex-llm`) on Intel GPU for Windows and Linux
- vLLM: running `ipex-llm` in vLLM on both Intel GPU and CPU
- FastChat: running `ipex-llm` in FastChat serving on both Intel GPU and CPU
- Serving on multiple Intel GPUs: running `ipex-llm` serving on multiple Intel GPUs by leveraging DeepSpeed AutoTP and FastAPI
- Text-Generation-WebUI: running `ipex-llm` in `oobabooga` WebUI
- Axolotl: running `ipex-llm` in Axolotl for LLM finetuning
- Benchmarking: running (latency and throughput) benchmarks for `ipex-llm` on Intel CPU and GPU
- GraphRAG: running Microsoft's `GraphRAG` using local LLM with `ipex-llm`
- RAGFlow: running `RAGFlow` (an open-source RAG engine) with `ipex-llm`
- LangChain-Chatchat: running `LangChain-Chatchat` (Knowledge Base QA using RAG pipeline) with `ipex-llm`
- Coding copilot: running `Continue` (coding copilot in VSCode) with `ipex-llm`
- Open WebUI: running `Open WebUI` with `ipex-llm`
- PrivateGPT: running `PrivateGPT` to interact with documents with `ipex-llm`
- Dify platform: running `ipex-llm` in `Dify` (production-ready LLM app development platform)
- Windows GPU: installing `ipex-llm` on Windows with Intel GPU
- Linux GPU: installing `ipex-llm` on Linux with Intel GPU
- For more details, please refer to the full installation guide
- Low-bit inference
  - INT4 inference: INT4 LLM inference on Intel GPU and CPU
  - FP8/FP6/FP4 inference: FP8, FP6 and FP4 LLM inference on Intel GPU
  - INT8 inference: INT8 LLM inference on Intel GPU and CPU
  - INT2 inference: INT2 LLM inference (based on llama.cpp IQ2 mechanism) on Intel GPU
- FP16/BF16 inference
  - FP16 LLM inference on Intel GPU, with possible self-speculative decoding optimization
  - BF16 LLM inference on Intel CPU, with possible self-speculative decoding optimization
- Saving and loading
  - Low-bit models: saving and loading `ipex-llm` low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.)
  - GGUF: directly loading GGUF models into `ipex-llm`
  - AWQ: directly loading AWQ models into `ipex-llm`
  - GPTQ: directly loading GPTQ models into `ipex-llm`
- Tutorials
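To give a feel for what the low-bit formats listed above trade away, here is a minimal sketch of symmetric 4-bit block quantization in pure Python. It is a generic illustration of the idea, not the `ipex-llm` kernels or storage format; the function names, the block contents and the choice of one scale per block are all assumptions made for the example.

```python
def quantize_sym_int4(block: list[float]) -> tuple[float, list[int]]:
    """Quantize one block of weights to signed 4-bit integers with a shared scale."""
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude to 7; an all-zero block keeps scale 1.0
    # to avoid dividing by zero.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Clamp to the signed 4-bit range [-8, 7].
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, q

def dequantize(scale: float, q: list[int]) -> list[float]:
    """Recover approximate weights from the integers and the block scale."""
    return [scale * v for v in q]

# Hypothetical weights for one block.
weights = [0.12, -0.70, 0.33, 0.06, -0.21, 0.64, -0.02, 0.41]
scale, q = quantize_sym_int4(weights)
restored = dequantize(scale, q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.4f}", f"max_err={max_err:.4f}")
```

Each weight is stored in 4 bits plus a share of the per-block scale, and the reconstruction error per weight is bounded by half the scale. Smaller blocks tighten that bound at the cost of storing more scales.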
Over 70 models have been optimized/verified on `ipex-llm`, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.
Model | CPU Example | GPU Example | NPU Example |
---|---|---|---|
LLaMA | link1, link2 | link | |
LLaMA 2 | link1, link2 | link | Python link, C++ link |
LLaMA 3 | link | link | Python link, C++ link |
LLaMA 3.1 | link | link | |
LLaMA 3.2 | link | Python link, C++ link | |
LLaMA 3.2-Vision | link | ||
ChatGLM | link | ||
ChatGLM2 | link | link | |
ChatGLM3 | link | link | |
GLM-4 | link | link | |
GLM-4V | link | link | |
GLM-Edge | link | Python link | |
Mistral | link | link | |
Mixtral | link | link | |
Falcon | link | link | |
MPT | link | link | |
Dolly-v1 | link | link | |
Dolly-v2 | link | link | |
Replit Code | link | link | |
RedPajama | link1, link2 | ||
Phoenix | link1, link2 | ||
StarCoder | link1, link2 | link | |
Baichuan | link | link | |
Baichuan2 | link | link | Python link |
InternLM | link | link | |
InternVL2 | link | ||
Qwen | link | link | |
Qwen1.5 | link | link | |
Qwen2 | link | link | Python link, C++ link |
Qwen2.5 | link | Python link, C++ link | |
Qwen-VL | link | link | |
Qwen2-VL | link | ||
Qwen2-Audio | link | ||
Aquila | link | link | |
Aquila2 | link | link | |
MOSS | link | ||
Whisper | link | link | |
Phi-1_5 | link | link | |
Flan-t5 | link | link | |
LLaVA | link | link | |
CodeLlama | link | link | |
Skywork | link | ||
InternLM-XComposer | link | ||
WizardCoder-Python | link | ||
CodeShell | link | ||
Fuyu | link | ||
Distil-Whisper | link | link | |
Yi | link | link | |
BlueLM | link | link | |
Mamba | link | link | |
SOLAR | link | link | |
Phixtral | link | link | |
InternLM2 | link | link | |
RWKV4 | link | ||
RWKV5 | link | ||
Bark | link | link | |
SpeechT5 | link | ||
DeepSeek-MoE | link | ||
Ziya-Coding-34B-v1.0 | link | ||
Phi-2 | link | link | |
Phi-3 | link | link | |
Phi-3-vision | link | link | |
Yuan2 | link | link | |
Gemma | link | link | |
Gemma2 | link | ||
DeciLM-7B | link | link | |
Deepseek | link | link | |
StableLM | link | link | |
CodeGemma | link | link | |
Command-R/cohere | link | link | |
CodeGeeX2 | link | link | |
MiniCPM | link | link | Python link, C++ link |
MiniCPM3 | link | ||
MiniCPM-V | link | ||
MiniCPM-V-2 | link | link | |
MiniCPM-Llama3-V-2_5 | link | Python link | |
MiniCPM-V-2_6 | link | link | Python link |
StableDiffusion | link | ||
Bce-Embedding-Base-V1 | | | Python link |
Speech_Paraformer-Large | | | Python link |
- Please report a bug or raise a feature request by opening a GitHub Issue
- Please report a vulnerability by opening a draft GitHub Security Advisory
Footnotes
[1] Performance varies by use, configuration and other factors. `ipex-llm` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.