Accelerate LLM with low-bit (FP4 / INT4 / FP8 / INT8) optimizations using bigdl-llm

hxsz1997/BigDL

 
 

Important

bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.


💫 Intel® LLM library for PyTorch*

IPEX-LLM is a PyTorch library for running LLMs on Intel CPUs and GPUs (e.g., a local PC with an iGPU, or discrete GPUs such as Arc, Flex and Max) with very low latency (see footnote 1).
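
As a rough illustration of what this looks like in practice, the sketch below loads a Hugging Face model with low-bit weights through the `ipex_llm.transformers` drop-in classes and moves it to an Intel GPU. It assumes ipex-llm with GPU support and `transformers` are installed. The helper names `load_low_bit_model` and `generate`, the `sym_int4` default, and the `xpu` device default are illustrative choices rather than settings prescribed here, and some ipex-llm versions may need extra setup before the `xpu` device is usable.

```python
def load_low_bit_model(model_path, low_bit="sym_int4", device="xpu"):
    """Load a causal LM with low-bit weights (hypothetical helper)."""
    # Imported lazily so this file can be imported without ipex-llm installed.
    from ipex_llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_low_bit=low_bit,   # e.g. "sym_int4", "fp8", "fp6"
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    # "xpu" is the PyTorch device name for Intel GPUs; use "cpu" otherwise.
    return model.to(device), tokenizer


def generate(model, tokenizer, prompt, max_new_tokens=32):
    """Run plain generation on whichever device the model lives on."""
    import torch

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

A call would look like `model, tok = load_low_bit_model("path/to/model")` followed by `generate(model, tok, "What is AI?")`, where the model path is a placeholder.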

Latest Update 🔥

  • [2024/07] We added support for running Microsoft's GraphRAG using local LLM on Intel GPU; see the quickstart guide here.
  • [2024/07] We added extensive support for Large Multimodal Models, including StableDiffusion, Phi-3-Vision, Qwen-VL, and more.
  • [2024/07] We added FP6 support on Intel GPU.
  • [2024/06] We added experimental NPU support for Intel Core Ultra processors; see the examples here.
  • [2024/06] We added extensive support of pipeline parallel inference, which makes it easy to run large-sized LLM using 2 or more Intel GPUs (such as Arc).
  • [2024/06] We added support for running RAGFlow with ipex-llm on Intel GPU.
  • [2024/05] ipex-llm now supports Axolotl for LLM finetuning on Intel GPU; see the quickstart here.
More updates
  • [2024/05] You can now easily run ipex-llm inference, serving and finetuning using the Docker images.
  • [2024/05] You can now install ipex-llm on Windows using just "one command".
  • [2024/04] You can now run Open WebUI on Intel GPU using ipex-llm; see the quickstart here.
  • [2024/04] You can now run Llama 3 on Intel GPU using llama.cpp and ollama with ipex-llm; see the quickstart here.
  • [2024/04] ipex-llm now supports Llama 3 on both Intel GPU and CPU.
  • [2024/04] ipex-llm now provides a C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
  • [2024/03] bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
  • [2024/02] ipex-llm now supports directly loading models from ModelScope (魔搭).
  • [2024/02] ipex-llm added initial INT2 support (based on llama.cpp IQ2 mechanism), which makes it possible to run large-sized LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
  • [2024/02] Users can now use ipex-llm through Text-Generation-WebUI GUI.
  • [2024/02] ipex-llm now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
  • [2024/02] ipex-llm now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
  • [2024/01] Using ipex-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
  • [2023/12] ipex-llm now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
  • [2023/12] ipex-llm now supports Mixtral-8x7B on both Intel GPU and CPU.
  • [2023/12] ipex-llm now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
  • [2023/12] ipex-llm now supports FP8 and FP4 inference on Intel GPU.
  • [2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into ipex-llm is available.
  • [2023/11] ipex-llm now supports vLLM continuous batching on both Intel GPU and CPU.
  • [2023/10] ipex-llm now supports QLoRA finetuning on both Intel GPU and CPU.
  • [2023/10] ipex-llm now supports FastChat serving on both Intel CPU and GPU.
  • [2023/09] ipex-llm now supports Intel GPU (including iGPU, Arc, Flex and MAX).
  • [2023/09] ipex-llm tutorial is released.
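
The [2024/02] INT2 item above notes that 2-bit weights make Mixtral-8x7B fit on a GPU with 16GB VRAM. A back-of-the-envelope sketch of why follows. It counts weight storage only (no KV cache or activations), uses nominal bit widths (real low-bit formats add a little per-block overhead for scales), and assumes a total of roughly 46.7 billion parameters for Mixtral-8x7B, a figure that does not come from this README.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate storage for the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: Mixtral-8x7B has roughly 46.7 billion parameters in total.
MIXTRAL_8X7B_PARAMS = 46.7e9

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4), ("int2", 2)]:
    size = weight_memory_gib(MIXTRAL_8X7B_PARAMS, bits)
    print(f"{name}: {size:.1f} GiB")
```

Under these assumptions, the 4-bit weights alone already exceed 16 GiB, while the 2-bit weights stay below it.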

ipex-llm Performance

See the token generation speed on Intel Core Ultra and Intel Arc GPU below (see footnote 1; refer to [2][3][4] for more details).

You may follow the Benchmarking Guide to run the ipex-llm performance benchmarks yourself.

ipex-llm Demo

See demos of running local LLMs on Intel Iris iGPU, Intel Core Ultra iGPU, single-card Arc GPU, or multi-card Arc GPUs using ipex-llm below.

  • Intel Iris iGPU: llama.cpp (Phi-3-mini Q4_0)
  • Intel Core Ultra iGPU: Ollama (Mistral-7B Q4_K)
  • Intel Arc dGPU: TextGeneration-WebUI (Llama3-8B FP8)
  • 2-Card Intel Arc dGPUs: FastChat (QWen1.5-32B FP6)

Model Accuracy

Please see the Perplexity result below (tested on Wikitext dataset using the script here).

| Perplexity | sym_int4 | q4_k | fp6 | fp8_e5m2 | fp8_e4m3 | fp16 |
|---|---|---|---|---|---|---|
| Llama-2-7B-chat-hf | 6.364 | 6.218 | 6.092 | 6.180 | 6.098 | 6.096 |
| Mistral-7B-Instruct-v0.2 | 5.365 | 5.320 | 5.270 | 5.273 | 5.246 | 5.244 |
| Baichuan2-7B-chat | 6.734 | 6.727 | 6.527 | 6.539 | 6.488 | 6.508 |
| Qwen1.5-7B-chat | 8.865 | 8.816 | 8.557 | 8.846 | 8.530 | 8.607 |
| Llama-3.1-8B-Instruct | 6.705 | 6.566 | 6.338 | 6.383 | 6.325 | 6.267 |
| gemma-2-9b-it | 7.541 | 7.412 | 7.269 | 7.380 | 7.268 | 7.270 |
| Baichuan2-13B-Chat | 6.313 | 6.160 | 6.070 | 6.145 | 6.086 | 6.031 |
| Llama-2-13b-chat-hf | 5.449 | 5.422 | 5.341 | 5.384 | 5.332 | 5.329 |
| Qwen1.5-14B-Chat | 7.529 | 7.520 | 7.367 | 7.504 | 7.297 | 7.334 |
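
To help read the table above, here is a minimal sketch of the standard perplexity definition (the exponential of the mean negative log-likelihood per token) together with a helper that expresses a low-bit result as a percentage increase over fp16. This is not the benchmark script referenced above. The function names are illustrative, and the three sample rows are copied from the table.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_increase(low_bit_ppl, fp16_ppl):
    """Percentage by which a low-bit perplexity exceeds the fp16 one."""
    return (low_bit_ppl - fp16_ppl) / fp16_ppl * 100.0


# (sym_int4, fp16) perplexity pairs copied from the table above.
RESULTS = {
    "Llama-2-7B-chat-hf": (6.364, 6.096),
    "Mistral-7B-Instruct-v0.2": (5.365, 5.244),
    "Llama-3.1-8B-Instruct": (6.705, 6.267),
}

for model, (int4_ppl, fp16_ppl) in RESULTS.items():
    gap = relative_increase(int4_ppl, fp16_ppl)
    print(f"{model}: sym_int4 is {gap:.1f}% above fp16")
```

Lower perplexity is better, so a small percentage gap means the low-bit format stays close to the fp16 baseline on this dataset.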

ipex-llm Quickstart

Docker

  • GPU Inference in C++: running llama.cpp, ollama, OpenWebUI, etc., with ipex-llm on Intel GPU
  • GPU Inference in Python: running HuggingFace transformers, LangChain, LlamaIndex, ModelScope, etc. with ipex-llm on Intel GPU
  • vLLM on GPU: running vLLM serving with ipex-llm on Intel GPU
  • vLLM on CPU: running vLLM serving with ipex-llm on Intel CPU
  • FastChat on GPU: running FastChat serving with ipex-llm on Intel GPU
  • VSCode on GPU: running and developing ipex-llm applications in Python using VSCode on Intel GPU

Use

  • llama.cpp: running llama.cpp (using C++ interface of ipex-llm as an accelerated backend for llama.cpp) on Intel GPU
  • Ollama: running ollama (using C++ interface of ipex-llm as an accelerated backend for ollama) on Intel GPU
  • Llama 3 with llama.cpp and ollama: running Llama 3 on Intel GPU using llama.cpp and ollama with ipex-llm
  • vLLM: running ipex-llm in vLLM on both Intel GPU and CPU
  • FastChat: running ipex-llm in FastChat serving on both Intel GPU and CPU
  • Serving on multiple Intel GPUs: running ipex-llm serving on multiple Intel GPUs by leveraging DeepSpeed AutoTP and FastAPI
  • Text-Generation-WebUI: running ipex-llm in oobabooga WebUI
  • Axolotl: running ipex-llm in Axolotl for LLM finetuning
  • Benchmarking: running (latency and throughput) benchmarks for ipex-llm on Intel CPU and GPU

Applications

  • GraphRAG: running Microsoft's GraphRAG using local LLM with ipex-llm
  • RAGFlow: running RAGFlow (an open-source RAG engine) with ipex-llm
  • LangChain-Chatchat: running LangChain-Chatchat (Knowledge Base QA using RAG pipeline) with ipex-llm
  • Coding copilot: running Continue (coding copilot in VSCode) with ipex-llm
  • Open WebUI: running Open WebUI with ipex-llm
  • PrivateGPT: running PrivateGPT to interact with documents with ipex-llm
  • Dify platform: running ipex-llm in Dify (production-ready LLM app development platform)

Install

Code Examples

API Doc

FAQ

Verified Models

Over 50 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.

Model CPU Example GPU Example
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) link1, link2 link
LLaMA 2 link1, link2 link
LLaMA 3 link link
LLaMA 3.1 link link
ChatGLM link
ChatGLM2 link link
ChatGLM3 link link
GLM-4 link link
GLM-4V link link
Mistral link link
Mixtral link link
Falcon link link
MPT link link
Dolly-v1 link link
Dolly-v2 link link
Replit Code link link
RedPajama link1, link2
Phoenix link1, link2
StarCoder link1, link2 link
Baichuan link link
Baichuan2 link link
InternLM link link
Qwen link link
Qwen1.5 link link
Qwen2 link link
Qwen-VL link link
Qwen2-Audio link
Aquila link link
Aquila2 link link
MOSS link
Whisper link link
Phi-1_5 link link
Flan-t5 link link
LLaVA link link
CodeLlama link link
Skywork link
InternLM-XComposer link
WizardCoder-Python link
CodeShell link
Fuyu link
Distil-Whisper link link
Yi link link
BlueLM link link
Mamba link link
SOLAR link link
Phixtral link link
InternLM2 link link
RWKV4 link
RWKV5 link
Bark link link
SpeechT5 link
DeepSeek-MoE link
Ziya-Coding-34B-v1.0 link
Phi-2 link link
Phi-3 link link
Phi-3-vision link link
Yuan2 link link
Gemma link link
Gemma2 link
DeciLM-7B link link
Deepseek link link
StableLM link link
CodeGemma link link
Command-R/cohere link link
CodeGeeX2 link link
MiniCPM link link
MiniCPM-V link
MiniCPM-V-2 link
MiniCPM-Llama3-V-2_5 link
MiniCPM-V-2_6 link

Get Support

Footnotes

  1. Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.
