From 099486afb72d4aa3aad9672beaa0e0043a2fc8c5 Mon Sep 17 00:00:00 2001
From: Jason Dai
Date: Mon, 8 Jul 2024 20:18:41 +0800
Subject: [PATCH] Update README.md (#11530)

---
 README.md                                     | 20 ++++++++++---------
 .../GPU/HuggingFace/More-Data-Types/README.md |  6 +++---
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 618b0148881..c83e613ded8 100644
--- a/README.md
+++ b/README.md
@@ -11,6 +11,8 @@
 > - ***50+ models** have been optimized/verified on `ipex-llm` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list [here](#verified-models).*
 
 ## Latest Update 🔥
+- [2024/07] We added extensive support for Large Multimodal Models, including [StableDiffusion](https://github.com/jason-dai/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion), [Phi-3-Vision](python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision), [Qwen-VL](python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl), and [more](python/llm/example/GPU/HuggingFace/Multimodal).
+- [2024/07] We added **FP6** support on Intel [GPU](python/llm/example/GPU/HuggingFace/More-Data-Types).
 - [2024/06] We added experimental **NPU** support for Intel Core Ultra processors; see the examples [here](python/llm/example/NPU/HF-Transformers-AutoModels).
 - [2024/06] We added extensive support of **pipeline parallel** [inference](python/llm/example/GPU/Pipeline-Parallel-Inference), which makes it easy to run large-sized LLM using 2 or more Intel GPUs (such as Arc).
 - [2024/06] We added support for running **RAGFlow** with `ipex-llm` on Intel [GPU](docs/mddocs/Quickstart/ragflow_quickstart.md).
@@ -33,7 +35,7 @@
 - [2024/02] `ipex-llm` now supports a comprehensive list of LLM **finetuning** on Intel GPU (including [LoRA](python/llm/example/GPU/LLM-Finetuning/LoRA), [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), [DPO](python/llm/example/GPU/LLM-Finetuning/DPO), [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) and [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora)).
 - [2024/01] Using `ipex-llm` [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPU for [Standford-Alpaca](python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora) (see the blog [here](https://www.intel.com/content/www/us/en/developer/articles/technical/finetuning-llms-on-intel-gpus-using-bigdl-llm.html)).
 - [2023/12] `ipex-llm` now supports [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora) (see *["ReLoRA: High-Rank Training Through Low-Rank Updates"](https://arxiv.org/abs/2307.05695)*).
-- [2023/12] `ipex-llm` now supports [Mixtral-8x7B](python/llm/example/GPU/HuggingFace/LLM/mixtral) on both Intel [GPU](python/llm/example/HuggingFace/LLM/mixtral) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral).
+- [2023/12] `ipex-llm` now supports [Mixtral-8x7B](python/llm/example/GPU/HuggingFace/LLM/mixtral) on both Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/mixtral) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral).
 - [2023/12] `ipex-llm` now supports [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) (see *["QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models"](https://arxiv.org/abs/2309.14717)*).
 - [2023/12] `ipex-llm` now supports [FP8 and FP4 inference](python/llm/example/GPU/HuggingFace/More-Data-Types) on Intel ***GPU***.
 - [2023/11] Initial support for directly loading [GGUF](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF), [AWQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ) and [GPTQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ) models into `ipex-llm` is available.
@@ -196,26 +198,26 @@ Please see the **Perplexity** result below (tested on Wikitext dataset using the
 - *For more details, please refer to the [full installation guide](docs/mddocs/Overview/install.md)*
 
 ### Code Examples
-- Low bit inference
+- #### Low bit inference
   - [INT4 inference](python/llm/example/GPU/HuggingFace/LLM): **INT4** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/LLM) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model)
-  - [FP8/FP4 inference](python/llm/example/GPU/HuggingFace/LLM/More-Data-Types): **FP8** and **FP4** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/More-Data-Types)
-  - [INT8 inference](python/llm/example/GPU/HuggingFace/LLM/More-Data-Types): **INT8** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/More-Data-Types) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types)
+  - [FP8/FP6/FP4 inference](python/llm/example/GPU/HuggingFace/More-Data-Types): **FP8**, **FP6** and **FP4** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/More-Data-Types)
+  - [INT8 inference](python/llm/example/GPU/HuggingFace/More-Data-Types): **INT8** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/More-Data-Types) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types)
   - [INT2 inference](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2): **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel [GPU](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2)
-- FP16/BF16 inference
+- #### FP16/BF16 inference
   - **FP16** LLM inference on Intel [GPU](python/llm/example/GPU/Speculative-Decoding), with possible [self-speculative decoding](docs/mddocs/Inference/Self_Speculative_Decoding.md) optimization
   - **BF16** LLM inference on Intel [CPU](python/llm/example/CPU/Speculative-Decoding), with possible [self-speculative decoding](docs/mddocs/Inference/Self_Speculative_Decoding.md) optimization
-- Distributed inference
+- #### Distributed inference
   - **Pipeline Parallel** inference on Intel [GPU](python/llm/example/GPU/Pipeline-Parallel-Inference)
   - **DeepSpeed AutoTP** inference on Intel [GPU](python/llm/example/GPU/Deepspeed-AutoTP)
-- Save and load
+- #### Save and load
   - [Low-bit models](python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load): saving and loading `ipex-llm` low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.)
   - [GGUF](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF): directly loading GGUF models into `ipex-llm`
   - [AWQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ): directly loading AWQ models into `ipex-llm`
   - [GPTQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ): directly loading GPTQ models into `ipex-llm`
-- Finetuning
+- #### Finetuning
   - LLM finetuning on Intel [GPU](python/llm/example/GPU/LLM-Finetuning), including [LoRA](python/llm/example/GPU/LLM-Finetuning/LoRA), [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), [DPO](python/llm/example/GPU/LLM-Finetuning/DPO), [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) and [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora)
   - QLoRA finetuning on Intel [CPU](python/llm/example/CPU/QLoRA-FineTuning)
-- Integration with community libraries
+- #### Integration with community libraries
   - [HuggingFace transformers](python/llm/example/GPU/HuggingFace)
   - [Standard PyTorch model](python/llm/example/GPU/PyTorch-Models)
   - [LangChain](python/llm/example/GPU/LangChain)
diff --git a/python/llm/example/GPU/HuggingFace/More-Data-Types/README.md b/python/llm/example/GPU/HuggingFace/More-Data-Types/README.md
index d97d0e40361..0c40ca2cf03 100644
--- a/python/llm/example/GPU/HuggingFace/More-Data-Types/README.md
+++ b/python/llm/example/GPU/HuggingFace/More-Data-Types/README.md
@@ -1,6 +1,6 @@
-# IPEX-LLM Transformers Low-Bit Inference Pipeline (FP8, FP4, INT4 and more)
+# IPEX-LLM Transformers Low-Bit Inference Pipeline (FP8, FP6, FP4, INT4 and more)
 
-In this example, we show a pipeline to apply IPEX-LLM low-bit optimizations (including **FP8/INT8/MixedFP8/FP4/INT4/MixedFP4**) to any Hugging Face Transformers model, and then run inference on the optimized low-bit model.
+In this example, we show a pipeline to apply IPEX-LLM low-bit optimizations (including **FP8/INT8/FP6/FP4/INT4**) to any Hugging Face Transformers model, and then run inference on the optimized low-bit model.
 
 ## Prepare Environment
 We suggest using conda to manage environment:
@@ -18,7 +18,7 @@ python ./transformers_low_bit_pipeline.py --repo-id-or-model-path meta-llama/Lla
 ```
 arguments info:
 - `--repo-id-or-model-path`: str value, argument defining the huggingface repo id for the large language model to be downloaded, or the path to the huggingface checkpoint folder, the value is `meta-llama/Llama-2-7b-chat-hf` by default.
-- `--low-bit`: str value, options are fp8, sym_int8, fp4, sym_int4, mixed_fp8 or mixed_fp4. Relevant low bit optimizations will be applied to the model.
+- `--low-bit`: str value, options are fp8, fp6, sym_int8, fp4, sym_int4, mixed_fp8 or mixed_fp4. Relevant low bit optimizations will be applied to the model.
 - `--save-path`: str value, the path to save the low-bit model. Then you can load the low-bit directly.
 - `--load-path`: optional str value. The path to load low-bit model.
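---

For reviewers who want to try the new `fp6` option end to end, below is a minimal sketch of the low-bit pipeline this patch documents, using the `load_in_low_bit` / `save_low_bit` / `load_low_bit` interfaces from the repo's HuggingFace examples. The save path and prompt are placeholders, and the repo's `transformers_low_bit_pipeline.py` remains the reference implementation.

```python
# Hedged sketch of FP6 low-bit inference with ipex-llm; not the repo script itself.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # repo id or local checkpoint folder
save_path = "./llama-2-7b-chat-fp6"           # placeholder save location

# Apply FP6 optimization while loading the Hugging Face checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="fp6",  # other options: fp8, sym_int8, fp4, sym_int4, mixed_fp8, mixed_fp4
    trust_remote_code=True,
)
model = model.to("xpu")  # move the optimized model to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Save the quantized weights once; later runs can reload them without re-quantizing:
model.save_low_bit(save_path)
# model = AutoModelForCausalLM.load_low_bit(save_path, trust_remote_code=True).to("xpu")

with torch.inference_mode():
    input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Reloading via `load_low_bit` skips the quantization pass entirely, which is the point of the `--save-path` / `--load-path` pair described in the README above.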