From 099486afb72d4aa3aad9672beaa0e0043a2fc8c5 Mon Sep 17 00:00:00 2001
From: Jason Dai
Date: Mon, 8 Jul 2024 20:18:41 +0800
Subject: [PATCH] Update README.md (#11530)

---
 README.md                                     | 20 ++++++++++---------
 .../GPU/HuggingFace/More-Data-Types/README.md |  6 +++---
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 618b0148881..c83e613ded8 100644
--- a/README.md
+++ b/README.md
@@ -11,6 +11,8 @@
 > - ***50+ models** have been optimized/verified on `ipex-llm` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list [here](#verified-models).*
 
 ## Latest Update 🔥
+- [2024/07] We added extensive support for Large Multimodal Models, including [StableDiffusion](https://github.com/jason-dai/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion), [Phi-3-Vision](python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision), [Qwen-VL](python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl), and [more](python/llm/example/GPU/HuggingFace/Multimodal).
+- [2024/07] We added **FP6** support on Intel [GPU](python/llm/example/GPU/HuggingFace/More-Data-Types).
 - [2024/06] We added experimental **NPU** support for Intel Core Ultra processors; see the examples [here](python/llm/example/NPU/HF-Transformers-AutoModels).
 - [2024/06] We added extensive support of **pipeline parallel** [inference](python/llm/example/GPU/Pipeline-Parallel-Inference), which makes it easy to run large-sized LLM using 2 or more Intel GPUs (such as Arc).
 - [2024/06] We added support for running **RAGFlow** with `ipex-llm` on Intel [GPU](docs/mddocs/Quickstart/ragflow_quickstart.md).
@@ -33,7 +35,7 @@
 - [2024/02] `ipex-llm` now supports a comprehensive list of LLM **finetuning** on Intel GPU (including [LoRA](python/llm/example/GPU/LLM-Finetuning/LoRA), [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), [DPO](python/llm/example/GPU/LLM-Finetuning/DPO), [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) and [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora)).
 - [2024/01] Using `ipex-llm` [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPU for [Standford-Alpaca](python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora) (see the blog [here](https://www.intel.com/content/www/us/en/developer/articles/technical/finetuning-llms-on-intel-gpus-using-bigdl-llm.html)).
 - [2023/12] `ipex-llm` now supports [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora) (see *["ReLoRA: High-Rank Training Through Low-Rank Updates"](https://arxiv.org/abs/2307.05695)*).
-- [2023/12] `ipex-llm` now supports [Mixtral-8x7B](python/llm/example/GPU/HuggingFace/LLM/mixtral) on both Intel [GPU](python/llm/example/HuggingFace/LLM/mixtral) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral).
+- [2023/12] `ipex-llm` now supports [Mixtral-8x7B](python/llm/example/GPU/HuggingFace/LLM/mixtral) on both Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/mixtral) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral).
 - [2023/12] `ipex-llm` now supports [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) (see *["QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models"](https://arxiv.org/abs/2309.14717)*).
 - [2023/12] `ipex-llm` now supports [FP8 and FP4 inference](python/llm/example/GPU/HuggingFace/More-Data-Types) on Intel ***GPU***.
 - [2023/11] Initial support for directly loading [GGUF](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF), [AWQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ) and [GPTQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ) models into `ipex-llm` is available.
@@ -196,26 +198,26 @@ Please see the **Perplexity** result below (tested on Wikitext dataset using the
 - *For more details, please refer to the [full installation guide](docs/mddocs/Overview/install.md)*
 
 ### Code Examples
-- Low bit inference
+- #### Low bit inference
   - [INT4 inference](python/llm/example/GPU/HuggingFace/LLM): **INT4** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/LLM) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model)
-  - [FP8/FP4 inference](python/llm/example/GPU/HuggingFace/LLM/More-Data-Types): **FP8** and **FP4** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/More-Data-Types)
-  - [INT8 inference](python/llm/example/GPU/HuggingFace/LLM/More-Data-Types): **INT8** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/More-Data-Types) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types)
+  - [FP8/FP6/FP4 inference](python/llm/example/GPU/HuggingFace/More-Data-Types): **FP8**, **FP6** and **FP4** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/More-Data-Types)
+  - [INT8 inference](python/llm/example/GPU/HuggingFace/More-Data-Types): **INT8** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/More-Data-Types) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types)
   - [INT2 inference](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2): **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel [GPU](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2)
-- FP16/BF16 inference
+- #### FP16/BF16 inference
   - **FP16** LLM inference on Intel [GPU](python/llm/example/GPU/Speculative-Decoding), with possible [self-speculative decoding](docs/mddocs/Inference/Self_Speculative_Decoding.md) optimization
   - **BF16** LLM inference on Intel [CPU](python/llm/example/CPU/Speculative-Decoding), with possible [self-speculative decoding](docs/mddocs/Inference/Self_Speculative_Decoding.md) optimization
-- Distributed inference
+- #### Distributed inference
   - **Pipeline Parallel** inference on Intel [GPU](python/llm/example/GPU/Pipeline-Parallel-Inference)
   - **DeepSpeed AutoTP** inference on Intel [GPU](python/llm/example/GPU/Deepspeed-AutoTP)
-- Save and load
+- #### Save and load
   - [Low-bit models](python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load): saving and loading `ipex-llm` low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.)
   - [GGUF](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF): directly loading GGUF models into `ipex-llm`
   - [AWQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ): directly loading AWQ models into `ipex-llm`
   - [GPTQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ): directly loading GPTQ models into `ipex-llm`
-- Finetuning
+- #### Finetuning
   - LLM finetuning on Intel [GPU](python/llm/example/GPU/LLM-Finetuning), including [LoRA](python/llm/example/GPU/LLM-Finetuning/LoRA), [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), [DPO](python/llm/example/GPU/LLM-Finetuning/DPO), [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) and [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora)
   - QLoRA finetuning on Intel [CPU](python/llm/example/CPU/QLoRA-FineTuning)
-- Integration with community libraries
+- #### Integration with community libraries
   - [HuggingFace transformers](python/llm/example/GPU/HuggingFace)
   - [Standard PyTorch model](python/llm/example/GPU/PyTorch-Models)
   - [LangChain](python/llm/example/GPU/LangChain)
diff --git a/python/llm/example/GPU/HuggingFace/More-Data-Types/README.md b/python/llm/example/GPU/HuggingFace/More-Data-Types/README.md
index d97d0e40361..0c40ca2cf03 100644
--- a/python/llm/example/GPU/HuggingFace/More-Data-Types/README.md
+++ b/python/llm/example/GPU/HuggingFace/More-Data-Types/README.md
@@ -1,6 +1,6 @@
-# IPEX-LLM Transformers Low-Bit Inference Pipeline (FP8, FP4, INT4 and more)
+# IPEX-LLM Transformers Low-Bit Inference Pipeline (FP8, FP6, FP4, INT4 and more)
 
-In this example, we show a pipeline to apply IPEX-LLM low-bit optimizations (including **FP8/INT8/MixedFP8/FP4/INT4/MixedFP4**) to any Hugging Face Transformers model, and then run inference on the optimized low-bit model.
+In this example, we show a pipeline to apply IPEX-LLM low-bit optimizations (including **FP8/INT8/FP6/FP4/INT4**) to any Hugging Face Transformers model, and then run inference on the optimized low-bit model.
 
 ## Prepare Environment
 We suggest using conda to manage environment:
@@ -18,7 +18,7 @@ python ./transformers_low_bit_pipeline.py --repo-id-or-model-path meta-llama/Lla
 ```
 arguments info:
 - `--repo-id-or-model-path`: str value, argument defining the huggingface repo id for the large language model to be downloaded, or the path to the huggingface checkpoint folder, the value is `meta-llama/Llama-2-7b-chat-hf` by default.
-- `--low-bit`: str value, options are fp8, sym_int8, fp4, sym_int4, mixed_fp8 or mixed_fp4. Relevant low bit optimizations will be applied to the model.
+- `--low-bit`: str value, options are fp8, fp6, sym_int8, fp4, sym_int4, mixed_fp8 or mixed_fp4. Relevant low bit optimizations will be applied to the model.
 - `--save-path`: str value, the path to save the low-bit model. Then you can load the low-bit directly.
 - `--load-path`: optional str value. The path to load low-bit model.
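---

For reviewers who want to try the new `fp6` option end to end, below is a minimal sketch of the low-bit pipeline this patch documents, using the `load_in_low_bit` / `save_low_bit` / `load_low_bit` interfaces from the repo's HuggingFace examples. The save path and prompt are placeholders, and the repo's `transformers_low_bit_pipeline.py` remains the reference implementation.

```python
# Hedged sketch of FP6 low-bit inference with ipex-llm; not the repo script itself.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # repo id or local checkpoint folder
save_path = "./llama-2-7b-chat-fp6"           # placeholder save location

# Apply FP6 optimization while loading the Hugging Face checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="fp6",  # other options: fp8, sym_int8, fp4, sym_int4, mixed_fp8, mixed_fp4
    trust_remote_code=True,
)
model = model.to("xpu")  # move the optimized model to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Save the quantized weights once; later runs can reload them without re-quantizing:
model.save_low_bit(save_path)
# model = AutoModelForCausalLM.load_low_bit(save_path, trust_remote_code=True).to("xpu")

with torch.inference_mode():
    input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Reloading via `load_low_bit` skips the quantization pass entirely, which is the point of the `--save-path` / `--load-path` pair described in the README above.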