diff --git a/docs/mddocs/Quickstart/npu_quickstart.md b/docs/mddocs/Quickstart/npu_quickstart.md index 6ffe46bce99..8b017e370c3 100644 --- a/docs/mddocs/Quickstart/npu_quickstart.md +++ b/docs/mddocs/Quickstart/npu_quickstart.md @@ -82,42 +82,28 @@ With the `llm-npu` environment active, use `pip` to install `ipex-llm` for NPU: conda activate llm-npu pip install --pre --upgrade ipex-llm[npu] - -:: [optional] for Llama-3.2-1B-Instruct & Llama-3.2-3B-Instruct -pip install transformers==4.45.0 accelerate==0.33.0 - -:: [optional] for glm-edge-1.5b-chat & glm-edge-4b-chat -pip install transformers==4.47.0 accelerate==0.26.0 ``` ## Runtime Configurations For `ipex-llm` NPU support, please set the following environment variable with active `llm-npu` environment based on your device: -
+- For **Intel Core™ Ultra Processors (Series 2) with processor number 2xxV**
-For Intel Core™ Ultra Processers (Series 2) with processor number 2xxV
-
-```bash
+```cmd
set BIGDL_USE_NPU=1
-# [optional] for Intel Core™ Ultra 5 Processor 228V & 226V
+:: [optional] for Intel Core™ Ultra 5 Processor 228V & 226V
set IPEX_LLM_NPU_DISABLE_COMPILE_OPT=1
```
-
- -
-
-For Intel Core™ Ultra Processers (Series 1) with processor number 1xxH
+- For **Intel Core™ Ultra Processors (Series 1) with processor number 1xxH**
-```bash
+```cmd
set BIGDL_USE_NPU=1
set IPEX_LLM_NPU_MTL=1
```
-
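+
+Note that `set` only applies to the current terminal session, so these variables must be set again in every new window. A quick way to confirm a variable took effect:
+
+```cmd
+:: prints 1 when the variable is set in this session
+echo %BIGDL_USE_NPU%
+```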
-
## Python API

IPEX-LLM offers Hugging Face `transformers`-like Python API, enabling seamless running of Hugging Face transformers models on Intel NPU.

Refer to the following table for examples of verified models:
| GLM-Edge | [THUDM/glm-edge-1.5b-chat](https://huggingface.co/THUDM/glm-edge-1.5b-chat), [THUDM/glm-edge-4b-chat](https://huggingface.co/THUDM/glm-edge-4b-chat) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) |
| MiniCPM | [openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16), [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) |
| Baichuan 2 | [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) |
-| MiniCPM-Llama3-V-2_5 | [openbmb/MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal#4-run-optimized-models-experimental) |
-| MiniCPM-V-2_6 | [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal#4-run-optimized-models-experimental) |
-| Bce-Embedding-Base-V1 | [maidalun1020/bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal#4-run-optimized-models-experimental) |
-| Speech_Paraformer-Large | [iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal#4-run-optimized-models-experimental) |
+| MiniCPM-Llama3-V-2_5 | [openbmb/MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental) |
+| MiniCPM-V-2_6 | [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental) |
+| Bce-Embedding-Base-V1 | [maidalun1020/bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental) |
+| Speech_Paraformer-Large | [iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental) |
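+
+As a quick illustration, a minimal sketch of this API looks like the following (the import path and arguments follow this repo's NPU LLM examples; the model id below is just a placeholder, and per-model arguments may differ — see each example's README):
+
+```python
+from ipex_llm.transformers.npu_model import AutoModelForCausalLM
+from transformers import AutoTokenizer
+
+model_path = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder: any verified model above
+
+# Load the model with low-bit optimizations for Intel NPU
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    load_in_low_bit="sym_int4",  # low-bit weight format
+    optimize_model=True,
+    trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+# Generate with the familiar Hugging Face `generate()` API
+input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids
+output = model.generate(input_ids, max_new_tokens=32)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```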
> [!TIP]

diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md
index a999d1142fd..de212384cce 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md
@@ -12,12 +12,27 @@ In this directory, you will find a C++ example on how to run LLM models on Intel
| MiniCPM | [openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16), [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) |
| Llama3.2 | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) |

-## 0. Install Prerequisites
+## 0. Prerequisites
For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations.

## 1. Install & Runtime Configurations
### 1.1 Installation on Windows
-Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for `ipex-llm` installation.
+We suggest using conda to manage environment:
+```cmd
+conda create -n llm python=3.11
+conda activate llm
+
+:: install ipex-llm with 'npu' option
+pip install --pre --upgrade ipex-llm[npu]
+
+:: [optional] for Llama-3.2-1B-Instruct & Llama-3.2-3B-Instruct
+pip install transformers==4.45.0 accelerate==0.33.0
+
+:: [optional] for glm-edge-1.5b-chat & glm-edge-4b-chat
+pip install transformers==4.47.0 accelerate==0.26.0
+```
+
+Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for more details about `ipex-llm` installation on Intel NPU.

### 1.2 Runtime Configurations
Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variable settings based on your device.

diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/Pipeline-Models/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/Pipeline-Models/README.md
index e94ce2d0c07..fbf52ed9d96 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/Pipeline-Models/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/Pipeline-Models/README.md
@@ -19,13 +19,28 @@ For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../../d
## 1. Install & Runtime Configurations
### 1.1 Installation on Windows
-Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for `ipex-llm` installation.
+We suggest using conda to manage environment:
+```cmd
+conda create -n llm python=3.11
+conda activate llm
+
+:: install ipex-llm with 'npu' option
+pip install --pre --upgrade ipex-llm[npu]
+
+:: [optional] for Llama-3.2-1B-Instruct & Llama-3.2-3B-Instruct
+pip install transformers==4.45.0 accelerate==0.33.0
+
+:: [optional] for glm-edge-1.5b-chat & glm-edge-4b-chat
+pip install transformers==4.47.0 accelerate==0.26.0
+```
+
+Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for more details about `ipex-llm` installation on Intel NPU.

### 1.2 Runtime Configurations
Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variable settings based on your device.

-## 2. Run Models
-In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs.
+## 2. Run Optimized Models
+The examples below show how to run the **_optimized HuggingFace model implementations_** on Intel NPU:

```cmd
:: to run Llama-2-7b-chat-hf

diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
index eb76f0ec112..e07cdd55f5f 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
@@ -22,14 +22,29 @@ In this directory, you will find examples on how to directly run HuggingFace `tr
| Mistral | [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) |

## 0. Prerequisites
-For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations.
+For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations.

## 1. Install & Runtime Configurations
### 1.1 Installation on Windows
-Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for `ipex-llm` installation.
+We suggest using conda to manage environment:
+```cmd
+conda create -n llm python=3.11
+conda activate llm
+
+:: install ipex-llm with 'npu' option
+pip install --pre --upgrade ipex-llm[npu]
+
+:: [optional] for Llama-3.2-1B-Instruct & Llama-3.2-3B-Instruct
+pip install transformers==4.45.0 accelerate==0.33.0
+
+:: [optional] for glm-edge-1.5b-chat & glm-edge-4b-chat
+pip install transformers==4.47.0 accelerate==0.26.0
+```
+
+Please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for more details about `ipex-llm` installation on Intel NPU.

### 1.2 Runtime Configurations
-Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variables setting based on your device.
+Please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variable settings based on your device.

## 2. Run Optimized Models (Experimental)
The examples below show how to run the **_optimized HuggingFace model implementations_** on Intel NPU, including

diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md
index d24c1e15920..faa6504c1d3 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md
@@ -11,16 +11,11 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
| Bce-Embedding-Base-V1 | [maidalun1020/bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) |
| Speech_Paraformer-Large | [iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch) |

-## Requirements
-To run these examples with IPEX-LLM on Intel NPUs, make sure to install the newest driver version of Intel NPU.
-Go to https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html to download and unzip the driver.
-Then go to **Device Manager**, find **Neural Processors** -> **Intel(R) AI Boost**. -Right click and select **Update Driver** -> **Browse my computer for drivers**. And then manually select the unzipped driver folder to install. - -## Example: Predict Tokens using `generate()` API -In the example [generate.py](./generate.py), we show a basic use case for a phi-3-vision model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs. -### 1. Install -#### 1.1 Installation on Windows +## 0. Prerequisites +For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations. + +## 1. Install +### 1.1 Installation on Windows We suggest using conda to manage environment: ```bash conda create -n llm python=3.10 libuv @@ -40,67 +35,19 @@ pip install BCEmbedding==0.1.5 transformers==4.40.0 pip install funasr==1.1.14 pip install modelscope==1.20.1 torch==2.1.2 torchaudio==2.1.2 ``` +Please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for more details about `ipex-llm` installation on Intel NPU. -### 2. Runtime Configurations -For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. -#### 2.1 Configurations for Windows - -> [!NOTE] -> For optimal performance, we recommend running code in `conhost` rather than Windows Terminal: -> - Press Win+R and input `conhost`, then press Enter to launch `conhost`. -> - Run following command to use conda in `conhost`. Replace `` with your conda install location. -> ``` -> call \Scripts\activate -> ``` - -**Following envrionment variables are required**: - -```cmd -set BIGDL_USE_NPU=1 -``` - -### 3. Running examples - -``` -python ./generate.py -``` - -Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Phi-3-vision model (e.g. `microsoft/Phi-3-vision-128k-instruct`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'microsoft/Phi-3-vision-128k-instruct'`, and more verified models please see the list in [Verified Models](#verified-models). -- `--lowbit-path LOWBIT_MODEL_PATH`: argument defining the path to save/load lowbit version of the model. If it is an empty string, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded. If it is an existing path, the lowbit model in `LOWBIT_MODEL_PATH` will be loaded. If it is a non-existing path, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded, and the converted lowbit version will be saved into `LOWBIT_MODEL_PATH`. It is default to be `''`, i.e. an empty string. -- `--image-url-or-path IMAGE_URL_OR_PATH`: argument defining the image to be infered. It is default to be `'http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg'`. -- `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is in the image?'`. -- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. -- `--load_in_low_bit`: argument defining the `load_in_low_bit` format used. It is default to be `sym_int8`, `sym_int4` can also be used. 
-
-#### Sample Output
-##### [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)
-
-```log
-Inference time: xxxx s
--------------------- Prompt --------------------
-Message: [{'role': 'user', 'content': '<|image_1|>\nWhat is in the image?'}]
-Image link/path: http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg
--------------------- Output --------------------
-
-
-What is in the image?
- The image shows a young girl holding a white teddy bear. She is wearing a pink dress with a heart on it. The background includes a stone
-```
-
-The sample input image is (which is fetched from [COCO dataset](https://cocodataset.org/#explore?id=264959)):
-
-

+### 1.2 Runtime Configurations
+Please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variable settings based on your device.

-## 4. Run Optimized Models (Experimental)
+## 2. Run Optimized Models (Experimental)
The examples below show how to run the **_optimized HuggingFace & FunASR model implementations_** on Intel NPU, including
- [MiniCPM-Llama3-V-2_5](./minicpm-llama3-v2.5.py)
- [MiniCPM-V-2_6](./minicpm_v_2_6.py)
- [Speech_Paraformer-Large](./speech_paraformer-large.py)
- [Bce-Embedding-Base-V1](./bce-embedding.py)

-### 4.1 Run MiniCPM-Llama3-V-2_5 & MiniCPM-V-2_6
+### 2.1 Run MiniCPM-Llama3-V-2_5 & MiniCPM-V-2_6
```bash
# to run MiniCPM-Llama3-V-2_5
python minicpm-llama3-v2.5.py --save-directory <save_directory>
@@ -132,7 +79,7 @@ What is in this image?
The image features a young child holding and showing off a white teddy bear wearing a pink dress. The background includes some red flowers and a stone wall, suggesting an outdoor setting.
```

-### 4.2 Run Speech_Paraformer-Large
+### 2.2 Run Speech_Paraformer-Large
```bash
# to run Speech_Paraformer-Large
python speech_paraformer-large.py --save-directory <save_directory>
@@ -156,7 +103,7 @@ rtf_avg: 0.232: 100%|███████████████████
[{'key': 'asr_example_zh', 'text': '欢 迎 大 家 来 体 验 达 摩 院 推 出 的 语 音 识 别 模 型'}]
```

-### 4.3 Run Bce-Embedding-Base-V1
+### 2.3 Run Bce-Embedding-Base-V1
```bash
# to run Bce-Embedding-Base-V1
python bce-embedding.py --save-directory <save_directory>
@@ -176,3 +123,38 @@ Inference time: xxx s
[-0.04398304 0.00023038 0.00643183 ... -0.02717186 0.00483789 0.02298774]]
```
+
+## 3. Running examples
+In the example [generate.py](./generate.py), we show a basic use case for a phi-3-vision model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs.
+
+```
+python ./generate.py
+```
+
+Arguments info:
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Phi-3-vision model (e.g. `microsoft/Phi-3-vision-128k-instruct`) to be downloaded, or the path to the huggingface checkpoint folder. It defaults to `'microsoft/Phi-3-vision-128k-instruct'`; for more verified models, please see the list in [Verified Models](#verified-models).
+- `--lowbit-path LOWBIT_MODEL_PATH`: argument defining the path to save/load the lowbit version of the model. If it is an empty string, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded. If it is an existing path, the lowbit model in `LOWBIT_MODEL_PATH` will be loaded. If it is a non-existing path, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded, and the converted lowbit version will be saved into `LOWBIT_MODEL_PATH`. It defaults to `''`, i.e. an empty string.
+- `--image-url-or-path IMAGE_URL_OR_PATH`: argument defining the image to be inferred. It defaults to `'http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg'`.
+- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is in the image?'`.
+- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
+- `--load_in_low_bit`: argument defining the `load_in_low_bit` format used. It defaults to `sym_int8`; `sym_int4` can also be used.
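+
+For instance, a typical pair of runs might look like the following (the lowbit path here is just an illustrative placeholder):
+
+```cmd
+:: first run: convert the model and save the lowbit copy into --lowbit-path
+python ./generate.py --repo-id-or-model-path microsoft/Phi-3-vision-128k-instruct --lowbit-path .\phi-3-vision-lowbit --n-predict 32
+
+:: later runs: the existing path is detected and the saved lowbit model is loaded directly
+python ./generate.py --lowbit-path .\phi-3-vision-lowbit --prompt "What is in the image?"
+```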
+
+### Sample Output
+#### [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)
+
+```log
+Inference time: xxxx s
+-------------------- Prompt --------------------
+Message: [{'role': 'user', 'content': '<|image_1|>\nWhat is in the image?'}]
+Image link/path: http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg
+-------------------- Output --------------------
+
+
+What is in the image?
+ The image shows a young girl holding a white teddy bear. She is wearing a pink dress with a heart on it. The background includes a stone
+```
+
+The sample input image (fetched from the [COCO dataset](https://cocodataset.org/#explore?id=264959)) is:
+
+![sample image](http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg)