diff --git a/docs/mddocs/Quickstart/npu_quickstart.md b/docs/mddocs/Quickstart/npu_quickstart.md
index 0cb15750377..b4f445dd149 100644
--- a/docs/mddocs/Quickstart/npu_quickstart.md
+++ b/docs/mddocs/Quickstart/npu_quickstart.md
@@ -2,7 +2,7 @@ This guide demonstrates:

-- How to install IPEX-LLM for Intel NPU on Intel Core™ Ultra Processers (Series 2)
+- How to install IPEX-LLM for Intel NPU on Intel Core™ Ultra Processors
- Python and C++ APIs for running IPEX-LLM on Intel NPU

> [!IMPORTANT]
@@ -19,9 +19,6 @@ This guide demonstrates:

## Install Prerequisites

-> [!NOTE]
-> IPEX-LLM NPU support on Windows has been verified on Intel Core™ Ultra Processers (Series 2) with processor number 2xxV (code name Lunar Lake).
-
### Update NPU Driver

> [!IMPORTANT]
@@ -86,14 +83,27 @@ pip install --pre --upgrade ipex-llm[npu]

## Runtime Configurations

-For `ipex-llm` NPU support, set the following environment variable with active `llm-npu` environment:
+For `ipex-llm` NPU support, please set the following environment variables in the active `llm-npu` environment, based on your device:

-```cmd
-set BIGDL_USE_NPU=1
+- For **Intel Core™ Ultra Processors (Series 2) with processor number 2xxV (code name Lunar Lake)**:

-:: [optional] for MTL support
-set IPEX_LLM_NPU_MTL=1
-```
+  - For Intel Core™ Ultra 7 Processor 258V:
+    ```cmd
+    set BIGDL_USE_NPU=1
+    ```
+
+  - For Intel Core™ Ultra 5 Processor 228V & 226V:
+    ```cmd
+    set BIGDL_USE_NPU=1
+    set IPEX_LLM_NPU_DISABLE_COMPILE_OPT=1
+    ```
+
+- For **Intel Core™ Ultra Processors (Series 1) with processor number 1xxH (code name Meteor Lake)**:
+
+  ```cmd
+  set BIGDL_USE_NPU=1
+  set IPEX_LLM_NPU_MTL=1
+  ```

## Python API

@@ -103,18 +113,18 @@ Refer to the following table for examples of verified models:
[](../../../python/llm/)
| Model | Model link | Example link |
|:--|:--|:--|
-| LLaMA 2 | [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM#4-run-optimized-models-experimental) |
-| LLaMA 3 | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM#4-run-optimized-models-experimental) |
-| LLaMA 3.2 | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM#4-run-optimized-models-experimental) |
-| Qwen 2 | [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct), [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM#4-run-optimized-models-experimental) |
-| Qwen 2.5 | [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM#4-run-optimized-models-experimental) |
-| GLM-Edge | [THUDM/glm-edge-1.5b-chat](https://huggingface.co/THUDM/glm-edge-1.5b-chat), [THUDM/glm-edge-4b-chat](https://huggingface.co/THUDM/glm-edge-4b-chat) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM#4-run-optimized-models-experimental) |
-| MiniCPM | [openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16),
[openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM#4-run-optimized-models-experimental) | -| Baichuan 2 | [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM#4-run-optimized-models-experimental) | -| MiniCPM-Llama3-V-2_5 | [openbmb/MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal#4-run-optimized-models-experimental) | -| MiniCPM-V-2_6 | [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal#4-run-optimized-models-experimental) | -| Bce-Embedding-Base-V1 | [maidalun1020/bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal#4-run-optimized-models-experimental) | -| Speech_Paraformer-Large | [iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal#4-run-optimized-models-experimental) | +| LLaMA 2 | [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) | +| LLaMA 3 | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) | +| LLaMA 3.2 | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) | +| Qwen 2 | [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct), [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) | +| Qwen 2.5 | [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) | +| GLM-Edge | [THUDM/glm-edge-1.5b-chat](https://huggingface.co/THUDM/glm-edge-1.5b-chat), [THUDM/glm-edge-4b-chat](https://huggingface.co/THUDM/glm-edge-4b-chat) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) | +| MiniCPM | [openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16), [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) | +| Baichuan 2 | [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) | 
[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) | +| MiniCPM-Llama3-V-2_5 | [openbmb/MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental) | +| MiniCPM-V-2_6 | [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental) | +| Bce-Embedding-Base-V1 | [maidalun1020/bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental) | +| Speech_Paraformer-Large | [iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental) | > [!TIP] diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md index 4bd67aacd9b..3d77ac35b4d 100644 --- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md +++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md @@ -12,17 +12,14 @@ In this directory, you will find a C++ example on how to run LLM models on Intel | MiniCPM | [openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16), [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) | | Llama3.2 | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | -## 0. Requirements -To run this C++ example with IPEX-LLM on Intel NPUs, make sure to install the newest driver version of Intel NPU. -Go to https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html to download and unzip the driver. -Then go to **Device Manager**, find **Neural Processors** -> **Intel(R) AI Boost**. -Right click and select **Update Driver** -> **Browse my computer for drivers**. And then manually select the unzipped driver folder to install. +## 0. Prerequisites +For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations. -## 1. Install +## 1. Install & Runtime Configurations ### 1.1 Installation on Windows We suggest using conda to manage environment: ```cmd -conda create -n llm python=3.10 +conda create -n llm python=3.11 conda activate llm :: install ipex-llm with 'npu' option @@ -32,6 +29,11 @@ pip install --pre --upgrade ipex-llm[npu] pip install transformers==4.45.0 accelerate==0.33.0 ``` +Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for more details about `ipex-llm` installation on Intel NPU. + +### 1.2 Runtime Configurations +Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variables setting based on your device. + ## 2. 
Convert Model We provide a [convert script](convert.py) under current directory, by running it, you can obtain the whole weights and configuration files which are required to run C++ example. diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/Pipeline-Models/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/Pipeline-Models/README.md index 30db6e3f9bf..16024a837bc 100644 --- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/Pipeline-Models/README.md +++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/Pipeline-Models/README.md @@ -14,17 +14,14 @@ In this directory, you will find examples on how to directly run HuggingFace `tr | Baichuan2 | [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan-7B-Chat) | | MiniCPM | [openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16), [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) | -## 0. Requirements -To run these examples with IPEX-LLM on Intel NPUs, make sure to install the newest driver version of Intel NPU. -Go to https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html to download and unzip the driver. -Then go to **Device Manager**, find **Neural Processors** -> **Intel(R) AI Boost**. -Right click and select **Update Driver** -> **Browse my computer for drivers**. And then manually select the unzipped driver folder to install. +## 0. Prerequisites +For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations. -## 1. Install +## 1. Install & Runtime Configurations ### 1.1 Installation on Windows We suggest using conda to manage environment: ```cmd -conda create -n llm python=3.10 +conda create -n llm python=3.11 conda activate llm :: install ipex-llm with 'npu' option @@ -34,16 +31,13 @@ pip install --pre --upgrade ipex-llm[npu] pip install transformers==4.45.0 accelerate==0.33.0 ``` -## 2. Runtime Configurations +Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for more details about `ipex-llm` installation on Intel NPU. -**Following environment variables are required**: +### 1.2 Runtime Configurations +Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variables setting based on your device. -```cmd -set BIGDL_USE_NPU=1 -``` - -## 3. Run Models -In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs. +## 2. 
Run Optimized Models +The examples below show how to run the **_optimized HuggingFace model implementations_** on Intel NPU: ```cmd :: to run Llama-2-7b-chat-hf diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md index e8ed3db7031..e07cdd55f5f 100644 --- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md +++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md @@ -21,17 +21,14 @@ In this directory, you will find examples on how to directly run HuggingFace `tr | Deepseek | [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) | | Mistral | [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | -## 0. Requirements -To run these examples with IPEX-LLM on Intel NPUs, make sure to install the newest driver version of Intel NPU. -Go to https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html to download and unzip the driver. -Then go to **Device Manager**, find **Neural Processors** -> **Intel(R) AI Boost**. -Right click and select **Update Driver** -> **Browse my computer for drivers**. And then manually select the unzipped driver folder to install. +## 0. Prerequisites +For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations. -## 1. Install +## 1. Install & Runtime Configurations ### 1.1 Installation on Windows We suggest using conda to manage environment: ```cmd -conda create -n llm python=3.10 +conda create -n llm python=3.11 conda activate llm :: install ipex-llm with 'npu' option @@ -44,53 +41,12 @@ pip install transformers==4.45.0 accelerate==0.33.0 pip install transformers==4.47.0 accelerate==0.26.0 ``` -## 2. Runtime Configurations -For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. -### 2.1 Configurations for Windows +Please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for more details about `ipex-llm` installation on Intel NPU. -> [!NOTE] -> For optimal performance, we recommend running code in `conhost` rather than Windows Terminal: -> - Search for `conhost` in the Windows search bar and run as administrator -> - Run following command to use conda in `conhost`. Replace `` with your conda install location. -> ``` -> call \Scripts\activate -> ``` +### 1.2 Runtime Configurations +Please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variables setting based on your device. -**Following envrionment variables are required**: - -```cmd -set BIGDL_USE_NPU=1 - -:: [optional] for running models on MTL -set IPEX_LLM_NPU_MTL=1 -``` - -## 3. Run Models -In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs. - -``` -python ./generate.py -``` - -Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf`) to be downloaded, or the path to the huggingface checkpoint folder. 
It is default to be `'meta-llama/Llama-2-7b-chat-hf'`, and more verified models please see the list in [Verified Models](#verified-models). -- `--lowbit-path LOWBIT_MODEL_PATH`: argument defining the path to save/load lowbit version of the model. If it is an empty string, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded. If it is an existing path, the lowbit model in `LOWBIT_MODEL_PATH` will be loaded. If it is a non-existing path, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded, and the converted lowbit version will be saved into `LOWBIT_MODEL_PATH`. It is default to be `''`, i.e. an empty string. -- `--prompt PROMPT`: argument defining the prompt to be infered. It is default to be `'Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun'`. -- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. -- `--low_bit`: argument defining the `low_bit` format used. It is default to be `sym_int8`, `sym_int4` can also be used. - -### Sample Output -#### [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) - -```log -Inference time: xxxx s --------------------- Output -------------------- - Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. But her parents were always telling her to stay at home and be careful. They were worried about her safety, and they didn't want her to --------------------------------------------------------------------------------- -done -``` - -## 4. Run Optimized Models (Experimental) +## 2. Run Optimized Models (Experimental) The examples below show how to run the **_optimized HuggingFace model implementations_** on Intel NPU, including - [Llama2-7B](./llama2.py) - [Llama3-8B](./llama3.py) @@ -188,3 +144,28 @@ What is AI? [/INST] What is AI? [/INST] AI (Artificial Intelligence) is a field of computer science and engineering that focuses on the development of intelligent machines that can perform tasks ``` + +## 3. Run Models +In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs. + +``` +python ./generate.py +``` + +Arguments info: +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'meta-llama/Llama-2-7b-chat-hf'`, and more verified models please see the list in [Verified Models](#verified-models). +- `--lowbit-path LOWBIT_MODEL_PATH`: argument defining the path to save/load lowbit version of the model. If it is an empty string, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded. If it is an existing path, the lowbit model in `LOWBIT_MODEL_PATH` will be loaded. If it is a non-existing path, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded, and the converted lowbit version will be saved into `LOWBIT_MODEL_PATH`. It is default to be `''`, i.e. an empty string. +- `--prompt PROMPT`: argument defining the prompt to be infered. It is default to be `'Once upon a time, there existed a little girl who liked to have adventures. 
She wanted to go to places and meet new people, and have fun'`. +- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. +- `--low_bit`: argument defining the `low_bit` format used. It is default to be `sym_int8`, `sym_int4` can also be used. + +### Sample Output +#### [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) + +```log +Inference time: xxxx s +-------------------- Output -------------------- + Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. But her parents were always telling her to stay at home and be careful. They were worried about her safety, and they didn't want her to +-------------------------------------------------------------------------------- +done +``` diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md index d24c1e15920..faa6504c1d3 100644 --- a/python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md +++ b/python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md @@ -11,16 +11,11 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o | Bce-Embedding-Base-V1 | [maidalun1020/bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) | | Speech_Paraformer-Large | [iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch) | -## Requirements -To run these examples with IPEX-LLM on Intel NPUs, make sure to install the newest driver version of Intel NPU. -Go to https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html to download and unzip the driver. -Then go to **Device Manager**, find **Neural Processors** -> **Intel(R) AI Boost**. -Right click and select **Update Driver** -> **Browse my computer for drivers**. And then manually select the unzipped driver folder to install. - -## Example: Predict Tokens using `generate()` API -In the example [generate.py](./generate.py), we show a basic use case for a phi-3-vision model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs. -### 1. Install -#### 1.1 Installation on Windows +## 0. Prerequisites +For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations. + +## 1. Install +### 1.1 Installation on Windows We suggest using conda to manage environment: ```bash conda create -n llm python=3.10 libuv @@ -40,67 +35,19 @@ pip install BCEmbedding==0.1.5 transformers==4.40.0 pip install funasr==1.1.14 pip install modelscope==1.20.1 torch==2.1.2 torchaudio==2.1.2 ``` +Please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for more details about `ipex-llm` installation on Intel NPU. -### 2. Runtime Configurations -For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. -#### 2.1 Configurations for Windows - -> [!NOTE] -> For optimal performance, we recommend running code in `conhost` rather than Windows Terminal: -> - Press Win+R and input `conhost`, then press Enter to launch `conhost`. 
-> - Run following command to use conda in `conhost`. Replace `` with your conda install location. -> ``` -> call \Scripts\activate -> ``` - -**Following envrionment variables are required**: - -```cmd -set BIGDL_USE_NPU=1 -``` - -### 3. Running examples - -``` -python ./generate.py -``` - -Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Phi-3-vision model (e.g. `microsoft/Phi-3-vision-128k-instruct`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'microsoft/Phi-3-vision-128k-instruct'`, and more verified models please see the list in [Verified Models](#verified-models). -- `--lowbit-path LOWBIT_MODEL_PATH`: argument defining the path to save/load lowbit version of the model. If it is an empty string, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded. If it is an existing path, the lowbit model in `LOWBIT_MODEL_PATH` will be loaded. If it is a non-existing path, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded, and the converted lowbit version will be saved into `LOWBIT_MODEL_PATH`. It is default to be `''`, i.e. an empty string. -- `--image-url-or-path IMAGE_URL_OR_PATH`: argument defining the image to be infered. It is default to be `'http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg'`. -- `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is in the image?'`. -- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. -- `--load_in_low_bit`: argument defining the `load_in_low_bit` format used. It is default to be `sym_int8`, `sym_int4` can also be used. - - -#### Sample Output -##### [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) - -```log -Inference time: xxxx s --------------------- Prompt -------------------- -Message: [{'role': 'user', 'content': '<|image_1|>\nWhat is in the image?'}] -Image link/path: http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg --------------------- Output -------------------- - - -What is in the image? - The image shows a young girl holding a white teddy bear. She is wearing a pink dress with a heart on it. The background includes a stone -``` - -The sample input image is (which is fetched from [COCO dataset](https://cocodataset.org/#explore?id=264959)): - - +### 1.2 Runtime Configurations +Please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variables setting based on your device. -## 4. Run Optimized Models (Experimental) +## 2. Run Optimized Models (Experimental) The examples below show how to run the **_optimized HuggingFace & FunASR model implementations_** on Intel NPU, including - [MiniCPM-Llama3-V-2_5](./minicpm-llama3-v2.5.py) - [MiniCPM-V-2_6](./minicpm_v_2_6.py) - [Speech_Paraformer-Large](./speech_paraformer-large.py) - [Bce-Embedding-Base-V1 ](./bce-embedding.py) -### 4.1 Run MiniCPM-Llama3-V-2_5 & MiniCPM-V-2_6 +### 2.1 Run MiniCPM-Llama3-V-2_5 & MiniCPM-V-2_6 ```bash # to run MiniCPM-Llama3-V-2_5 python minicpm-llama3-v2.5.py --save-directory @@ -132,7 +79,7 @@ What is in this image? The image features a young child holding and showing off a white teddy bear wearing a pink dress. The background includes some red flowers and a stone wall, suggesting an outdoor setting. 
```

-### 4.2 Run Speech_Paraformer-Large
+### 2.2 Run Speech_Paraformer-Large
```bash
# to run Speech_Paraformer-Large
python speech_paraformer-large.py --save-directory <converted_model_path>
@@ -156,7 +103,7 @@ rtf_avg: 0.232: 100%|███████████████████
[{'key': 'asr_example_zh', 'text': '欢 迎 大 家 来 体 验 达 摩 院 推 出 的 语 音 识 别 模 型'}]
```

-### 4.3 Run Bce-Embedding-Base-V1
+### 2.3 Run Bce-Embedding-Base-V1
```bash
# to run Bce-Embedding-Base-V1
python bce-embedding.py --save-directory <converted_model_path>
@@ -176,3 +123,38 @@ Inference time: xxx s
[-0.04398304 0.00023038 0.00643183 ... -0.02717186 0.00483789 0.02298774]]
```
+
+## 3. Running Examples
+
+```
+python ./generate.py
+```
+
+Arguments info:
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Phi-3-vision model (e.g. `microsoft/Phi-3-vision-128k-instruct`) to be downloaded, or the path to the huggingface checkpoint folder. It defaults to `'microsoft/Phi-3-vision-128k-instruct'`; for more verified models, please see the list in [Verified Models](#verified-models).
+- `--lowbit-path LOWBIT_MODEL_PATH`: argument defining the path to save/load the lowbit version of the model. If it is an empty string, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded. If it is an existing path, the lowbit model in `LOWBIT_MODEL_PATH` will be loaded. If it is a non-existing path, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded, and the converted lowbit version will be saved into `LOWBIT_MODEL_PATH`. It defaults to `''`, i.e. an empty string.
+- `--image-url-or-path IMAGE_URL_OR_PATH`: argument defining the image to be inferred. It defaults to `'http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg'`.
+- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is in the image?'`.
+- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
+- `--load_in_low_bit`: argument defining the `load_in_low_bit` format used. It defaults to `sym_int8`; `sym_int4` can also be used.
+
+
+### Sample Output
+#### [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)
+
+```log
+Inference time: xxxx s
+-------------------- Prompt --------------------
+Message: [{'role': 'user', 'content': '<|image_1|>\nWhat is in the image?'}]
+Image link/path: http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg
+-------------------- Output --------------------
+
+
+What is in the image?
+ The image shows a young girl holding a white teddy bear. She is wearing a pink dress with a heart on it. The background includes a stone
+```
+
+The sample input image is (which is fetched from [COCO dataset](https://cocodataset.org/#explore?id=264959)):
+
+
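+Below is a minimal, illustrative sketch of the loading-and-generation flow that `generate.py` wraps. It assumes the `ipex_llm.transformers.npu_model` Python API described in the Quick Start, shows only the text-generation skeleton (the full script additionally prepares the image input for Phi-3-vision), and uses a placeholder model id; treat it as a sketch rather than the supported example.
+
+```python
+# Minimal sketch only; see generate.py in this directory for the full, supported example.
+# Assumes the `ipex_llm.transformers.npu_model` API from the NPU Quick Start; the model id
+# below is a placeholder and keyword arguments may differ between ipex-llm versions.
+import torch
+from transformers import AutoTokenizer
+from ipex_llm.transformers.npu_model import AutoModelForCausalLM
+
+model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder: a verified model repo id or local checkpoint folder
+
+# Load the model with IPEX-LLM low-bit optimizations targeting the Intel NPU
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    torch_dtype=torch.float16,
+    trust_remote_code=True,
+    attn_implementation="eager",
+    load_in_low_bit="sym_int8",  # mirrors the --load_in_low_bit argument above; "sym_int4" can also be used
+)
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+# Encode a prompt, generate up to 32 new tokens, and decode the result
+with torch.inference_mode():
+    input_ids = tokenizer.encode("What is AI?", return_tensors="pt")
+    output = model.generate(input_ids, max_new_tokens=32)
+    print(tokenizer.decode(output[0], skip_special_tokens=True))
+```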