diff --git a/docs/mddocs/Quickstart/npu_quickstart.md b/docs/mddocs/Quickstart/npu_quickstart.md
index 6ffe46bce99..8b017e370c3 100644
--- a/docs/mddocs/Quickstart/npu_quickstart.md
+++ b/docs/mddocs/Quickstart/npu_quickstart.md
@@ -82,42 +82,28 @@ With the `llm-npu` environment active, use `pip` to install `ipex-llm` for NPU:
conda activate llm-npu
pip install --pre --upgrade ipex-llm[npu]
-
-:: [optional] for Llama-3.2-1B-Instruct & Llama-3.2-3B-Instruct
-pip install transformers==4.45.0 accelerate==0.33.0
-
-:: [optional] for glm-edge-1.5b-chat & glm-edge-4b-chat
-pip install transformers==4.47.0 accelerate==0.26.0
```
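+
+To quickly confirm the package landed in the active environment (a generic `pip` check, not specific to `ipex-llm`), you can run:
+
+```cmd
+:: prints package metadata if ipex-llm is installed in this environment
+pip show ipex-llm
+```
+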
## Runtime Configurations
For `ipex-llm` NPU support, please set the following environment variable with active `llm-npu` environment based on your device:
-
+- For **Intel Core™ Ultra Processors (Series 2) with processor number 2xxV**
-For Intel Core™ Ultra Processers (Series 2) with processor number 2xxV
-
-```bash
+```cmd
set BIGDL_USE_NPU=1
-# [optional] for Intel Core™ Ultra 5 Processor 228V & 226V
+:: [optional] for Intel Core™ Ultra 5 Processor 228V & 226V
set IPEX_LLM_NPU_DISABLE_COMPILE_OPT=1
```
-
-
-
-
-For Intel Core™ Ultra Processers (Series 1) with processor number 1xxH
+- For **Intel Core™ Ultra Processors (Series 1) with processor number 1xxH**
-```bash
+```cmd
set BIGDL_USE_NPU=1
set IPEX_LLM_NPU_MTL=1
```
-
-
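+
+To quickly confirm the variable is visible in the current `cmd` session (a generic Windows check, not specific to `ipex-llm`), you can echo it:
+
+```cmd
+:: prints 1 if the variable was set in this session
+echo %BIGDL_USE_NPU%
+```
+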
## Python API
IPEX-LLM offers Hugging Face `transformers`-like Python API, enabling seamless running of Hugging Face transformers models on Intel NPU.
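+
+As a rough sketch of what this looks like in practice (the verified, model-specific example scripts linked in the table below remain the authoritative reference, and some models need extra parameters not shown here), loading and running a chat model on the NPU is roughly:
+
+```python
+# Rough sketch only -- defer to the linked example scripts for exact, verified usage.
+import torch
+from transformers import AutoTokenizer
+from ipex_llm.transformers.npu_model import AutoModelForCausalLM  # NPU-enabled AutoModel
+
+model_path = "meta-llama/Llama-2-7b-chat-hf"  # any verified model from the table below
+
+# Load with IPEX-LLM low-bit optimizations targeting the Intel NPU;
+# some models require additional arguments (see the per-model examples).
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    torch_dtype=torch.float16,
+    load_in_low_bit="sym_int4",   # a commonly used low-bit format in the linked examples
+    trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+with torch.inference_mode():
+    input_ids = tokenizer.encode("What is AI?", return_tensors="pt")
+    output = model.generate(input_ids, max_new_tokens=32)
+    print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+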
@@ -134,10 +120,10 @@ Refer to the following table for examples of verified models:
| GLM-Edge | [THUDM/glm-edge-1.5b-chat](https://huggingface.co/THUDM/glm-edge-1.5b-chat), [THUDM/glm-edge-4b-chat](https://huggingface.co/THUDM/glm-edge-4b-chat) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) |
| MiniCPM | [openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16), [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) |
| Baichuan 2 | [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental) |
-| MiniCPM-Llama3-V-2_5 | [openbmb/MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal#4-run-optimized-models-experimental) |
-| MiniCPM-V-2_6 | [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal#4-run-optimized-models-experimental) |
-| Bce-Embedding-Base-V1 | [maidalun1020/bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal#4-run-optimized-models-experimental) |
-| Speech_Paraformer-Large | [iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal#4-run-optimized-models-experimental) |
+| MiniCPM-Llama3-V-2_5 | [openbmb/MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental) |
+| MiniCPM-V-2_6 | [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental) |
+| Bce-Embedding-Base-V1 | [maidalun1020/bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental) |
+| Speech_Paraformer-Large | [iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch) | [link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental) |
> [!TIP]
diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md
index a999d1142fd..de212384cce 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md
@@ -12,12 +12,27 @@ In this directory, you will find a C++ example on how to run LLM models on Intel
| MiniCPM | [openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16), [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) |
| Llama3.2 | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) |
-## 0. Install Prerequisites
+## 0. Prerequisites
For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations.
## 1. Install & Runtime Configurations
### 1.1 Installation on Windows
-Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for `ipex-llm` installation.
+We suggest using conda to manage the environment:
+```cmd
+conda create -n llm python=3.11
+conda activate llm
+
+:: install ipex-llm with 'npu' option
+pip install --pre --upgrade ipex-llm[npu]
+
+:: [optional] for Llama-3.2-1B-Instruct & Llama-3.2-3B-Instruct
+pip install transformers==4.45.0 accelerate==0.33.0
+
+:: [optional] for glm-edge-1.5b-chat & glm-edge-4b-chat
+pip install transformers==4.47.0 accelerate==0.26.0
+```
+
+Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for more details about `ipex-llm` installation on Intel NPU.
### 1.2 Runtime Configurations
Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variables setting based on your device.
diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/Pipeline-Models/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/Pipeline-Models/README.md
index e94ce2d0c07..fbf52ed9d96 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/Pipeline-Models/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/Pipeline-Models/README.md
@@ -19,13 +19,28 @@ For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../../d
## 1. Install & Runtime Configurations
### 1.1 Installation on Windows
-Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for `ipex-llm` installation.
+We suggest using conda to manage the environment:
+```cmd
+conda create -n llm python=3.11
+conda activate llm
+
+:: install ipex-llm with 'npu' option
+pip install --pre --upgrade ipex-llm[npu]
+
+:: [optional] for Llama-3.2-1B-Instruct & Llama-3.2-3B-Instruct
+pip install transformers==4.45.0 accelerate==0.33.0
+
+:: [optional] for glm-edge-1.5b-chat & glm-edge-4b-chat
+pip install transformers==4.47.0 accelerate==0.26.0
+```
+
+Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for more details about `ipex-llm` installation on Intel NPU.
### 1.2 Runtime Configurations
Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variables setting based on your device.
-## 2. Run Models
-In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs.
+## 2. Run Optimized Models
+The examples below show how to run the **_optimized HuggingFace model implementations_** on Intel NPU:
```cmd
:: to run Llama-2-7b-chat-hf
diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
index eb76f0ec112..e07cdd55f5f 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
@@ -22,14 +22,29 @@ In this directory, you will find examples on how to directly run HuggingFace `tr
| Mistral | [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) |
## 0. Prerequisites
-For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations.
+For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations.
## 1. Install & Runtime Configurations
### 1.1 Installation on Windows
-Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for `ipex-llm` installation.
+We suggest using conda to manage the environment:
+```cmd
+conda create -n llm python=3.11
+conda activate llm
+
+:: install ipex-llm with 'npu' option
+pip install --pre --upgrade ipex-llm[npu]
+
+:: [optional] for Llama-3.2-1B-Instruct & Llama-3.2-3B-Instruct
+pip install transformers==4.45.0 accelerate==0.33.0
+
+:: [optional] for glm-edge-1.5b-chat & glm-edge-4b-chat
+pip install transformers==4.47.0 accelerate==0.26.0
+```
+
+Please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for more details about `ipex-llm` installation on Intel NPU.
### 1.2 Runtime Configurations
-Please refer to [Quick Start](../../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variables setting based on your device.
+Please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for setting environment variables based on your device.
## 2. Run Optimized Models (Experimental)
The examples below show how to run the **_optimized HuggingFace model implementations_** on Intel NPU, including
diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md
index d24c1e15920..faa6504c1d3 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md
@@ -11,16 +11,11 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o
| Bce-Embedding-Base-V1 | [maidalun1020/bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) |
| Speech_Paraformer-Large | [iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch) |
-## Requirements
-To run these examples with IPEX-LLM on Intel NPUs, make sure to install the newest driver version of Intel NPU.
-Go to https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html to download and unzip the driver.
-Then go to **Device Manager**, find **Neural Processors** -> **Intel(R) AI Boost**.
-Right click and select **Update Driver** -> **Browse my computer for drivers**. And then manually select the unzipped driver folder to install.
-
-## Example: Predict Tokens using `generate()` API
-In the example [generate.py](./generate.py), we show a basic use case for a phi-3-vision model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs.
-### 1. Install
-#### 1.1 Installation on Windows
+## 0. Prerequisites
+For `ipex-llm` NPU support, please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations.
+
+## 1. Install & Runtime Configurations
+### 1.1 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.10 libuv
@@ -40,67 +35,19 @@ pip install BCEmbedding==0.1.5 transformers==4.40.0
pip install funasr==1.1.14
pip install modelscope==1.20.1 torch==2.1.2 torchaudio==2.1.2
```
+Please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for more details about `ipex-llm` installation on Intel NPU.
-### 2. Runtime Configurations
-For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
-#### 2.1 Configurations for Windows
-
-> [!NOTE]
-> For optimal performance, we recommend running code in `conhost` rather than Windows Terminal:
-> - Press Win+R and input `conhost`, then press Enter to launch `conhost`.
-> - Run following command to use conda in `conhost`. Replace `` with your conda install location.
-> ```
-> call \Scripts\activate
-> ```
-
-**Following envrionment variables are required**:
-
-```cmd
-set BIGDL_USE_NPU=1
-```
-
-### 3. Running examples
-
-```
-python ./generate.py
-```
-
-Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Phi-3-vision model (e.g. `microsoft/Phi-3-vision-128k-instruct`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'microsoft/Phi-3-vision-128k-instruct'`, and more verified models please see the list in [Verified Models](#verified-models).
-- `--lowbit-path LOWBIT_MODEL_PATH`: argument defining the path to save/load lowbit version of the model. If it is an empty string, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded. If it is an existing path, the lowbit model in `LOWBIT_MODEL_PATH` will be loaded. If it is a non-existing path, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded, and the converted lowbit version will be saved into `LOWBIT_MODEL_PATH`. It is default to be `''`, i.e. an empty string.
-- `--image-url-or-path IMAGE_URL_OR_PATH`: argument defining the image to be infered. It is default to be `'http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg'`.
-- `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is in the image?'`.
-- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
-- `--load_in_low_bit`: argument defining the `load_in_low_bit` format used. It is default to be `sym_int8`, `sym_int4` can also be used.
-
-
-#### Sample Output
-##### [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)
-
-```log
-Inference time: xxxx s
--------------------- Prompt --------------------
-Message: [{'role': 'user', 'content': '<|image_1|>\nWhat is in the image?'}]
-Image link/path: http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg
--------------------- Output --------------------
-
-
-What is in the image?
- The image shows a young girl holding a white teddy bear. She is wearing a pink dress with a heart on it. The background includes a stone
-```
-
-The sample input image is (which is fetched from [COCO dataset](https://cocodataset.org/#explore?id=264959)):
-
-
+### 1.2 Runtime Configurations
+Please refer to [Quick Start](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for setting environment variables based on your device.
-## 4. Run Optimized Models (Experimental)
+## 2. Run Optimized Models (Experimental)
The examples below show how to run the **_optimized HuggingFace & FunASR model implementations_** on Intel NPU, including
- [MiniCPM-Llama3-V-2_5](./minicpm-llama3-v2.5.py)
- [MiniCPM-V-2_6](./minicpm_v_2_6.py)
- [Speech_Paraformer-Large](./speech_paraformer-large.py)
- [Bce-Embedding-Base-V1 ](./bce-embedding.py)
-### 4.1 Run MiniCPM-Llama3-V-2_5 & MiniCPM-V-2_6
+### 2.1 Run MiniCPM-Llama3-V-2_5 & MiniCPM-V-2_6
```bash
# to run MiniCPM-Llama3-V-2_5
python minicpm-llama3-v2.5.py --save-directory
@@ -132,7 +79,7 @@ What is in this image?
The image features a young child holding and showing off a white teddy bear wearing a pink dress. The background includes some red flowers and a stone wall, suggesting an outdoor setting.
```
-### 4.2 Run Speech_Paraformer-Large
+### 2.2 Run Speech_Paraformer-Large
```bash
# to run Speech_Paraformer-Large
python speech_paraformer-large.py --save-directory
@@ -156,7 +103,7 @@ rtf_avg: 0.232: 100%|███████████████████
[{'key': 'asr_example_zh', 'text': '欢 迎 大 家 来 体 验 达 摩 院 推 出 的 语 音 识 别 模 型'}]
```
-### 4.3 Run Bce-Embedding-Base-V1
+### 2.3 Run Bce-Embedding-Base-V1
```bash
# to run Bce-Embedding-Base-V1
python bce-embedding.py --save-directory
@@ -176,3 +123,38 @@ Inference time: xxx s
[-0.04398304 0.00023038 0.00643183 ... -0.02717186 0.00483789
0.02298774]]
```
+
+## 3. Running examples
+In the example [generate.py](./generate.py), we show a basic use case for a phi-3-vision model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs.
+
+```cmd
+python ./generate.py
+```
+
+Arguments info:
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Phi-3-vision model (e.g. `microsoft/Phi-3-vision-128k-instruct`) to be downloaded, or the path to the huggingface checkpoint folder. It defaults to `'microsoft/Phi-3-vision-128k-instruct'`; for more verified models, please see the list in [Verified Models](#verified-models).
+- `--lowbit-path LOWBIT_MODEL_PATH`: argument defining the path to save/load the lowbit version of the model. If it is an empty string, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded. If it is an existing path, the lowbit model in `LOWBIT_MODEL_PATH` will be loaded. If it is a non-existing path, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded, and the converted lowbit version will be saved into `LOWBIT_MODEL_PATH`. It defaults to `''`, i.e. an empty string.
+- `--image-url-or-path IMAGE_URL_OR_PATH`: argument defining the image to be inferred. It defaults to `'http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg'`.
+- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is in the image?'`.
+- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
+- `--load_in_low_bit`: argument defining the `load_in_low_bit` format used. It defaults to `sym_int8`; `sym_int4` can also be used.
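+
+For example, a run that spells out the defaults, loads the model in `sym_int4`, and saves a lowbit copy for faster reloads might look like the following (paths are placeholders):
+
+```cmd
+:: placeholder paths; adjust to your own locations
+python ./generate.py --repo-id-or-model-path microsoft/Phi-3-vision-128k-instruct --lowbit-path .\phi3-vision-lowbit --prompt "What is in the image?" --n-predict 32 --load_in_low_bit sym_int4
+```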
+
+
+### Sample Output
+#### [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)
+
+```log
+Inference time: xxxx s
+-------------------- Prompt --------------------
+Message: [{'role': 'user', 'content': '<|image_1|>\nWhat is in the image?'}]
+Image link/path: http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg
+-------------------- Output --------------------
+
+
+What is in the image?
+ The image shows a young girl holding a white teddy bear. She is wearing a pink dress with a heart on it. The background includes a stone
+```
+
+The sample input image is fetched from the [COCO dataset](https://cocodataset.org/#explore?id=264959).
+
+
+