From 80d08e9e19f49916a71e3145e652abf802e9e9e5 Mon Sep 17 00:00:00 2001
From: Zhao Changmin
Date: Tue, 9 Jul 2024 17:19:42 +0800
Subject: [PATCH] update NPU examples (#11540)

* update NPU examples
---
 .../{Model/llama2 => LLM}/README.md   | 18 +++++++++++++++---
 .../{Model/llama2 => LLM}/generate.py |  0
 2 files changed, 15 insertions(+), 3 deletions(-)
 rename python/llm/example/NPU/HF-Transformers-AutoModels/{Model/llama2 => LLM}/README.md (66%)
 rename python/llm/example/NPU/HF-Transformers-AutoModels/{Model/llama2 => LLM}/generate.py (100%)

diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/Model/llama2/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
similarity index 66%
rename from python/llm/example/NPU/HF-Transformers-AutoModels/Model/llama2/README.md
rename to python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
index ff4a9c1c059..65a672637b3 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/Model/llama2/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
@@ -1,5 +1,17 @@
-# Run LLama2 on Intel NPU
-In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Llama2 models on [Intel NPUs](../../../README.md). For illustration purposes, we utilize the [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as reference Llama2 models.
+# Run Large Language Models on Intel NPU
+In this directory, you will find examples of how to apply IPEX-LLM INT4 or INT8 optimizations to large language models on [Intel NPUs](../../../README.md). For illustration purposes, we utilize [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as a reference Llama2 model. See the table below for the verified models.
+
+## Verified Models
+
+| Model    | Model Link |
+|----------|------------|
+| Llama2   | [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) |
+| Llama3   | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
+| ChatGLM3 | [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) |
+| Qwen2    | [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) |
+| MiniCPM  | [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) |
+| Phi-3    | [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) |
+| StableLM | [stabilityai/stablelm-zephyr-3b](https://huggingface.co/stabilityai/stablelm-zephyr-3b) |
 
 ## 0. Requirements
 To run these examples with IPEX-LLM on Intel NPUs, make sure to install the newest driver version of Intel NPU.
@@ -42,7 +54,7 @@ python ./generate.py
 ```
 
 Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Llama-2-13b-chat-hf`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'meta-llama/Llama-2-7b-chat-hf'`.
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Llama-2-13b-chat-hf`) to be downloaded, or the path to the huggingface checkpoint folder. It defaults to `'meta-llama/Llama-2-7b-chat-hf'`; for more verified models, see the list in [Verified Models](#verified-models).
 - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 - `--load_in_low_bit`: argument defining the `load_in_low_bit` format used. It is default to be `sym_int8`, `sym_int4` can also be used.
diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/Model/llama2/generate.py b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/generate.py
similarity index 100%
rename from python/llm/example/NPU/HF-Transformers-AutoModels/Model/llama2/generate.py
rename to python/llm/example/NPU/HF-Transformers-AutoModels/LLM/generate.py
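For readers who want the shape of the example being renamed here, the sketch below shows the typical IPEX-LLM NPU flow that `generate.py` implements. It is a minimal illustration rather than a copy of the script: it assumes ipex-llm is installed with NPU support and exposes the `ipex_llm.transformers.npu_model.AutoModelForCausalLM` entry point, and it hard-codes the README's default `sym_int8` value of `--load_in_low_bit` and `32` for `--n-predict` in place of the script's CLI arguments.

```python
# Illustrative sketch only -- see generate.py in this directory for the
# actual example. Assumes ipex-llm is installed with NPU support and
# provides ipex_llm.transformers.npu_model.AutoModelForCausalLM.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # or a local checkpoint folder

# `load_in_low_bit` selects the optimization described in the README:
# "sym_int8" (the default) or "sym_int4".
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    load_in_low_bit="sym_int8",
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "Once upon a time, there existed a little girl who liked to have adventures."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.inference_mode():
    # max_new_tokens plays the role of the --n-predict argument.
    output = model.generate(input_ids, max_new_tokens=32)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```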