intel-analytics · yangw1234 · Jun 20, 2024 · Jun 20, 2024 · Jun 20, 2024 · Jun 21, 2024
diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/Model/tinyllama/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/Model/tinyllama/README.md
@@ -0,0 +1,57 @@
+# Run TinyLlama on Intel NPU
+In this directory, you will find examples of how to run TinyLlama on Intel NPU devices.
+
+## 0. Requirements
+To run these examples with IPEX-LLM on Intel NPUs, make sure the latest Intel NPU driver is installed.
+Download the driver from https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html and unzip it.
+Then open **Device Manager** and find **Neural Processors** -> **Intel(R) AI Boost**.
+Right-click it, select **Update Driver**, and manually select the folder unzipped from the driver package.
+
+## Example: Predict Tokens using `generate()` API
+In the example [generate.py](./generate.py), we show a basic use case for a TinyLlama model to predict the next N tokens using the `generate()` API on Intel NPUs.
+### 1. Install
+#### 1.1 Installation on Windows
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.10 libuv
+conda activate llm
+pip install --pre --upgrade ipex-llm
+pip install openvino
+pip install onnx
+pip install torch
+pip install accelerate
+pip install transformers==4.35.1
+```
+
+### 2. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+#### 2.1 Configurations for Windows
+<details>
+
+```cmd
+set BIGDL_USE_NPU=1
+```
+
+</details>
+
+### 3. Running examples
+
+```cmd
+python ./generate.py
+```
+
+Arguments info:
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the TinyLlama model to download (e.g. `TinyLlama/TinyLlama-1.1B-Chat-v1.0`), or the path to a Hugging Face checkpoint folder. Defaults to `'TinyLlama/TinyLlama-1.1B-Chat-v1.0'`.
+- `--prompt PROMPT`: the prompt to run inference on (with integrated prompt format for chat). Defaults to `'Once upon a time, there is a little girl named Lily who lives in a small village.'`.
+- `--n-predict N_PREDICT`: the maximum number of tokens to predict. Defaults to `32`.
+
+#### Sample Output
+#### [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
+
+```log
+Inference time: xxxx s
+-------------------- Output --------------------
+<s> Once upon a time, there is a little girl named Lily who lives in a small village. She loves to play with her friends and spend time with her family.<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
+--------------------------------------------------------------------------------
+done
+```
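
Note on the sample output above: the decoded text ends in a run of `<unk>` tokens, which stay visible because `generate.py` decodes with `skip_special_tokens=False`. A minimal sketch of trimming that tail for display follows. It assumes the padding always shows up as the literal string `<unk>` at the very end of the decoded text; `strip_trailing_unk` is a hypothetical helper and is not part of IPEX-LLM or of the example script.

```python
def strip_trailing_unk(text: str, token: str = "<unk>") -> str:
    """Remove any run of `token` from the end of `text`.

    Hypothetical post-processing helper; occurrences of `token` in the
    middle of the text are left untouched.
    """
    while token and text.endswith(token):
        text = text[: -len(token)]
    # Drop whitespace that may be left behind once the tokens are gone.
    return text.rstrip()


if __name__ == "__main__":
    sample = "<s> She loves to play with her friends.<unk><unk><unk>"
    print(strip_trailing_unk(sample))
```

Passing `skip_special_tokens=True` to `tokenizer.decode` may be the simpler option when the leading `<s>` is not needed either.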
diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/Model/tinyllama/generate.py b/python/llm/example/NPU/HF-Transformers-AutoModels/Model/tinyllama/generate.py
@@ -0,0 +1,62 @@
+#
+# Copyright 2016 The BigDL Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import torch
+import time
+import argparse
+
+from ipex_llm.transformers.npu_model import AutoModelForCausalLM
+from transformers import AutoTokenizer
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for npu model')
+    parser.add_argument('--repo-id-or-model-path', type=str, default="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+                        help='The huggingface repo id for the tinyllama model to be downloaded'
+                             ', or the path to the huggingface checkpoint folder')
+    parser.add_argument('--prompt', type=str, default="Once upon a time, there is a little girl named Lily who lives in a small village.",
+                        help='Prompt to infer')
+    parser.add_argument('--n-predict', type=int, default=32,
+                        help='Max tokens to predict')
+
+    args = parser.parse_args()
+    model_path = args.repo_id_or_model_path
+
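+    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+    # Load the model through the IPEX-LLM NPU wrapper, selecting the OpenVINO backend.
+    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
+                                                 npu_backend="openvino")
+
+    print(model)
+
+    # benchmark_util is imported as a top-level module, so it must be importable
+    # (for example, located next to this script or on PYTHONPATH).
+    from benchmark_util import BenchmarkWrapper
+
+    model = BenchmarkWrapper(model, do_print=True)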
+
+    with torch.inference_mode():
+        input_ids = tokenizer.encode(args.prompt, return_tensors="pt")
+        print("finish to load")
+        print('input length:', len(input_ids[0]))
+        st = time.time()
+        output = model.generate(input_ids, do_sample=False, max_new_tokens=args.n_predict)
+        end = time.time()
+        print(f'Inference time: {end-st} s')
+        output_str = tokenizer.decode(output[0], skip_special_tokens=False)
+        print('-'*20, 'Output', '-'*20)
+        print(output_str)
+
+    print('-'*80)
+    print('done')
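
Note on the timing in `generate.py`: the script reports only the wall-clock time of the whole `generate()` call. A rough throughput figure can be derived from the same measurements, as sketched below. The sketch assumes the returned sequence includes the prompt tokens, so the number of new tokens is the output length minus the input length; `tokens_per_second` is a hypothetical helper, not part of the example or of IPEX-LLM.

```python
def tokens_per_second(input_len: int, output_len: int, elapsed_s: float) -> float:
    """Average number of newly generated tokens per second.

    input_len:  number of prompt tokens, e.g. len(input_ids[0])
    output_len: number of tokens in the returned sequence, e.g. len(output[0])
    elapsed_s:  wall-clock duration of the generate() call, in seconds
    """
    new_tokens = output_len - input_len
    if new_tokens < 0:
        raise ValueError("output_len must be at least input_len")
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return new_tokens / elapsed_s


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(tokens_per_second(input_len=22, output_len=54, elapsed_s=4.0))
```

Because the measured interval covers prompt processing as well as decoding, this is a blended average rather than a pure per-token decode rate.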
diff --git a/python/llm/src/ipex_llm/transformers/npu/__init__.py b/python/llm/src/ipex_llm/transformers/npu/__init__.py
@@ -0,0 +1,15 @@
+#
+# Copyright 2016 The BigDL Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#