From 65e281bb29bcd98d691d5350ddabe5b34d70090d Mon Sep 17 00:00:00 2001 From: "Jin, Qiao" <89779290+JinBridger@users.noreply.github.com> Date: Mon, 2 Sep 2024 10:17:57 +0800 Subject: [PATCH 01/16] Add MiniCPM-V cpu example (#11975) * Add MiniCPM-V cpu example * fix * fix * fix * fix --- README.md | 2 +- .../Model/minicpm-v/README.md | 101 ++++++++++++++++++ .../Model/minicpm-v/chat.py | 100 +++++++++++++++++ 3 files changed, 202 insertions(+), 1 deletion(-) create mode 100644 python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/README.md create mode 100644 python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/chat.py diff --git a/README.md b/README.md index 3c767128c74..a34c880b782 100644 --- a/README.md +++ b/README.md @@ -319,7 +319,7 @@ Over 50 models have been optimized/verified on `ipex-llm`, including *LLaMA/LLaM | MiniCPM-V | | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V) | | MiniCPM-V-2 | | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2) | | MiniCPM-Llama3-V-2_5 | | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5) | -| MiniCPM-V-2_6 | | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2_6) | +| MiniCPM-V-2_6 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v) | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2_6) | ## Get Support - Please report a bug or raise a feature request by opening a [Github Issue](https://github.com/intel-analytics/ipex-llm/issues) diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/README.md b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/README.md new file mode 100644 index 00000000000..640be289d36 --- /dev/null +++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/README.md @@ -0,0 +1,101 @@ +# MiniCPM-V +In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on MiniCPM-V models. For illustration purposes, we utilize the [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) as a reference MiniCPM-V model. + +## 0. Requirements +To run these examples with IPEX-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. + +## Example: Predict Tokens using `chat()` API +In the example [chat.py](./chat.py), we show a basic use case for a MiniCPM-V model to predict the next N tokens using `chat()` API, with IPEX-LLM INT4 optimizations. +### 1. Install +We suggest using conda to manage environment: + +On Linux: + +```bash +conda create -n llm python=3.11 +conda activate llm + +# install ipex-llm with 'all' option +pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu +pip install torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cpu +pip install transformers==4.40.0 trl +``` +On Windows: + +```cmd +conda create -n llm python=3.11 +conda activate llm + +pip install --pre --upgrade ipex-llm[all] +pip install torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cpu +pip install transformers==4.40.0 trl +``` + +### 2. Run + +- chat without streaming mode: + ``` + python ./chat.py --prompt 'What is in the image?' + ``` +- chat in streaming mode: + ``` + python ./chat.py --prompt 'What is in the image?' --stream + ``` + +> [!TIP] +> For chatting in streaming mode, it is recommended to set the environment variable `PYTHONUNBUFFERED=1`. 
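Even without `PYTHONUNBUFFERED=1`, streaming output appears promptly as long as each chunk is flushed explicitly. The snippet below is only a sketch of the streaming branch of the accompanying `chat.py` (shown in full later in this patch); it assumes the `model`, `tokenizer`, and `image` objects created in that script:

```python
# Sketch of the streaming path used in chat.py: with stream=True, model.chat()
# yields partial text chunks, which are printed immediately with flushing on.
msgs = [{'role': 'user', 'content': [image, "What is in the image?"]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    stream=True,   # returns a generator of text chunks instead of a full reply
)

for new_text in res:
    print(new_text, flush=True, end='')  # flush each chunk as soon as it arrives
```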
+ + +Arguments info: +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the MiniCPM-V model (e.g. `openbmb/MiniCPM-V-2_6`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'openbmb/MiniCPM-V-2_6'`. +- `--image-url-or-path IMAGE_URL_OR_PATH`: argument defining the image to be infered. It is default to be `'http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg'`. +- `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is in the image?'`. +- `--stream`: flag to chat in streaming mode + +> **Note**: When loading the model in 4-bit, IPEX-LLM converts linear layers in the model into INT4 format. In theory, a *X*B model saved in 16-bit will requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB memory for further inference. +> +> Please select the appropriate size of the MiniCPM model based on the capabilities of your machine. + +#### 2.1 Client +On client Windows machine, it is recommended to run directly with full utilization of all cores: +```cmd +python ./chat.py +``` + +#### 2.2 Server +For optimal performance on server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and run the example with all the physical cores of a single socket. + +E.g. on Linux, +```bash +# set IPEX-LLM env variables +source ipex-llm-init + +# e.g. for a server with 48 cores per socket +export OMP_NUM_THREADS=48 +numactl -C 0-47 -m 0 python ./chat.py +``` + +#### 2.3 Sample Output +#### [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) +```log +Inference time: xxxx s +-------------------- Input Image -------------------- +http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg +-------------------- Input Prompt -------------------- +What is in the image? +-------------------- Chat Output -------------------- +The image features a young child holding a white teddy bear dressed in pink. The background includes some red flowers and what appears to be a stone wall. +``` + +```log +-------------------- Input Image -------------------- +http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg +-------------------- Input Prompt -------------------- +图片里有什么? +-------------------- Stream Chat Output -------------------- +图片中有一个小女孩,她手里拿着一个穿着粉色裙子的白色小熊玩偶。背景中有红色花朵和石头结构,可能是一个花园或庭院。 +``` + +The sample input image is (which is fetched from [COCO dataset](https://cocodataset.org/#explore?id=264959)): + + diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/chat.py b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/chat.py new file mode 100644 index 00000000000..e0a07c59aa8 --- /dev/null +++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/chat.py @@ -0,0 +1,100 @@ +# +# Copyright 2016 The BigDL Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + + +import os +import time +import argparse +import requests +import torch +from PIL import Image +from ipex_llm.transformers import AutoModel +from transformers import AutoTokenizer + + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description='Predict Tokens using `chat()` API for MiniCPM-V model') + parser.add_argument('--repo-id-or-model-path', type=str, default="openbmb/MiniCPM-V-2_6", + help='The huggingface repo id for the MiniCPM-V model to be downloaded' + ', or the path to the huggingface checkpoint folder') + parser.add_argument('--image-url-or-path', type=str, + default='http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg', + help='The URL or path to the image to infer') + parser.add_argument('--prompt', type=str, default="What is in the image?", + help='Prompt to infer') + parser.add_argument('--stream', action='store_true', + help='Whether to chat in streaming mode') + + args = parser.parse_args() + model_path = args.repo_id_or_model_path + image_path = args.image_url_or_path + + # Load model in 4 bit, + # which convert the relevant layers in the model into INT4 format + model = AutoModel.from_pretrained(model_path, + load_in_low_bit="sym_int4", + optimize_model=True, + trust_remote_code=True, + use_cache=True, + torch_dtype=torch.float32, + modules_to_not_convert=["vpm", "resampler"]) + + # Load tokenizer + tokenizer = AutoTokenizer.from_pretrained(model_path, + trust_remote_code=True) + model.eval() + + query = args.prompt + if os.path.exists(image_path): + image = Image.open(image_path).convert('RGB') + else: + image = Image.open(requests.get(image_path, stream=True).raw).convert('RGB') + + # Generate predicted tokens + # here the prompt tuning refers to https://huggingface.co/openbmb/MiniCPM-V-2_6/blob/main/README.md + msgs = [{'role': 'user', 'content': [image, args.prompt]}] + + if args.stream: + res = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer, + stream=True + ) + + print('-'*20, 'Input Image', '-'*20) + print(image_path) + print('-'*20, 'Input Prompt', '-'*20) + print(args.prompt) + print('-'*20, 'Stream Chat Output', '-'*20) + for new_text in res: + print(new_text, flush=True, end='') + else: + st = time.time() + res = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer, + ) + end = time.time() + + print(f'Inference time: {end-st} s') + print('-'*20, 'Input Image', '-'*20) + print(image_path) + print('-'*20, 'Input Prompt', '-'*20) + print(args.prompt) + print('-'*20, 'Chat Output', '-'*20) + print(res) From c48817bd433cac518d1877cd95071da14ee214cd Mon Sep 17 00:00:00 2001 From: Yang Wang Date: Sun, 1 Sep 2024 23:37:44 -0700 Subject: [PATCH 02/16] Support Qwen2-7b MLP in int4 and transpose_value_cache=True (#11968) --- .../transformers/npu_models/convert_mp.py | 7 ++- .../transformers/npu_models/qwen2_mp.py | 53 +++++++++++++++---- 2 files changed, 49 insertions(+), 11 deletions(-) diff --git a/python/llm/src/ipex_llm/transformers/npu_models/convert_mp.py b/python/llm/src/ipex_llm/transformers/npu_models/convert_mp.py index 5dac6c5a871..0c70bf635b0 100644 --- a/python/llm/src/ipex_llm/transformers/npu_models/convert_mp.py +++ b/python/llm/src/ipex_llm/transformers/npu_models/convert_mp.py @@ -65,6 +65,11 @@ def optimize_llm_pre(model: torch.nn.Module, qtype): model.llm.config.model_type = "llama" model = model.llm + if model.config.model_type == "qwen2": + from ipex_llm.transformers.npu_models.qwen2_mp import split_mlp_down_proj + from ipex_llm.transformers.npu_models.qwen2_mp import split_mlp_forward + 
model.apply(split_mlp_down_proj) + # lm_head to cpu optimization if cpu_lm_head: # disable the optimization by default @@ -134,8 +139,6 @@ def optimize_llm( intra_pp = 2 if inter_pp is None: inter_pp = 4 if model.config.intermediate_size == 18944 else 1 - if model.config.intermediate_size == 18944: - transpose_value_cache = False from ipex_llm.transformers.npu_models.qwen2_mp import gen_qwen2_fused_model_forward from ipex_llm.transformers.npu_models.qwen2_mp import DecodeRunner, PrefillRunner diff --git a/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py b/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py index 61bff6e76a4..30e9054d8e4 100644 --- a/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py +++ b/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py @@ -42,6 +42,30 @@ from ipex_llm.transformers.npu_models.common import reshape_lm_head_input from transformers.modeling_outputs import CausalLMOutputWithPast from torch.nn import CrossEntropyLoss +from transformers.models.qwen2.modeling_qwen2 import Qwen2MLP + + +def split_mlp_down_proj(module: torch.nn.Module): + if isinstance(module, Qwen2MLP) and module.down_proj.in_features == 18944: + new_linear_0 = torch.nn.Linear(0, 0, bias=False) + new_weight_0 = torch.nn.Parameter(module.down_proj.weight[:, :9472], requires_grad=False) + new_linear_0.weight = new_weight_0 + new_linear_0.in_features = new_weight_0.size(1) + new_linear_0.out_features = new_weight_0.size(0) + module.down_proj_0 = new_linear_0 + new_linear_1 = torch.nn.Linear(0, 0, bias=False) + new_weight_1 = torch.nn.Parameter(module.down_proj.weight[:, 9472:], requires_grad=False) + new_linear_1.weight = new_weight_1 + new_linear_1.in_features = new_weight_1.size(1) + new_linear_1.out_features = new_weight_1.size(0) + module.down_proj_1 = new_linear_1 + + del module.down_proj + + +def split_mlp_forward(self, x): + h = self.act_fn(self.gate_proj(x)) * self.up_proj(x) + return self.down_proj_0(h[:, :, :9472]) + self.down_proj_1(h[:, :, 9472:]) class LowBitQwenMultiDecoderlayer(LLMBaseNNFactory): @@ -201,7 +225,7 @@ def __init__( self.compile() print("end compiling") - def mlp(self, hidden_states): + def mlp(self, hidden_states, seq_len): mm1 = self.linear( hidden_states, self.intermediate_size, self.hidden_size, bias=False, wt_dtype=self.dtype ) @@ -211,9 +235,13 @@ def mlp(self, hidden_states): mm1 = self.eltwise_mul(self.swish(mm1), mm2) # type: ignore[attr-defined] if self.intermediate_size == 18944: # for qwen2-7b - hidden_states = self.linear( - mm1, self.hidden_size, self.intermediate_size, bias=False, wt_dtype=np.int8 - ) + mm1_0 = self.slice(mm1, begin=[0, 0, 0], end=[1, seq_len, 9472]) + mm1_1 = self.slice(mm1, begin=[0, 0, 9472], end=[1, seq_len, 18944]) + hidden_states_0 = self.linear(mm1_0, self.hidden_size, 9472, + bias=False, wt_dtype=self.dtype) + hidden_states_1 = self.linear(mm1_1, self.hidden_size, 9472, + bias=False, wt_dtype=self.dtype) + hidden_states = hidden_states_0 + hidden_states_1 else: hidden_states = self.linear( mm1, self.hidden_size, self.intermediate_size, bias=False, wt_dtype=self.dtype @@ -257,7 +285,7 @@ def build_decoder( hidden_states = self.eltwise_add(residual, attn_output) residual = hidden_states hidden_states = self.layer_norm(hidden_states, post_attention_layernorm_weight) - hidden_states = self.mlp(hidden_states) + hidden_states = self.mlp(hidden_states, self.seq_len) hidden_states = self.eltwise_add(residual, hidden_states) hidden_states = self.convert_to_fp16(hidden_states) @@ -343,9 +371,13 @@ def 
__init__( ) self.backend_decoders.append(decoder) + offset = 0 for i in range(intra_stages): start, end = self.layer_ranges[i] - self.backend_decoders[i].set_weights(self.op_id, op_parameters[start * 7:end * 7]) + curr_linear_ops = len(self.backend_decoders[i].linear_ops) + curr_parameters = self.op_parameters[offset:offset + curr_linear_ops] + self.backend_decoders[i].set_weights(self.op_id, curr_parameters) + offset = offset + curr_linear_ops def forward( self, @@ -543,7 +575,8 @@ def run_decode( (attn_layer.o_proj.weight, attn_layer.o_proj.scale), (mlp_layer.gate_proj.weight, mlp_layer.gate_proj.scale), (mlp_layer.up_proj.weight, mlp_layer.up_proj.scale), - (mlp_layer.down_proj.weight, mlp_layer.down_proj.scale), + (mlp_layer.down_proj_0.weight, mlp_layer.down_proj_0.scale), + (mlp_layer.down_proj_1.weight, mlp_layer.down_proj_1.scale) ] cached_cos = curr_layer.self_attn.rotary_emb.cos_cached.to(torch.float16) @@ -814,6 +847,8 @@ def run_prefill( transpose_value=transpose_value_cache ) convert_forward(model, Qwen2Attention, qwen2_attention_forward) + from transformers.models.qwen2.modeling_qwen2 import Qwen2MLP + convert_forward(model, Qwen2MLP, split_mlp_forward) deocderlayers = model.model.layers while True: @@ -836,7 +871,6 @@ def run_prefill( hidden_states = layer_outputs[0] next_decoder_cache = layer_outputs[1] - result_queue.put((hidden_states, next_decoder_cache)) @@ -1124,10 +1158,11 @@ def qwen2_attention_forward( cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) - cache_kwargs = {"max_seq_len": max_seq_len, "transpose": transpose_value, } if past_key_value is not None: + if transpose_value: + value_states = value_states.transpose(-1, -2) key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) From a40ea7038d83e1ebee0a21b665b51830d11a8014 Mon Sep 17 00:00:00 2001 From: binbin Deng <108676127+plusbang@users.noreply.github.com> Date: Mon, 2 Sep 2024 17:55:10 +0800 Subject: [PATCH 03/16] Fix AttributeError of qwen2-1.5B (#11990) --- .../transformers/npu_models/qwen2_mp.py | 32 +++++++++++++------ 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py b/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py index 30e9054d8e4..c6cb74c8e80 100644 --- a/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py +++ b/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py @@ -568,16 +568,28 @@ def run_decode( attn_layer = curr_layer.self_attn mlp_layer = curr_layer.mlp - weights = [ - (attn_layer.q_proj.weight, attn_layer.q_proj.scale), - (attn_layer.k_proj.weight, attn_layer.k_proj.scale), - (attn_layer.v_proj.weight, attn_layer.v_proj.scale), - (attn_layer.o_proj.weight, attn_layer.o_proj.scale), - (mlp_layer.gate_proj.weight, mlp_layer.gate_proj.scale), - (mlp_layer.up_proj.weight, mlp_layer.up_proj.scale), - (mlp_layer.down_proj_0.weight, mlp_layer.down_proj_0.scale), - (mlp_layer.down_proj_1.weight, mlp_layer.down_proj_1.scale) - ] + if model.config.intermediate_size == 8960: + # for qwen2-1.5b + weights = [ + (attn_layer.q_proj.weight, attn_layer.q_proj.scale), + (attn_layer.k_proj.weight, attn_layer.k_proj.scale), + (attn_layer.v_proj.weight, attn_layer.v_proj.scale), + (attn_layer.o_proj.weight, attn_layer.o_proj.scale), + (mlp_layer.gate_proj.weight, mlp_layer.gate_proj.scale), + (mlp_layer.up_proj.weight, mlp_layer.up_proj.scale), + 
(mlp_layer.down_proj.weight, mlp_layer.down_proj.scale), + ] + elif model.config.intermediate_size == 18944: + # for qwen2-7b + weights = [ + (attn_layer.q_proj.weight, attn_layer.q_proj.scale), + (attn_layer.k_proj.weight, attn_layer.k_proj.scale), + (attn_layer.v_proj.weight, attn_layer.v_proj.scale), + (attn_layer.o_proj.weight, attn_layer.o_proj.scale), + (mlp_layer.gate_proj.weight, mlp_layer.gate_proj.scale), + (mlp_layer.down_proj_0.weight, mlp_layer.down_proj_0.scale), + (mlp_layer.down_proj_1.weight, mlp_layer.down_proj_1.scale) + ] cached_cos = curr_layer.self_attn.rotary_emb.cos_cached.to(torch.float16) cached_sin = curr_layer.self_attn.rotary_emb.sin_cached.to(torch.float16) From 2f3d1bd0ec3f92f97b5fb562e9ff7dabe3a5d0dd Mon Sep 17 00:00:00 2001 From: binbin Deng <108676127+plusbang@users.noreply.github.com> Date: Mon, 2 Sep 2024 18:11:08 +0800 Subject: [PATCH 04/16] hotfix qwen2-7b weight setting (#11991) --- python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py | 1 + 1 file changed, 1 insertion(+) diff --git a/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py b/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py index c6cb74c8e80..e1c6f9b83b0 100644 --- a/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py +++ b/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py @@ -587,6 +587,7 @@ def run_decode( (attn_layer.v_proj.weight, attn_layer.v_proj.scale), (attn_layer.o_proj.weight, attn_layer.o_proj.scale), (mlp_layer.gate_proj.weight, mlp_layer.gate_proj.scale), + (mlp_layer.up_proj.weight, mlp_layer.up_proj.scale), (mlp_layer.down_proj_0.weight, mlp_layer.down_proj_0.scale), (mlp_layer.down_proj_1.weight, mlp_layer.down_proj_1.scale) ] From 659d15defc61ac2d234de6f7deb3f07bc942b70e Mon Sep 17 00:00:00 2001 From: Yuwen Hu <54161268+Oscilloscope98@users.noreply.github.com> Date: Mon, 2 Sep 2024 19:09:12 +0800 Subject: [PATCH 05/16] Fix wrong attention mask and garbage output for `inputs_embeds` inputs during lookup generation (#11989) * Fix garbage output for input_embeds inputs during lookup generation * Fix on sliding windows * Simplify code --- python/llm/src/ipex_llm/transformers/lookup.py | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/python/llm/src/ipex_llm/transformers/lookup.py b/python/llm/src/ipex_llm/transformers/lookup.py index c70558c7fd2..60680fafbf0 100644 --- a/python/llm/src/ipex_llm/transformers/lookup.py +++ b/python/llm/src/ipex_llm/transformers/lookup.py @@ -175,7 +175,7 @@ def __init__( def init_look_up_table(self, input_ids: torch.LongTensor): - for ngram_size in range(self.max_matching_ngram_size, 0, -1): + for ngram_size in range(min(self.max_matching_ngram_size, input_ids.shape[1]), 0, -1): # Create sliding windows of size ngram_size windows = input_ids.cpu().unfold(dimension=1, size=ngram_size, step=1) for idx in range(windows.size(1)): @@ -315,11 +315,9 @@ def lookup_generate(self, if step == 0: # first token use full model tic = time.time() - output = self(input_ids=input_ids, - past_key_values=past_key_values, - attention_mask=attention_mask, - return_dict=True, - use_cache=True) + model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) + output = self(**model_inputs, + return_dict=True) logits = output['logits'] logits = logits[:, -1:] logits[:, -1, :] = logits_processor(input_ids, logits[:, -1, :]) From 01099f08ee3b04c89472dde124d002e9cdfdb049 Mon Sep 17 00:00:00 2001 From: binbin Deng <108676127+plusbang@users.noreply.github.com> Date: Tue, 3 Sep 2024 14:45:01 
+0800 Subject: [PATCH 06/16] Revert prefill logic of qwen2-7b (#11992) --- .../transformers/npu_models/qwen2_mp.py | 167 +++++------------- 1 file changed, 44 insertions(+), 123 deletions(-) diff --git a/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py b/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py index e1c6f9b83b0..9ddad9391cd 100644 --- a/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py +++ b/python/llm/src/ipex_llm/transformers/npu_models/qwen2_mp.py @@ -801,13 +801,13 @@ def run_prefill( input_layer_norm_weights = [] post_attn_layernorm_weights = [] layer_indexs = range(layer_start, layer_end) - if model.config.intermediate_size == 8960: - # for qwen2-1.5b - for layer_idx in layer_indexs: - curr_layer = model.model.layers[layer_idx] - attn_layer = curr_layer.self_attn - mlp_layer = curr_layer.mlp + for layer_idx in layer_indexs: + curr_layer = model.model.layers[layer_idx] + attn_layer = curr_layer.self_attn + mlp_layer = curr_layer.mlp + if model.config.intermediate_size == 8960: + # for qwen2-1.5b weights = [ (attn_layer.q_proj.weight, attn_layer.q_proj.scale), (attn_layer.k_proj.weight, attn_layer.k_proj.scale), @@ -817,53 +817,52 @@ def run_prefill( (mlp_layer.up_proj.weight, mlp_layer.up_proj.scale), (mlp_layer.down_proj.weight, mlp_layer.down_proj.scale), ] + elif model.config.intermediate_size == 18944: + # for qwen2-7b + weights = [ + (attn_layer.q_proj.weight, attn_layer.q_proj.scale), + (attn_layer.k_proj.weight, attn_layer.k_proj.scale), + (attn_layer.v_proj.weight, attn_layer.v_proj.scale), + (attn_layer.o_proj.weight, attn_layer.o_proj.scale), + (mlp_layer.gate_proj.weight, mlp_layer.gate_proj.scale), + (mlp_layer.up_proj.weight, mlp_layer.up_proj.scale), + (mlp_layer.down_proj_0.weight, mlp_layer.down_proj_0.scale), + (mlp_layer.down_proj_1.weight, mlp_layer.down_proj_1.scale) + ] - cached_cos = curr_layer.self_attn.rotary_emb.cos_cached.to(torch.float16) - cached_sin = curr_layer.self_attn.rotary_emb.sin_cached.to(torch.float16) + cached_cos = curr_layer.self_attn.rotary_emb.cos_cached.to(torch.float16) + cached_sin = curr_layer.self_attn.rotary_emb.sin_cached.to(torch.float16) - layer_norm_0 = curr_layer.input_layernorm.weight.to(torch.float16) - layer_norm_1 = curr_layer.post_attention_layernorm.weight.to(torch.float16) + layer_norm_0 = curr_layer.input_layernorm.weight.to(torch.float16) + layer_norm_1 = curr_layer.post_attention_layernorm.weight.to(torch.float16) - new_decoderlayer = FusedQwenLowBitDecoderlayer( - weights, - num_heads=num_heads, - num_key_value_heads=num_key_value_heads, - cached_cos=cached_cos, - cached_sin=cached_sin, - layer_norm_0=layer_norm_0, - layer_norm_1=layer_norm_1, - q_bias=attn_layer.q_proj.bias.to(torch.float16), - k_bias=attn_layer.k_proj.bias.to(torch.float16), - v_bias=attn_layer.v_proj.bias.to(torch.float16), - layer_idx=layer_idx, - rms_norm_eps=rms_norm_eps, - intermediate_size=intermediate_size, - max_seq_len=max_output_len, - transpose_value=transpose_value_cache, - ) + new_decoderlayer = FusedQwenLowBitDecoderlayer( + weights, + num_heads=num_heads, + num_key_value_heads=num_key_value_heads, + cached_cos=cached_cos, + cached_sin=cached_sin, + layer_norm_0=layer_norm_0, + layer_norm_1=layer_norm_1, + q_bias=attn_layer.q_proj.bias.to(torch.float16), + k_bias=attn_layer.k_proj.bias.to(torch.float16), + v_bias=attn_layer.v_proj.bias.to(torch.float16), + layer_idx=layer_idx, + rms_norm_eps=rms_norm_eps, + intermediate_size=intermediate_size, + max_seq_len=max_output_len, + 
transpose_value=transpose_value_cache, + ) - layer_weights.extend(weights) - input_layer_norm_weights.append(layer_norm_0) - post_attn_layernorm_weights.append(layer_norm_1) - model.model.layers[layer_idx] = new_decoderlayer - deocderlayers.append(new_decoderlayer) + layer_weights.extend(weights) + input_layer_norm_weights.append(layer_norm_0) + post_attn_layernorm_weights.append(layer_norm_1) + model.model.layers[layer_idx] = new_decoderlayer + deocderlayers.append(new_decoderlayer) print("finish creating all decode layers in prefill") result_queue.put("loading finish") - if model.config.intermediate_size == 18944: - # for qwen2-7b - from transformers.models.qwen2.modeling_qwen2 import Qwen2Attention - from ipex_llm.transformers.npu_models.convert_mp import convert_forward - qwen2_attention_forward = generate_qwen2_attention_forward( - max_seq_len=max_output_len, - transpose_value=transpose_value_cache - ) - convert_forward(model, Qwen2Attention, qwen2_attention_forward) - from transformers.models.qwen2.modeling_qwen2 import Qwen2MLP - convert_forward(model, Qwen2MLP, split_mlp_forward) - deocderlayers = model.model.layers - while True: result = input_queue.get() @@ -1136,81 +1135,3 @@ def qwen2_casullm_forward( hidden_states=outputs.hidden_states, attentions=outputs.attentions, ) - - -from transformers.models.qwen2.modeling_qwen2 import apply_rotary_pos_emb, repeat_kv -import math - - -def generate_qwen2_attention_forward(max_seq_len, transpose_value): - def qwen2_attention_forward( - self, - hidden_states: torch.Tensor, - attention_mask: Optional[torch.Tensor] = None, - position_ids: Optional[torch.LongTensor] = None, - past_key_value: Optional[Cache] = None, - output_attentions: bool = False, - use_cache: bool = False, - **kwargs, - ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: - bsz, q_len, _ = hidden_states.size() - - query_states = self.q_proj(hidden_states) - key_states = self.k_proj(hidden_states) - value_states = self.v_proj(hidden_states) - query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) - key_states = key_states.view(bsz, q_len, self.num_key_value_heads, - self.head_dim).transpose(1, 2) - value_states = value_states.view(bsz, q_len, self.num_key_value_heads, - self.head_dim).transpose(1, 2) - - kv_seq_len = key_states.shape[-2] - if past_key_value is not None: - kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) - - cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) - query_states, key_states = apply_rotary_pos_emb(query_states, key_states, - cos, sin, position_ids) - cache_kwargs = {"max_seq_len": max_seq_len, "transpose": transpose_value, } - - if past_key_value is not None: - if transpose_value: - value_states = value_states.transpose(-1, -2) - key_states, value_states = past_key_value.update(key_states, value_states, - self.layer_idx, cache_kwargs) - - key_states = repeat_kv(key_states, self.num_key_value_groups) - value_states = repeat_kv(value_states, self.num_key_value_groups) - - attn_weights = None - if query_states.size(2) == key_states.size(2): - # first token - from intel_npu_acceleration_library.functional import scaled_dot_product_attention - attn_output = scaled_dot_product_attention( - query_states, - key_states, - value_states, - attn_mask=attention_mask, - is_causal=q_len > 1 and bsz == 1, - ) - else: - attn_weights = torch.matmul(query_states, - key_states.transpose(2, 3)) / math.sqrt(self.head_dim) - if attention_mask is not None: - attn_weights = 
attn_weights + attention_mask - # upcast attention to fp32 - attn_weights = torch.nn.functional.softmax(attn_weights, dim=-1, - dtype=torch.float32).to(query_states.dtype) - attn_weights = torch.nn.functional.dropout(attn_weights, p=self.attention_dropout, - training=self.training) - attn_output = torch.matmul(attn_weights, value_states) - - attn_output = attn_output.transpose(1, 2).contiguous() - attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) - - attn_output = self.o_proj(attn_output) - - if not output_attentions: - attn_weights = None - return attn_output, attn_weights, past_key_value - return qwen2_attention_forward From 643458d8f0f22ef4b12defc51bedaa43d6dc3b3f Mon Sep 17 00:00:00 2001 From: Yuwen Hu <54161268+Oscilloscope98@users.noreply.github.com> Date: Tue, 3 Sep 2024 15:52:08 +0800 Subject: [PATCH 07/16] Update GraphRAG QuickStart (#11995) * Update GraphRAG QuickStart * Further updates * Small fixes * Small fix --- docs/mddocs/Quickstart/graphrag_quickstart.md | 99 +++++++++++++++---- 1 file changed, 80 insertions(+), 19 deletions(-) diff --git a/docs/mddocs/Quickstart/graphrag_quickstart.md b/docs/mddocs/Quickstart/graphrag_quickstart.md index 1e104f6df04..52e08ffe060 100644 --- a/docs/mddocs/Quickstart/graphrag_quickstart.md +++ b/docs/mddocs/Quickstart/graphrag_quickstart.md @@ -9,12 +9,16 @@ The [GraphRAG project](https://github.com/microsoft/graphrag) is designed to lev - [Setup Python Environment for GraphRAG](#3-setup-python-environment-for-graphrag) - [Index GraphRAG](#4-index-graphrag) - [Query GraphRAG](#5-query-graphrag) +- [Query GraphRAG](#5-query-graphrag) +- [Troubleshooting](#troubleshooting) ## Quickstart ### 1. Install and Start `Ollama` Service on Intel GPU -Follow the steps in [Run Ollama with IPEX-LLM on Intel GPU Guide](./ollama_quickstart.md) to install and run Ollama on Intel GPU. Ensure that `ollama serve` is running correctly and can be accessed through a local URL (e.g., `https://127.0.0.1:11434`). +Follow the steps in [Run Ollama with IPEX-LLM on Intel GPU Guide](./ollama_quickstart.md) to install `ipex-llm[cpp]==2.1.0` and run Ollama on Intel GPU. Ensure that `ollama serve` is running correctly and can be accessed through a local URL (e.g., `https://127.0.0.1:11434`). + +**Please note that for GraphRAG, we highly recommand using the stable version of ipex-llm through `pip install ipex-llm[cpp]==2.1.0`**. ### 2. Prepare LLM and Embedding Model @@ -57,6 +61,7 @@ conda create -n graphrag-local-ollama python=3.10 conda activate graphrag-local-ollama pip install -e . +pip install future pip install ollama pip install plotly @@ -64,6 +69,9 @@ pip install plotly in which `pip install ollama` is for enabling restful APIs through python, and `pip install plotly` is for visualizing the knowledge graph. +> [!NOTE] +> Please note that the Python environment for GraphRAG setup here is separate from the one for Ollama server on Intel GPUs. + ### 4. Index GraphRAG The environment is now ready for GraphRAG with local LLMs and embedding models running on Intel GPUs. Before querying GraphRAG, it is necessary to first index GraphRAG, which could be a resource-intensive operation. @@ -114,24 +122,25 @@ Perpare the input corpus, and then initialize the workspace: #### Update `settings.yml` In the `settings.yml` file inside the `ragtest` folder, add the configuration `request_timeout: 1800.0` for `llm`. 
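Before starting a long indexing run, it can save time to confirm that the OpenAI-compatible endpoint this file points at actually answers requests. The check below is only an illustrative sketch, not part of the official setup: it assumes the Ollama server is on the default local port, the `mistral` model has been pulled as described above, and the `requests` package is available in the GraphRAG environment.

```python
# Illustrative sanity check: send one chat completion to the endpoint that
# settings.yml will use as api_base, using the same `mistral` model.
import requests

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "mistral",
        "messages": [{"role": "user", "content": "Reply with OK if you can read this."}],
    },
    timeout=1800,  # mirrors the request_timeout recommended below
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

If this call fails or times out, fix the Ollama setup first; failures surfaced during indexing are much harder to diagnose.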
Besides, if you would like to use LLMs or embedding models other than `mistral` or `nomic-embed-text`, you are required to update the `settings.yml` in `ragtest` folder accordingly: -> -> ```yml -> llm: -> api_key: ${GRAPHRAG_API_KEY} -> type: openai_chat -> model: mistral # change it accordingly if using another LLM -> model_supports_json: true -> request_timeout: 1800.0 # add this configuration; you could also increase the request_timeout -> api_base: http://localhost:11434/v1 -> -> embeddings: -> async_mode: threaded -> llm: -> api_key: ${GRAPHRAG_API_KEY} -> type: openai_embedding -> model: nomic_embed_text # change it accordingly if using another embedding model -> api_base: http://localhost:11434/api -> ``` + + +```yml +llm: + api_key: ${GRAPHRAG_API_KEY} + type: openai_chat + model: mistral # change it accordingly if using another LLM + model_supports_json: true + request_timeout: 1800.0 # add this configuration; you could also increase the request_timeout + api_base: http://localhost:11434/v1 + +embeddings: + async_mode: threaded + llm: + api_key: ${GRAPHRAG_API_KEY} + type: openai_embedding + model: nomic_embed_text # change it accordingly if using another embedding model + api_base: http://localhost:11434/api +``` #### Conduct GraphRAG indexing @@ -197,3 +206,55 @@ The Transformer model has been very successful in various natural language proce Since its initial introduction, the Transformer model has been further developed and improved upon. Variants of the Transformer architecture, such as BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly Optimized BERT Pretraining Approach), have achieved state-of-the-art performance on a wide range of natural language processing tasks [Data: Reports (1, 2, 34, 46, 64, +more)]. ``` + +### Troubleshooting + +#### `failed to find free space in the KV cache, retrying with smaller n_batch` when conducting GraphRAG Indexing, and `JSONDecodeError` when querying GraphRAG + +If you observe the Ollama server log showing `failed to find free space in the KV cache, retrying with smaller n_batch` while conducting GraphRAG indexing, and receive `JSONDecodeError` when querying GraphRAG, try to increase context length for the LLM model and index/query GraphRAG again. + +Here introduce how to make the LLM model support larger context. To do this, we need to first create a file named `Modelfile`: + +``` +FROM mistral:latest +PARAMETER num_ctx 4096 +``` + +> [!TIP] +> Here we increase `num_ctx` to 4096 as an example. You could adjust it accordingly. + +and then use the following commands to create a new model in Ollama named `mistral:latest-nctx4096`: + +- For **Linux users**: + + ```bash + ./ollama create mistral:latest-nctx4096 -f Modelfile + ``` + +- For **Windows users**: + + Please run the following command in Miniforge or Anaconda Prompt. 
+ + ```cmd + ollama create mistral:latest-nctx4096 -f Modelfile + ``` + +Finally, update `settings.yml` inside the `ragtest` folder to use `llm` model `mistral:latest-nctx4096`: + +```yml +llm: + api_key: ${GRAPHRAG_API_KEY} + type: openai_chat + model: mistral:latest-nctx4096 # change it accordingly if using another LLM, or LLM model with larger num_ctx + model_supports_json: true + request_timeout: 1800.0 # add this configuration; you could also increase the request_timeout + api_base: http://localhost:11434/v1 + +embeddings: + async_mode: threaded + llm: + api_key: ${GRAPHRAG_API_KEY} + type: openai_embedding + model: nomic_embed_text # change it accordingly if using another embedding model + api_base: http://localhost:11434/api +``` \ No newline at end of file From 2e54f4402b503ff3835b8f423c0d1b436cce7943 Mon Sep 17 00:00:00 2001 From: "Jin, Qiao" <89779290+JinBridger@users.noreply.github.com> Date: Tue, 3 Sep 2024 16:50:42 +0800 Subject: [PATCH 08/16] Rename MiniCPM-V-2_6 CPU example (#11998) --- README.md | 2 +- .../Model/{minicpm-v => minicpm-v-2_6}/README.md | 8 ++++---- .../Model/{minicpm-v => minicpm-v-2_6}/chat.py | 4 ++-- 3 files changed, 7 insertions(+), 7 deletions(-) rename python/llm/example/CPU/HF-Transformers-AutoModels/Model/{minicpm-v => minicpm-v-2_6}/README.md (88%) rename python/llm/example/CPU/HF-Transformers-AutoModels/Model/{minicpm-v => minicpm-v-2_6}/chat.py (97%) diff --git a/README.md b/README.md index a34c880b782..b211f604822 100644 --- a/README.md +++ b/README.md @@ -319,7 +319,7 @@ Over 50 models have been optimized/verified on `ipex-llm`, including *LLaMA/LLaM | MiniCPM-V | | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V) | | MiniCPM-V-2 | | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2) | | MiniCPM-Llama3-V-2_5 | | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5) | -| MiniCPM-V-2_6 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v) | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2_6) | +| MiniCPM-V-2_6 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v-2_6) | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2_6) | ## Get Support - Please report a bug or raise a feature request by opening a [Github Issue](https://github.com/intel-analytics/ipex-llm/issues) diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/README.md b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v-2_6/README.md similarity index 88% rename from python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/README.md rename to python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v-2_6/README.md index 640be289d36..4e0955b82d2 100644 --- a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/README.md +++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v-2_6/README.md @@ -1,11 +1,11 @@ -# MiniCPM-V -In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on MiniCPM-V models. For illustration purposes, we utilize the [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) as a reference MiniCPM-V model. +# MiniCPM-V-2_6 +In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on MiniCPM-V-2_6 models. For illustration purposes, we utilize the [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) as a reference MiniCPM-V-2_6 model. ## 0. 
Requirements To run these examples with IPEX-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. ## Example: Predict Tokens using `chat()` API -In the example [chat.py](./chat.py), we show a basic use case for a MiniCPM-V model to predict the next N tokens using `chat()` API, with IPEX-LLM INT4 optimizations. +In the example [chat.py](./chat.py), we show a basic use case for a MiniCPM-V-2_6 model to predict the next N tokens using `chat()` API, with IPEX-LLM INT4 optimizations. ### 1. Install We suggest using conda to manage environment: @@ -47,7 +47,7 @@ pip install transformers==4.40.0 trl Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the MiniCPM-V model (e.g. `openbmb/MiniCPM-V-2_6`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'openbmb/MiniCPM-V-2_6'`. +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the MiniCPM-V-2_6 model (e.g. `openbmb/MiniCPM-V-2_6`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'openbmb/MiniCPM-V-2_6'`. - `--image-url-or-path IMAGE_URL_OR_PATH`: argument defining the image to be infered. It is default to be `'http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg'`. - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is in the image?'`. - `--stream`: flag to chat in streaming mode diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/chat.py b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v-2_6/chat.py similarity index 97% rename from python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/chat.py rename to python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v-2_6/chat.py index e0a07c59aa8..a6f44bd0ed3 100644 --- a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v/chat.py +++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v-2_6/chat.py @@ -26,9 +26,9 @@ if __name__ == '__main__': - parser = argparse.ArgumentParser(description='Predict Tokens using `chat()` API for MiniCPM-V model') + parser = argparse.ArgumentParser(description='Predict Tokens using `chat()` API for MiniCPM-V-2_6 model') parser.add_argument('--repo-id-or-model-path', type=str, default="openbmb/MiniCPM-V-2_6", - help='The huggingface repo id for the MiniCPM-V model to be downloaded' + help='The huggingface repo id for the MiniCPM-V-2_6 model to be downloaded' ', or the path to the huggingface checkpoint folder') parser.add_argument('--image-url-or-path', type=str, default='http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg', From 164f47adbd628cbee3e6487be1df25201ae220d0 Mon Sep 17 00:00:00 2001 From: Jinhe Date: Tue, 3 Sep 2024 17:02:06 +0800 Subject: [PATCH 09/16] MiniCPM-V-2 & MiniCPM-Llama3-V-2_5 example updates (#11988) * minicpm example updates * --stream --- .../Multimodal/MiniCPM-Llama3-V-2_5/README.md | 32 +++++--- .../{generate.py => chat.py} | 75 ++++++++++++------- .../Multimodal/MiniCPM-V-2/README.md | 31 +++++--- .../MiniCPM-V-2/{generate.py => chat.py} | 61 +++++++++++---- .../Multimodal/MiniCPM-V-2_6/README.md | 4 +- 5 files changed, 143 insertions(+), 60 deletions(-) rename python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5/{generate.py => chat.py} (61%) rename 
python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2/{generate.py => chat.py} (82%) diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5/README.md index 8d88fbb23a6..ee653b58136 100644 --- a/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5/README.md +++ b/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. ## Example: Predict Tokens using `chat()` API -In the example [generate.py](./generate.py), we show a basic use case for a MiniCPM-Llama3-V-2_5 model to predict the next N tokens using `chat()` API, with IPEX-LLM INT4 optimizations on Intel GPUs. +In the example [chat.py](./chat.py), we show a basic use case for a MiniCPM-Llama3-V-2_5 model to predict the next N tokens using `chat()` API, with IPEX-LLM INT4 optimizations on Intel GPUs. ### 1. Install #### 1.1 Installation on Linux We suggest using conda to manage environment: @@ -106,15 +106,20 @@ set SYCL_CACHE_PERSISTENT=1 > For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. ### 4. Running examples -``` -python ./generate.py --prompt 'What is in the image?' -``` +- chat without streaming mode: + ``` + python ./chat.py --prompt 'What is in the image?' + ``` +- chat in streaming mode: + ``` + python ./chat.py --prompt 'What is in the image?' --stream + ``` Arguments info: - `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the MiniCPM-Llama3-V-2_5 (e.g. `openbmb/MiniCPM-Llama3-V-2_5`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'openbmb/MiniCPM-Llama3-V-2_5'`. - `--image-url-or-path IMAGE_URL_OR_PATH`: argument defining the image to be infered. It is default to be `'http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg'`. - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is in the image?'`. -- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. +- `--stream`: flag to chat in streaming mode #### Sample Output @@ -122,12 +127,21 @@ Arguments info: ```log Inference time: xxxx s --------------------- Input -------------------- +-------------------- Input Image -------------------- http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg --------------------- Prompt -------------------- +-------------------- Input Prompt -------------------- What is in the image? --------------------- Output -------------------- -The image features a young child holding a white teddy bear. The teddy bear is dressed in a pink outfit. The child appears to be outdoors, with a stone wall and some red flowers in the background. +-------------------- Chat Output -------------------- +The image features a young child holding a white teddy bear. The teddy bear is dressed in a pink dress with a ribbon on it. The child appears to be smiling and enjoying the moment. 
+``` +```log +Inference time: xxxx s +-------------------- Input Image -------------------- +http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg +-------------------- Input Prompt -------------------- +图片里有什么? +-------------------- Chat Output -------------------- +图片中有一个小孩,手里拿着一个白色的玩具熊。这个孩子看起来很开心,正在微笑并与玩具互动。背景包括红色的花朵和石墙,为这个场景增添了色彩和质感。 ``` The sample input image is (which is fetched from [COCO dataset](https://cocodataset.org/#explore?id=264959)): diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5/generate.py b/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5/chat.py similarity index 61% rename from python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5/generate.py rename to python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5/chat.py index fe5ab5e1014..66aa46304db 100644 --- a/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5/generate.py +++ b/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5/chat.py @@ -14,10 +14,12 @@ # limitations under the License. # + import os import time import argparse import requests +import torch from PIL import Image from ipex_llm.transformers import AutoModel from transformers import AutoTokenizer @@ -33,8 +35,8 @@ help='The URL or path to the image to infer') parser.add_argument('--prompt', type=str, default="What is in the image?", help='Prompt to infer') - parser.add_argument('--n-predict', type=int, default=32, - help='Max tokens to predict') + parser.add_argument('--stream', action='store_true', + help='Whether to chat in streaming mode') args = parser.parse_args() model_path = args.repo_id_or_model_path @@ -45,11 +47,12 @@ # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. 
model = AutoModel.from_pretrained(model_path, - load_in_4bit=True, - optimize_model=False, + load_in_low_bit="sym_int4", + optimize_model=True, trust_remote_code=True, - use_cache=True) - model = model.half().to(device='xpu') + use_cache=True, + modules_to_not_convert=["vpm", "resampler"]) + model = model.half().to('xpu') tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) model.eval() @@ -61,23 +64,45 @@ image = Image.open(requests.get(image_path, stream=True).raw).convert('RGB') # Generate predicted tokens - # here the prompt tuning refers to https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/main/README.md - msgs = [{'role': 'user', 'content': args.prompt}] - st = time.time() - res = model.chat( - image=image, - msgs=msgs, - context=None, - tokenizer=tokenizer, - sampling=False, - temperature=0.7 + # here the prompt tuning refers to https://huggingface.co/openbmb/MiniCPM-V-2_6/blob/main/README.md + msgs = [{'role': 'user', 'content': [image, args.prompt]}] + + # ipex_llm model needs a warmup, then inference time can be accurate + model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer, ) - end = time.time() - print(f'Inference time: {end-st} s') - print('-'*20, 'Input', '-'*20) - print(image_path) - print('-'*20, 'Prompt', '-'*20) - print(args.prompt) - output_str = res - print('-'*20, 'Output', '-'*20) - print(output_str) + + if args.stream: + res = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer, + stream=True + ) + + print('-'*20, 'Input Image', '-'*20) + print(image_path) + print('-'*20, 'Input Prompt', '-'*20) + print(args.prompt) + print('-'*20, 'Stream Chat Output', '-'*20) + for new_text in res: + print(new_text, flush=True, end='') + else: + st = time.time() + res = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer, + ) + torch.xpu.synchronize() + end = time.time() + + print(f'Inference time: {end-st} s') + print('-'*20, 'Input Image', '-'*20) + print(image_path) + print('-'*20, 'Input Prompt', '-'*20) + print(args.prompt) + print('-'*20, 'Chat Output', '-'*20) + print(res) diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2/README.md index da5f94007c9..aed936fb277 100644 --- a/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2/README.md +++ b/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2/README.md @@ -5,7 +5,7 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. ## Example: Predict Tokens using `chat()` API -In the example [generate.py](./generate.py), we show a basic use case for a MiniCPM-V-2 model to predict the next N tokens using `chat()` API, with IPEX-LLM INT4 optimizations on Intel GPUs. +In the example [chat.py](./chat.py), we show a basic use case for a MiniCPM-V-2 model to predict the next N tokens using `chat()` API, with IPEX-LLM INT4 optimizations on Intel GPUs. ### 1. Install #### 1.1 Installation on Linux We suggest using conda to manage environment: @@ -106,15 +106,20 @@ set SYCL_CACHE_PERSISTENT=1 > For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. ### 4. Running examples -``` -python ./generate.py --prompt 'What is in the image?' 
-``` +- chat without streaming mode: + ``` + python ./chat.py --prompt 'What is in the image?' + ``` +- chat in streaming mode: + ``` + python ./chat.py --prompt 'What is in the image?' --stream + ``` Arguments info: - `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the MiniCPM-V-2 (e.g. `openbmb/MiniCPM-V-2`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'openbmb/MiniCPM-V-2'`. - `--image-url-or-path IMAGE_URL_OR_PATH`: argument defining the image to be infered. It is default to be `'http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg'`. - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is in the image?'`. -- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. +- `--stream`: flag to chat in streaming mode #### Sample Output @@ -122,12 +127,20 @@ Arguments info: ```log Inference time: xxxx s --------------------- Input -------------------- +-------------------- Input Image -------------------- http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg --------------------- Prompt -------------------- +-------------------- Input Prompt -------------------- What is in the image? --------------------- Output -------------------- -In the image, there is a young child holding a teddy bear. The teddy bear appears to be dressed in a pink tutu. The child is also wearing a red and white striped dress. The background of the image includes a stone wall and some red flowers. +-------------------- Chat Output -------------------- +In the image, there is a young child holding a teddy bear. The teddy bear is dressed in a pink tutu. The child is also wearing a red and white striped dress. The background of the image features a stone wall and some red flowers. +``` +```log +-------------------- Input Image -------------------- +http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg +-------------------- Input Prompt -------------------- +图片里有什么? 
+-------------------- Chat Output -------------------- +图中是一个小女孩,她手里拿着一只粉白相间的泰迪熊。 ``` The sample input image is (which is fetched from [COCO dataset](https://cocodataset.org/#explore?id=264959)): diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2/generate.py b/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2/chat.py similarity index 82% rename from python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2/generate.py rename to python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2/chat.py index 91ae81d2a26..93441c84bbb 100644 --- a/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2/generate.py +++ b/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2/chat.py @@ -15,6 +15,7 @@ # + from typing import List, Tuple, Optional, Union import math import timm @@ -110,6 +111,7 @@ def _pos_embed(self, x: torch.Tensor) -> torch.Tensor: import time import argparse import requests +import torch from PIL import Image from ipex_llm.transformers import AutoModel from transformers import AutoTokenizer @@ -125,8 +127,8 @@ def _pos_embed(self, x: torch.Tensor) -> torch.Tensor: help='The URL or path to the image to infer') parser.add_argument('--prompt', type=str, default="What is in the image?", help='Prompt to infer') - parser.add_argument('--n-predict', type=int, default=32, - help='Max tokens to predict') + parser.add_argument('--stream', action='store_true', + help='Whether to chat in streaming mode') args = parser.parse_args() model_path = args.repo_id_or_model_path @@ -140,9 +142,9 @@ def _pos_embed(self, x: torch.Tensor) -> torch.Tensor: load_in_low_bit="asym_int4", optimize_model=True, trust_remote_code=True, - modules_to_not_convert=["vpm", "resampler", "lm_head"], - use_cache=True) - model = model.half().to(device='xpu') + use_cache=True, + modules_to_not_convert=["vpm", "resampler"]) + model = model.half().to('xpu') tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) model.eval() @@ -156,7 +158,8 @@ def _pos_embed(self, x: torch.Tensor) -> torch.Tensor: # Generate predicted tokens # here the prompt tuning refers to https://huggingface.co/openbmb/MiniCPM-V-2/blob/main/README.md msgs = [{'role': 'user', 'content': args.prompt}] - st = time.time() + + # ipex_llm model needs a warmup, then inference time can be accurate res, context, _ = model.chat( image=image, msgs=msgs, @@ -165,12 +168,40 @@ def _pos_embed(self, x: torch.Tensor) -> torch.Tensor: sampling=False, temperature=0.7 ) - end = time.time() - print(f'Inference time: {end-st} s') - print('-'*20, 'Input', '-'*20) - print(image_path) - print('-'*20, 'Prompt', '-'*20) - print(args.prompt) - output_str = res - print('-'*20, 'Output', '-'*20) - print(output_str) + if args.stream: + res, context, _ = model.chat( + image=image, + msgs=msgs, + context=None, + tokenizer=tokenizer, + sampling=False, + temperature=0.7 + ) + + print('-'*20, 'Input Image', '-'*20) + print(image_path) + print('-'*20, 'Input Prompt', '-'*20) + print(args.prompt) + print('-'*20, 'Stream Chat Output', '-'*20) + for new_text in res: + print(new_text, flush=True, end='') + else: + st = time.time() + res, context, _ = model.chat( + image=image, + msgs=msgs, + context=None, + tokenizer=tokenizer, + sampling=False, + temperature=0.7 + ) + torch.xpu.synchronize() + end = time.time() + + print(f'Inference time: {end-st} s') + print('-'*20, 'Input Image', '-'*20) + print(image_path) + print('-'*20, 'Input Prompt', '-'*20) + print(args.prompt) + print('-'*20, 'Chat Output', '-'*20) + print(res) diff --git 
a/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2_6/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2_6/README.md index 3a47448f6c2..6063a286b4a 100644 --- a/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2_6/README.md +++ b/python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2_6/README.md @@ -108,11 +108,11 @@ set SYCL_CACHE_PERSISTENT=1 - chat without streaming mode: ``` - python ./generate.py --prompt 'What is in the image?' + python ./chat.py --prompt 'What is in the image?' ``` - chat in streaming mode: ``` - python ./generate.py --prompt 'What is in the image?' --stream + python ./chat.py --prompt 'What is in the image?' --stream ``` > [!TIP] From 6eb55653bae82e917af9cfea260550109c96352f Mon Sep 17 00:00:00 2001 From: Yuwen Hu <54161268+Oscilloscope98@users.noreply.github.com> Date: Tue, 3 Sep 2024 17:46:16 +0800 Subject: [PATCH 10/16] Performance mode strategy update for input_embeds input (#11997) --- python/llm/src/ipex_llm/transformers/lookup.py | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/python/llm/src/ipex_llm/transformers/lookup.py b/python/llm/src/ipex_llm/transformers/lookup.py index 60680fafbf0..c5fe81d49ab 100644 --- a/python/llm/src/ipex_llm/transformers/lookup.py +++ b/python/llm/src/ipex_llm/transformers/lookup.py @@ -60,21 +60,24 @@ def generate( lookahead = kwargs.pop("lookahead", None) perf_mode = os.environ.get("IPEX_LLM_PERFORMANCE_MODE", None) - input_ids_shape = None + input_tensor_shape = None + is_inputs_embeds = False if inputs is not None: - input_ids_shape = inputs.shape + input_tensor_shape = inputs.shape else: input_ids = kwargs.get("input_ids", None) if input_ids is not None: - input_ids_shape = input_ids.shape + input_tensor_shape = input_ids.shape else: inputs_embeds = kwargs.get("inputs_embeds", None) if inputs_embeds is not None: - input_ids_shape = inputs_embeds.shape + is_inputs_embeds = True + input_tensor_shape = inputs_embeds.shape if perf_mode == "1" and lookahead is None: - if input_ids_shape is not None and \ - input_ids_shape[1] >= PERFORMANCE_MODE_LOOKUP_INPUT_THRESHOLD: + if input_tensor_shape is not None and \ + input_tensor_shape[1] >= PERFORMANCE_MODE_LOOKUP_INPUT_THRESHOLD \ + and not is_inputs_embeds: lookahead = 2 # default to 2 now if lookahead: @@ -85,7 +88,7 @@ def generate( logger.warning("Prompt lookup is currently not supported on CPU with IPEX, " "fallback to original generate.") kwargs.pop("max_matching_ngram_size", None) - elif input_ids_shape is not None and input_ids_shape[0] > 1: + elif input_tensor_shape is not None and input_tensor_shape[0] > 1: logger.warning("Prompt lookup is currently not supported with batch inference, " "fallback to original generate.") kwargs.pop("max_matching_ngram_size", None) From 9eaff5e47d05f0e8d4c5302c1bbb94b774b6971b Mon Sep 17 00:00:00 2001 From: Ruonan Wang Date: Tue, 3 Sep 2024 05:53:22 -0700 Subject: [PATCH 11/16] add save & load support for NPU optimized model (#11999) * add save & load support * fix style --- .../src/ipex_llm/transformers/npu_model.py | 52 ++++++++++++++++--- 1 file changed, 46 insertions(+), 6 deletions(-) diff --git a/python/llm/src/ipex_llm/transformers/npu_model.py b/python/llm/src/ipex_llm/transformers/npu_model.py index df18d597394..63487dfaf92 100644 --- a/python/llm/src/ipex_llm/transformers/npu_model.py +++ b/python/llm/src/ipex_llm/transformers/npu_model.py @@ -174,6 +174,7 @@ def from_pretrained(cls, *args, **kwargs): intra_pp=intra_pp, 
                transpose_value_cache=transpose_value_cache,
            )
+            model.save_low_bit = types.MethodType(save_low_bit, model)
         else:
             from ipex_llm.transformers.npu_models.convert import optimize_llm
             optimize_llm(model)
@@ -209,10 +210,16 @@ def load_low_bit(cls, pretrained_model_name_or_path: str, *model_args, **kwargs)
         ignore_argument(kwargs, "lightweight_bmm")
         ignore_argument(kwargs, "cpu_embedding")
         ignore_argument(kwargs, "embedding_qtype")
-        ignore_argument(kwargs, "optimize_model")
         ignore_argument(kwargs, "modules_to_not_convert")
         ignore_argument(kwargs, "speculative")
         ignore_argument(kwargs, "pipeline_parallel_stages")
+        optimize_model = kwargs.pop("optimize_model", False)
+        max_output_len = kwargs.pop("max_output_len", 1024)
+        max_prompt_len = kwargs.pop("max_prompt_len", 512)
+        inter_pp = kwargs.pop("inter_pp", None)
+        intra_pp = kwargs.pop("intra_pp", None)
+        transpose_value_cache = kwargs.pop("transpose_value_cache", True)
+        modules_to_not_convert = kwargs.pop("modules_to_not_convert", [])
 
         from transformers.models.auto.configuration_auto import AutoConfig
         from transformers.modeling_utils import no_init_weights, get_state_dict_dtype
@@ -351,12 +358,34 @@ def load_low_bit(cls, pretrained_model_name_or_path: str, *model_args, **kwargs)
         logger.info(f"Converting model, it may takes up to several minutes ...")
         from intel_npu_acceleration_library.compiler import create_npu_kernels
-        with torch.no_grad():
-            optimize_llm(model)
-            cls.load_convert(qtype, model, quant_device, *model_args, **kwargs)
-            create_npu_kernels(model)
+        if optimize_model:
+            invalidInputError(
+                max_prompt_len < max_output_len,
+                (
+                    f"max_prompt_len ({max_prompt_len}) should be less"
+                    " than max_output_len ({max_output_len})"
+                ),
+            )
+            from ipex_llm.transformers.npu_models.convert_mp import optimize_llm_pre
+
+            if hasattr(model, "llm"):
+                llm = model.llm
+            else:
+                llm = model
+
+            with torch.no_grad():
+                optimize_llm_pre(model, qtype)
+                cls.load_convert(qtype, model, quant_device, modules_to_not_convert,
+                                 *model_args, **kwargs)
+                create_npu_kernels(llm)
-        model = model.eval()
+        else:
+            from ipex_llm.transformers.npu_models.convert import optimize_llm
+            optimize_llm(model)
+            with torch.no_grad():
+                cls.load_convert(qtype, model, quant_device, modules_to_not_convert,
+                                 *model_args, **kwargs)
+                create_npu_kernels(model)
 
         if is_sharded:
             loaded_state_dict_keys = sharded_metadata["all_checkpoint_keys"]
@@ -415,6 +444,17 @@ def load_low_bit(cls, pretrained_model_name_or_path: str, *model_args, **kwargs)
         for param in model.parameters():
             param.requires_grad_(False)
 
+        if optimize_model:
+            from ipex_llm.transformers.npu_models.convert_mp import optimize_llm
+            optimize_llm(
+                llm,
+                max_output_len=max_output_len,
+                max_prompt_len=max_prompt_len,
+                inter_pp=inter_pp,
+                intra_pp=intra_pp,
+                transpose_value_cache=transpose_value_cache,
+            )
+
         return model

From 2b993ad4797d395303a084aa6b72bf2ef20e7593 Mon Sep 17 00:00:00 2001
From: "Wang, Jian4" <61138589+hzjane@users.noreply.github.com>
Date: Wed, 4 Sep 2024 13:50:32 +0800
Subject: [PATCH 12/16] vllm update for glm-4 model automatic not_convert (#12003)

---
 python/llm/src/ipex_llm/vllm/xpu/model_convert.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/python/llm/src/ipex_llm/vllm/xpu/model_convert.py b/python/llm/src/ipex_llm/vllm/xpu/model_convert.py
index 7979bfbc62a..065652e7162 100644
--- a/python/llm/src/ipex_llm/vllm/xpu/model_convert.py
+++ b/python/llm/src/ipex_llm/vllm/xpu/model_convert.py
@@ -250,7 +250,8 @@ def _ipex_llm_load_model(self) -> None:
         from ipex_llm import optimize_model
         import os
         not_convert_last_mlp = os.getenv("IPEX_LLM_NOT_CONVERT_LAST_MLP", None)
-        if not_convert_last_mlp is not None:
+        is_glm4_model = "glm-4" in self.model_config.model.lower()
+        if not_convert_last_mlp is not None or is_glm4_model:
             # only use to avoid nan value in last mlp forward running glm4-9b-chat
             modules = ["35.mlp", "36.mlp", "37.mlp", "38.mlp", "39.mlp"]
         else:

From 77cb3482209b56e068cdf8d9b0821bd9aae58277 Mon Sep 17 00:00:00 2001
From: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
Date: Wed, 4 Sep 2024 17:13:45 +0800
Subject: [PATCH 13/16] fix dependabot alerts (#12006)

* fix dependabot alerts

* update
---
 .github/actions/llm/download-llm-binary/action.yml |  2 +-
 .github/workflows/llm-c-evaluation.yml             | 12 ++++++------
 .github/workflows/llm-harness-evaluation.yml       | 12 ++++++------
 .github/workflows/llm-ppl-evaluation.yml           | 12 ++++++------
 .github/workflows/llm-whisper-evaluation.yml       | 12 ++++++------
 5 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/.github/actions/llm/download-llm-binary/action.yml b/.github/actions/llm/download-llm-binary/action.yml
index c15bfe49f2a..19de28ec43b 100644
--- a/.github/actions/llm/download-llm-binary/action.yml
+++ b/.github/actions/llm/download-llm-binary/action.yml
@@ -10,7 +10,7 @@ runs:
   using: "composite"
   steps:
     - name: Download all build files
-      uses: actions/download-artifact@v3
+      uses: actions/download-artifact@4.1.7
     - name: Move build resources
       shell: bash
       run: |
diff --git a/.github/workflows/llm-c-evaluation.yml b/.github/workflows/llm-c-evaluation.yml
index ac212d33280..cb890f19eec 100644
--- a/.github/workflows/llm-c-evaluation.yml
+++ b/.github/workflows/llm-c-evaluation.yml
@@ -12,10 +12,10 @@ permissions:
 on:
   # schedule:
   #   - cron: "00 15 * * *" # GMT time, 15:00 GMT == 23:00 Beijing Time
-  pull_request:
-    branches: [main]
-    paths:
-      - ".github/workflows/llm-c-evaluation.yml"
+  # pull_request:
+  #   branches: [main]
+  #   paths:
+  #     - ".github/workflows/llm-c-evaluation.yml"
   # Allows you to run this workflow manually from the Actions tab
   workflow_dispatch:
     inputs:
@@ -204,7 +204,7 @@ jobs:
         pip install pandas==1.5.3
 
     - name: Download ceval results
-      uses: actions/download-artifact@v3
+      uses: actions/download-artifact@4.1.7
       with:
         name: ceval_results
         path: results
@@ -259,7 +259,7 @@ jobs:
         fi
 
     - name: Download ceval results
-      uses: actions/download-artifact@v3
+      uses: actions/download-artifact@4.1.7
       with:
         name: results_${{ needs.set-matrix.outputs.date }}
         path: ${{ env.ACC_FOLDER }}
diff --git a/.github/workflows/llm-harness-evaluation.yml b/.github/workflows/llm-harness-evaluation.yml
index 8e9b9bf7d25..839393bb49c 100644
--- a/.github/workflows/llm-harness-evaluation.yml
+++ b/.github/workflows/llm-harness-evaluation.yml
@@ -12,10 +12,10 @@ permissions:
 on:
   # schedule:
   #   - cron: "30 12 * * *" # GMT time, 12:30 GMT == 20:30 China
-  pull_request:
-    branches: [main]
-    paths:
-      - ".github/workflows/llm-harness-evaluation.yml"
+  # pull_request:
+  #   branches: [main]
+  #   paths:
+  #     - ".github/workflows/llm-harness-evaluation.yml"
   # Allows you to run this workflow manually from the Actions tab
   workflow_dispatch:
     inputs:
@@ -220,7 +220,7 @@ jobs:
         pip install --upgrade pip
         pip install jsonlines pytablewriter regex
     - name: Download all results
-      uses: actions/download-artifact@v3
+      uses: actions/download-artifact@4.1.7
       with:
         name: harness_results
         path: results
@@ -260,7 +260,7 @@ jobs:
         fi
 
     - name: Download harness results
-      uses: actions/download-artifact@v3
+      uses: actions/download-artifact@4.1.7
       with:
         name: harness_results
         path: ${{ env.ACC_FOLDER}}/${{ env.DATE }}
diff --git a/.github/workflows/llm-ppl-evaluation.yml b/.github/workflows/llm-ppl-evaluation.yml
index d1d0be00499..6a64502ffbb 100644
--- a/.github/workflows/llm-ppl-evaluation.yml
+++ b/.github/workflows/llm-ppl-evaluation.yml
@@ -12,10 +12,10 @@ permissions:
 on:
   # schedule:
   #   - cron: "00 12 * * *" # GMT time, 12:00 GMT == 20:00 China
-  pull_request:
-    branches: [main]
-    paths:
-      - ".github/workflows/llm-ppl-evaluation.yml"
+  # pull_request:
+  #   branches: [main]
+  #   paths:
+  #     - ".github/workflows/llm-ppl-evaluation.yml"
   # Allows you to run this workflow manually from the Actions tab
   workflow_dispatch:
     inputs:
@@ -206,7 +206,7 @@ jobs:
         pip install --upgrade pip
         pip install jsonlines pytablewriter regex
     - name: Download all results
-      uses: actions/download-artifact@v3
+      uses: actions/download-artifact@4.1.7
       with:
         name: ppl_results
         path: results
@@ -245,7 +245,7 @@ jobs:
         fi
 
     - name: Download ppl results
-      uses: actions/download-artifact@v3
+      uses: actions/download-artifact@4.1.7
       with:
         name: ppl_results
         path: ${{ env.ACC_FOLDER}}/${{ env.DATE }}
diff --git a/.github/workflows/llm-whisper-evaluation.yml b/.github/workflows/llm-whisper-evaluation.yml
index bde7929c5a0..538e10e56b0 100644
--- a/.github/workflows/llm-whisper-evaluation.yml
+++ b/.github/workflows/llm-whisper-evaluation.yml
@@ -12,10 +12,10 @@ permissions:
 on:
   # schedule:
   #   - cron: "00 13 * * *" # GMT time, 13:00 GMT == 21:00 China
-  pull_request:
-    branches: [main]
-    paths:
-      - ".github/workflows/llm-whisper-evaluation.yml"
+  # pull_request:
+  #   branches: [main]
+  #   paths:
+  #     - ".github/workflows/llm-whisper-evaluation.yml"
   # Allows you to run this workflow manually from the Actions tab
   workflow_dispatch:
     inputs:
@@ -176,14 +176,14 @@ jobs:
 
     - name: Download all results for nightly run
       if: github.event_name == 'schedule'
-      uses: actions/download-artifact@v3
+      uses: actions/download-artifact@4.1.7
       with:
         name: whisper_results
         path: ${{ env.NIGHTLY_FOLDER}}/${{ env.OUTPUT_PATH }}
 
     - name: Download all results for pr run
       if: github.event_name == 'pull_request'
-      uses: actions/download-artifact@v3
+      uses: actions/download-artifact@4.1.7
       with:
         name: whisper_results
         path: ${{ env.PR_FOLDER}}/${{ env.OUTPUT_PATH }}

From b1408a1f1c1d2b9c99e61b5269d82b46ebde5af5 Mon Sep 17 00:00:00 2001
From: Yishuo Wang
Date: Wed, 4 Sep 2024 18:02:49 +0800
Subject: [PATCH 14/16] fix UT (#12005)

---
 .../test/inference_gpu/test_transformers_api_attention.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/python/llm/test/inference_gpu/test_transformers_api_attention.py b/python/llm/test/inference_gpu/test_transformers_api_attention.py
index 84bdcf8e8cb..c18a52bb201 100644
--- a/python/llm/test/inference_gpu/test_transformers_api_attention.py
+++ b/python/llm/test/inference_gpu/test_transformers_api_attention.py
@@ -151,7 +151,7 @@ def Llama2_7B_gpu_model(self, Name, Model, Tokenizer, model_path):
         # currently only compare the output of the last self-attention layer.
         layer_norm = "model.layers.31.input_layernorm"
         self_attn = "model.layers.31.self_attn"
-        lower_bound = 8e-3
+        lower_bound = 2e-2
         self.run_optimize_gpu_model(Name, Model, Tokenizer, model_path, self_attn, layer_norm, lower_bound)
 
     def Falcon_7B_gpu_model(self, Name, Model, Tokenizer, model_path):
@@ -165,7 +165,7 @@ def Chatglm2_gpu_model(self, Name, Model, Tokenizer, model_path):
         # currently only need to compare the output of one self-attention layer.
         layer_norm = "transformer.encoder.layers.27.input_layernorm"
         self_attn = "transformer.encoder.layers.27.self_attention"
-        lower_bound = 4e-2
+        lower_bound = 1e-1
         self.run_optimize_gpu_model(Name, Model, Tokenizer, model_path, self_attn, layer_norm, lower_bound)
 
     def Mistral_gpu_model(self, Name, Model, Tokenizer, model_path):
@@ -182,7 +182,7 @@ def Baichuan_gpu_model(self, Name, Model, Tokenizer, model_path):
         # currently only need to compare the output of one self-attention layer.
         layer_norm = "model.layers.31.input_layernorm"
         self_attn = "model.layers.31.self_attn"
-        lower_bound = 8e-3
+        lower_bound = 2e-2
         self.run_optimize_gpu_model(Name, Model, Tokenizer, model_path, self_attn, layer_norm, lower_bound)
 
     def Qwen_gpu_model(self, Name, Model, Tokenizer, model_path):

From c6348a4666dba7b5f3f686751bb3f4edc212761d Mon Sep 17 00:00:00 2001
From: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
Date: Wed, 4 Sep 2024 22:12:24 +0800
Subject: [PATCH 15/16] Update action.yml (#12016)

---
 .github/actions/llm/download-llm-binary/action.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/actions/llm/download-llm-binary/action.yml b/.github/actions/llm/download-llm-binary/action.yml
index 19de28ec43b..107432cf697 100644
--- a/.github/actions/llm/download-llm-binary/action.yml
+++ b/.github/actions/llm/download-llm-binary/action.yml
@@ -10,7 +10,7 @@ runs:
   using: "composite"
   steps:
     - name: Download all build files
-      uses: actions/download-artifact@4.1.7
+      uses: actions/download-artifact@v4.1.7
     - name: Move build resources
       shell: bash
       run: |

From 75b19f8522516fc0c15b240eeaed1c64164a016d Mon Sep 17 00:00:00 2001
From: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
Date: Wed, 4 Sep 2024 22:39:07 +0800
Subject: [PATCH 16/16] revert actions/download-artifact version to 3 (#12017)

---
 .github/actions/llm/download-llm-binary/action.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/actions/llm/download-llm-binary/action.yml b/.github/actions/llm/download-llm-binary/action.yml
index 107432cf697..c15bfe49f2a 100644
--- a/.github/actions/llm/download-llm-binary/action.yml
+++ b/.github/actions/llm/download-llm-binary/action.yml
@@ -10,7 +10,7 @@ runs:
   using: "composite"
   steps:
     - name: Download all build files
-      uses: actions/download-artifact@v4.1.7
+      uses: actions/download-artifact@v3
     - name: Move build resources
       shell: bash
       run: |
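
As a usage reference for the NPU save & load support introduced in [PATCH 11/16], the snippet below is a minimal, hypothetical sketch of how the new keyword arguments and the bound `save_low_bit`/`load_low_bit` methods fit together. The checkpoint id, save directory, and `load_in_low_bit` value are illustrative assumptions and are not taken from the patches above; only the argument names mirror those handled in `npu_model.py`.

```python
# Hypothetical sketch of the NPU optimized-model save & load flow.
# The model id, save directory, and low-bit format below are placeholder assumptions.
from ipex_llm.transformers.npu_model import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2-7B-Instruct"   # placeholder checkpoint
save_dir = "./npu-model-low-bit"        # placeholder output folder

# First run: quantize and apply the NPU optimizations, then persist the converted weights.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="sym_int4",      # assumed low-bit format
    optimize_model=True,
    max_output_len=1024,
    max_prompt_len=512,              # must stay below max_output_len per the new check
    transpose_value_cache=True,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.save_low_bit(save_dir)   # method bound in from_pretrained via types.MethodType

# Later runs: reload the already-converted model; load_low_bit now accepts the same
# optimize_model / max_output_len / max_prompt_len / inter_pp / intra_pp /
# transpose_value_cache / modules_to_not_convert arguments instead of ignoring them.
model = AutoModelForCausalLM.load_low_bit(
    save_dir,
    optimize_model=True,
    max_output_len=1024,
    max_prompt_len=512,
    transpose_value_cache=True,
    trust_remote_code=True,
)

inputs = tokenizer("What is AI?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```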