
Commit ed35455: update
songhappy committed Oct 27, 2023
1 parent 205735d
Showing 2 changed files with 4 additions and 1 deletion.
@@ -1,6 +1,6 @@
# Low-Bit Streaming LLM using BigDL-LLM

-In this example, we apply [Streaming-LLM](https://github.com/mit-han-lab/streaming-llm/tree/main#efficient-streaming-language-models-with-attention-sinks) using BigDL-LLM, which can deploy low-bit(including INT8/INT5/INT4) LLMs for infinite-length inputs.
+In this example, we apply [Streaming-LLM](https://github.com/mit-han-lab/streaming-llm/tree/main#efficient-streaming-language-models-with-attention-sinks) using BigDL-LLM, which can deploy low-bit (including FP4/INT4/FP8/INT8) LLMs for infinite-length inputs.
Only one code change is needed to load the model using bigdl-llm as follows:
```python
from bigdl.llm.transformers import AutoModelForCausalLM
```
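Reading the two hunks of this commit together, the resulting loading pattern looks roughly like the sketch below. The function name `load_low_bit_model` is a hypothetical wrapper for illustration; `load_in_4bit=True`, `optimize_model=False`, and `trust_remote_code=True` come from the diff itself, and the import is deferred into the function body only so the sketch can be read and checked without bigdl-llm installed.

```python
def load_low_bit_model(model_name_or_path: str):
    """Load a low-bit (INT4) model via bigdl-llm's drop-in transformers API.

    Hypothetical helper sketching the pattern in this commit; in the
    example script the import and call sit at module/function level.
    """
    # The one code change vs. plain transformers: import AutoModelForCausalLM
    # from bigdl.llm.transformers instead of transformers.
    from bigdl.llm.transformers import AutoModelForCausalLM

    # load_in_4bit=True enables the low-bit performance boost;
    # optimize_model=False is kept off for now, per the TODO in the diff.
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        load_in_4bit=True,
        optimize_model=False,
        trust_remote_code=True,
    )
    return model
```

Everything else in the script (tokenizer, generation loop) stays on the standard transformers API, which is what makes this a one-line migration.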
@@ -48,6 +48,7 @@
 import urllib.request
 import os
 import json
+# code change to import from bigdl-llm API instead of using transformers API
 from bigdl.llm.transformers import AutoModelForCausalLM
 from transformers import LlamaTokenizer
 import intel_extension_for_pytorch as ipex
@@ -61,6 +62,8 @@ def load(model_name_or_path):
         trust_remote_code=True,
     )

+    # set load_in_4bit=True to get a performance boost; set optimize_model=False for now
+    # TODO: align the logic of optimize_model and streaming
     model = AutoModelForCausalLM.from_pretrained(
         model_name_or_path,
         load_in_4bit=True,
