diff --git "a/Chinese_Version /ch_8_Applications/\345\237\272\344\272\216Intel13\344\273\243CPU\347\232\204\345\244\247\350\257\255\350\250\200\346\250\241\345\236\213\345\272\224\347\224\250\345\274\200\345\217\221\346\214\207\345\215\227.md" "b/Chinese_Version /ch_8_Applications/\345\237\272\344\272\216Intel13\344\273\243CPU\347\232\204\345\244\247\350\257\255\350\250\200\346\250\241\345\236\213\345\272\224\347\224\250\345\274\200\345\217\221\346\214\207\345\215\227.md" new file mode 100644 index 0000000..9537f8a --- /dev/null +++ "b/Chinese_Version /ch_8_Applications/\345\237\272\344\272\216Intel13\344\273\243CPU\347\232\204\345\244\247\350\257\255\350\250\200\346\250\241\345\236\213\345\272\224\347\224\250\345\274\200\345\217\221\346\214\207\345\215\227.md" @@ -0,0 +1,280 @@ +# 基于Intel 13代CPU的大语言模型应用开发指南 + +本文档介绍如何开发大语言模型应用UI,基于开源的intel bigdl-llm库和gradio。UI跑在windows11 x86 CPU上,实现在PC 16GB内存上运行优化的Native INT4 大语言模型。以三个大语言模型为例,ChatGLM2 (6B)中英,LLaMA2 (13B)英,StarCoder (15.5B)中英。 +## 1 安装环境 +(1)Windows11安装Miniconda3-py39_23.5.2-0-Windows-x86_64.exe,下载链接: +https://docs.conda.io/en/latest/miniconda.html#windows-installers + +(2)打开Anaconda Powershell Prompt窗口 +``` + conda create -n llm python=3.9 + conda activate llm + pip install --pre --upgrade bigdl-llm[all] + pip install gradio==3.41.1 mdtex2html +``` +或者用指定版本的方式安装 +``` + pip install --pre bigdl-llm[all]==2.4.0b20230820 -i https://pypi.tuna.tsinghua.edu.cn/simple +``` +## 2 LLM模型转换 +以Chatglm2,llama2,starcoder为例,下载hugging face FP16模型。模型下载链接: + +· ChatGLM2-6B:https://huggingface.co/THUDM/chatglm2-6b/tree/main + +· Llama2-13B: https://huggingface.co/meta-llama/Llama-2-13b-chat-hf/tree/main + +· StarCoder: https://huggingface.co/bigcode/starcoder/tree/main + +### 2.1 FP16转Native INT4模型,并调用python函数 (推荐运行在CPU) +Chatglm2 ,llama2,starcoder转native INT4。 + +打开Anaconda PowerShell,修改模型路径和输出文件夹名称,并运行: +``` + conda activate llm + llm-convert "C:/llm-models/chatglm2-6b/" --model-format pth --model-family "chatglm" --outfile "checkpoint/" + llm-convert "C:/llm-models/llama-2-13b-chat-hf/" --model-format pth --model-family "llama" --outfile "checkpoint/" + llm-convert "C:/llm-models/starcoder/" --model-format pth --model-family "starcoder" --outfile "checkpoint/" +``` +Note:starcoder用16GB内存的机器转不了Native INT4,因为内存不够。建议转starcoder native INT4用更大的内存的机器。 + +#### python调用Native INT4模型。 +参数解释: + +(1)n_threads=CPU大核数*2+小核数 或者 + +n_threads=CPU大核数*2+小核数 - 1 或者 + +n_threads=CPU大核数*2+小核数 -2 + +不同设备可以尝试这3个参数,选择一个最优参数。 + +(2)n_ctx=4096表示模型最长的输入+输出文本等于4096 tokens +``` +from bigdl.llm.ggml.model.chatglm.chatglm import ChatGLM +from bigdl.llm.transformers import BigdlNativeForCausalLM +model_name = "chatglm2-6b" +model_all_local_path = "C:\\PC_LLM\\checkpoint\\" +if model_name == "chatglm2-6b": + model = ChatGLM(model_all_local_path + "\\ggml-chatglm2-6b-q4_0.bin", n_threads=20,n_ctx=4096) + +elif model_name == "llama2-13b": + model = BigdlNativeForCausalLM.from_pretrained( + pretrained_model_name_or_path=model_all_local_path + "\\bigdl_llm_llama2_13b_q4_0.bin", + model_family='llama',n_threads=20,n_ctx=4096) +elif model_name == "StarCoder": + model = BigdlNativeForCausalLM.from_pretrained( + pretrained_model_name_or_path=model_all_local_path + "\\bigdl_llm_starcoder_q4_0.bin", + model_family='starcoder',n_threads=20,n_ctx=4096) +``` +### 2.2 FP16转transformer INT4,并调用python函数 +Transformer INT4在CPU上运行性能比Native INT4低一些。 + +用python脚本转换模型为transformer INT4 +``` +from bigdl.llm.transformers import AutoModel +from transformers import AutoTokenizer +from bigdl.llm.transformers import 
### 2.2 Converting FP16 to Transformer INT4 and loading from Python
Transformer INT4 runs somewhat slower on CPU than Native INT4.

Convert the model to Transformer INT4 with a Python script:
```
from bigdl.llm.transformers import AutoModel
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_name = "chatglm2-6b"
model_all_local_path = "C:\\PC_LLM\\checkpoint\\"
model_name_local = model_all_local_path + model_name

if model_name == "chatglm2-6b":
    tokenizer = AutoTokenizer.from_pretrained(model_name_local, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name_local, trust_remote_code=True, load_in_4bit=True)
    model.save_low_bit("D:\\llm-models\\chatglm2-6b-int4\\")
    tokenizer.save_pretrained("D:\\llm-models\\chatglm2-6b-int4\\")
elif model_name == "llama2-13b" or model_name == "StarCoder":
    tokenizer = AutoTokenizer.from_pretrained(model_name_local, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_name_local, trust_remote_code=True, load_in_4bit=True)
    model.save_low_bit("D:\\llm-models\\" + model_name)
    tokenizer.save_pretrained("D:\\llm-models\\" + model_name)
```
Load the Transformer INT4 model from Python:
```
if model_name == "chatglm2-6b":
    model = AutoModel.load_low_bit(model_name_local, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name_local, trust_remote_code=True)
    model = model.eval()
elif model_name == "llama2-13b" or model_name == "StarCoder":
    model = AutoModelForCausalLM.load_low_bit(model_name_local, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name_local, trust_remote_code=True)
    model = model.eval()
```
## 3 Benchmarking the LLM on CPU
Running the CPU benchmark with the Native INT4 models uses all cores, which makes the results directly comparable with the application UI's performance metrics.

Open an Anaconda PowerShell Prompt:
```
conda activate llm
# ChatGLM2:
llm-cli -t 20 -x chatglm -m "ggml-chatglm2-6b-q4_0.bin" -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" --no-mmap -v -n 32
# Llama2:
llm-cli -t 20 -x llama -m "bigdl_llm_llama2_13b_q4_0.bin" -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" --no-mmap -n 32
# Starcoder:
llm-cli -t 20 -x starcoder -m "bigdl_llm_starcoder_q4_0.bin" -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" --no-mmap -n 32
```
Parameter note: -n 32 caps the output at 32 tokens.

Extract the performance figures from the command-line output, as in the figure:

Input tokens: 32 tokens

Output tokens: 32 tokens (31 runs = 32 tokens − 1st token)

1st token avg latency (ms) = 1541.56 ms

2nd+ token avg latency (ms/token) = 125.62 ms per token

Figure 1: llm-cli output

## 4 Streaming the output token by token
### 4.1 (Recommended on CPU): Native INT4 for chatglm2, llama2, and starcoder
```
from bigdl.llm.ggml.model.chatglm.chatglm import ChatGLM

model_name = "chatglm2-6b"
model_all_local_path = "C:\\PC_LLM\\checkpoint\\"
prompt = "What is AI?"
if model_name == "chatglm2-6b":
    model = ChatGLM(model_all_local_path + "ggml-chatglm2-6b-q4_0.bin", n_threads=20, n_ctx=4096)
    response = ""
    for chunk in model(prompt, temperature=0.95, top_p=0.8, stream=True, max_tokens=512):
        response += chunk['choices'][0]['text']
```
llama2 and starcoder are streamed the same way, also via a for loop over the chunks; see the sketch after the parameter notes below.

Parameter notes:

· Temperature (higher values make the output more random), adjustable from 0 to 1

· Top P (higher values increase the diversity of word choices), adjustable from 0 to 1

· Max Length (maximum number of output tokens), adjustable from 0 to 2048; the upper bound is model-dependent. These three models support n_ctx up to 8k, so input + output tokens should stay below 8k.
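As a concrete version of the sentence above, here is a hedged sketch of streaming with the llama2 Native INT4 model. It assumes the model object returned by BigdlNativeForCausalLM.from_pretrained (section 2.1) accepts the same completion-style call and chunk format as the chatglm2 example; the file path and sampling values are placeholders.
```
# Sketch: streaming with the llama2 Native INT4 model.
# Assumes the same call interface and chunk format as the chatglm2 example above.
from bigdl.llm.transformers import BigdlNativeForCausalLM

model = BigdlNativeForCausalLM.from_pretrained(
    pretrained_model_name_or_path="C:\\PC_LLM\\checkpoint\\bigdl_llm_llama2_13b_q4_0.bin",
    model_family='llama', n_threads=20, n_ctx=4096)
response = ""
for chunk in model("What is AI?", temperature=0.95, top_p=0.8, stream=True, max_tokens=512):
    piece = chunk['choices'][0]['text']   # each chunk carries one new piece of text
    response += piece
    print(piece, end="", flush=True)
```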
### 4.2 Transformer INT4 stream_chat (chatglm2 only)
```
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import torch

model_name = "chatglm2-6b"
model_all_local_path = "C:\\PC_LLM\\checkpoint\\"
model_name_local = model_all_local_path + model_name
prompt = "What is AI?"
history = []  # start a fresh conversation

model = AutoModel.load_low_bit(model_name_local, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_local, trust_remote_code=True)
model = model.eval()
with torch.inference_mode():
    for response, history in model.stream_chat(tokenizer, prompt, history, max_length=512, top_p=0.9, temperature=0.9):
        print(response)
```
### 4.3 Transformer INT4 TextIteratorStreamer for chatglm2, llama2, and starcoder
```
import time
from threading import Thread

import torch
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer, TextIteratorStreamer
from benchmark_util import BenchmarkWrapper  # benchmark helper shipped with the bigdl-llm examples

model_name = "chatglm2-6b"
model_all_local_path = "C:\\PC_LLM\\checkpoint\\"
model_name_local = model_all_local_path + model_name
model = AutoModel.load_low_bit(model_name_local, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_local, trust_remote_code=True)
model = model.eval()
prompt = "What is AI?"
with torch.inference_mode():
    model = BenchmarkWrapper(model)
    inputs = tokenizer(prompt, return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    response = ""
    timeStart = time.time()
    # Run generation on a background thread so the main thread can consume the streamer.
    generate_kwargs = dict(**inputs, streamer=streamer, temperature=0.9, top_p=0.9, max_new_tokens=512)
    thread = Thread(target=model.generate, kwargs=generate_kwargs)
    thread.start()

    for new_text in streamer:
        response += new_text
    timeCost = time.time() - timeStart
    token_count_input = len(tokenizer.tokenize(prompt))
```

## 5 Adding multi-turn chat with history
### 5.1 chatglm2 Transformer INT4 stream_chat only
See the code in section 4.2: stream_chat returns the updated history, which is passed back in on the next call.
### 5.2 Adding multi-turn history for Native INT4
```
import time
from bigdl.llm.ggml.model.chatglm.chatglm import ChatGLM

model_name = "chatglm2-6b"
model_all_local_path = "C:\\PC_LLM\\checkpoint\\"
history_round = 0
history = ""
if model_name == "chatglm2-6b":
    model = ChatGLM(model_all_local_path + "ggml-chatglm2-6b-q4_0.bin", n_threads=20, n_ctx=4096)

def predict(input):
    global history_round, model, history
    response = ""
    # Reset the chatbot once the history exceeds 2500 tokens or 5 rounds.
    if len(model.tokenize(history)) > 2500 or history_round >= 5:
        history_round = 0
        history = ""
        print("*********** reset chatbot and history", history)

    if len(history) == 0:
        print("*********** new chat ")
        prompt = input
        history = prompt
        history_round = 1
    else:
        prompt = history + '\n' + input
        history_round += 1
        print("******************* history_round ", history_round)

    timeStart = time.time()
    for chunk in model(prompt, temperature=0.9, top_p=0.9, stream=True, max_tokens=512):
        response += chunk['choices'][0]['text']
    history = prompt + response
    print("******** max_length history", len(model.tokenize(history)))

input = "你好"
predict(input)
input = "请进行丽江三天必游景点旅游规划"
predict(input)
```
### 5.3 For Transformer INT4 TextIteratorStreamer, use the same approach as 5.2
## 6 Building the Web UI with Gradio
Download the code: https://github.com/KiwiHana/LLM_UI_Windows_CPU

![image](https://github.com/KiwiHana/bigdl-llm-tutorial/assets/102839943/5a399c7e-31b4-4337-a6a4-bc6f8bccb93c)
Figure 2: The LLM_UI_Windows_CPU interface


To use all cores, open an Anaconda Powershell Prompt window as administrator and run LLM_demo_v1.0.py or LLM_demo_v2.0.py:
```
git clone https://github.com/KiwiHana/LLM_UI_Windows_CPU.git
cd LLM_UI_Windows_CPU
conda activate llm
python LLM_demo_v1.0.py
```
Note: edit the model storage path in the main function at line 285 of LLM_demo_v1.0.py,

e.g. model_all_local_path = "C:/Users/username/checkpoint/"
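For orientation, here is a minimal sketch of what such a Gradio chat UI looks like. It is not the LLM_demo script itself; it assumes the Native INT4 ChatGLM model and streaming call from section 4.1, and the model path and sampling values are placeholders.
```
# Minimal Gradio chat sketch (not the actual LLM_demo script).
# Assumes the Native INT4 ChatGLM model and streaming call from section 4.1.
import gradio as gr
from bigdl.llm.ggml.model.chatglm.chatglm import ChatGLM

model = ChatGLM("C:\\PC_LLM\\checkpoint\\ggml-chatglm2-6b-q4_0.bin", n_threads=20, n_ctx=4096)

def chat(user_input, chatbot):
    chatbot = chatbot + [[user_input, ""]]
    # Stream chunks into the last chatbot message so the UI updates token by token.
    for chunk in model(user_input, temperature=0.95, top_p=0.8, stream=True, max_tokens=512):
        chatbot[-1][1] += chunk['choices'][0]['text']
        yield chatbot

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox(label="Input")
    msg.submit(chat, inputs=[msg, chatbot], outputs=chatbot)

demo.queue().launch()  # queue() is required for generator (streaming) callbacks in Gradio 3.x
```
The actual demo additionally exposes the temperature, Top P, and Max Length sliders described in section 4.1 and loads the theme from theme3.json; this sketch only shows the streaming wiring.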
· The LLM application UI v1.0 folder should contain:

LLM_demo_v1.0.py

theme3.json

checkpoint

-- bigdl_llm_llama2_13b_q4_0.bin

-- bigdl_llm_starcoder_q4_0.bin

-- ggml-chatglm2-6b-q4_0.bin


Reference link:

https://github.com/intel-analytics/bigdl-llm-tutorial/tree/main/ch_2_Environment_Setup
diff --git "a/Chinese_Version /ch_8_Applications/\345\237\272\344\272\216Xeon\345\222\214SPR\347\232\204\345\244\247\350\257\255\350\250\200\346\250\241\345\236\213\345\272\224\347\224\250\345\274\200\345\217\221.md" "b/Chinese_Version /ch_8_Applications/\345\237\272\344\272\216Xeon\345\222\214SPR\347\232\204\345\244\247\350\257\255\350\250\200\346\250\241\345\236\213\345\272\224\347\224\250\345\274\200\345\217\221.md"
new file mode 100644
index 0000000..fdd1bdc
--- /dev/null
+++ "b/Chinese_Version /ch_8_Applications/\345\237\272\344\272\216Xeon\345\222\214SPR\347\232\204\345\244\247\350\257\255\350\250\200\346\250\241\345\236\213\345\272\224\347\224\250\345\274\200\345\217\221.md"
@@ -0,0 +1,315 @@

# A Guide to Developing Large Language Model Applications on Intel Xeon and SPR

This document describes how to build a large language model (LLM) application UI based on the open-source Intel bigdl-llm library and Gradio.
The UI runs on Windows 11 x86 CPUs or on Ubuntu CPUs and serves optimized Native INT4 LLMs on SPR (Sapphire Rapids) systems with at least six 16 GB DIMMs installed.
Two LLMs are used as examples: ChatGLM2 (6B, Chinese/English) and LLaMA2 (13B, English).

Note: Server systems as shipped to customers usually do not have every CPU memory slot populated. In that case the DIMMs must be installed in specific slots according to the actual number of modules; otherwise system stability and performance may suffer. Below is the Eagle Stream memory population diagram.
![image](https://github.com/KiwiHana/bigdl-llm-tutorial/assets/102839943/a54c74cc-6581-4f9e-b2b4-3780bbcfe2a6)


## 1 Setting up the environment
(1) On Windows 11, install Miniconda3-py39_23.5.2-0-Windows-x86_64.exe. Download link:
https://docs.conda.io/en/latest/miniconda.html#windows-installers

On Ubuntu, download and install:
```
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_23.5.2-0-Linux-x86_64.sh
chmod +x Miniconda3-py39_23.5.2-0-Linux-x86_64.sh
./Miniconda3-py39_23.5.2-0-Linux-x86_64.sh
sudo apt install numactl
```

(2) Open an Anaconda Powershell Prompt window and run:
```
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all]
pip install bigdl-nano
pip install gradio==3.41.1 mdtex2html
```
Alternatively, install a pinned version:
```
pip install --pre bigdl-llm[all]==2.4.0b20231110 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
## 2 Converting the LLM models
Using ChatGLM2 and LLaMA2 as examples, download the FP16 models from Hugging Face:

· ChatGLM2-6B: https://huggingface.co/THUDM/chatglm2-6b/tree/main

· Llama2-13B: https://huggingface.co/meta-llama/Llama-2-13b-chat-hf/tree/main


### 2.1 Converting FP16 to Native INT4 and loading the model from Python (recommended on CPU)
Convert ChatGLM2 and LLaMA2 to Native INT4.

Open Anaconda PowerShell, adjust the model paths and output folder name, then run:
```
conda activate llm
llm-convert "/llm-models/chatglm2-6b/" --model-format pth --model-family "chatglm" --outfile "checkpoint/"
llm-convert "/llm-models/llama-2-13b-chat-hf/" --model-format pth --model-family "llama" --outfile "checkpoint/"
```


#### Loading the Native INT4 model from Python
Parameter notes:

(1) n_threads: on Xeon, set OMP_NUM_THREADS and n_threads to the number of physical cores in the first socket.
On a two-socket SPR system, pin inference to all the physical cores of the first socket.

(2) n_ctx=4096 means the model's combined input + output length is capped at 4096 tokens.
```
from bigdl.llm.ggml.model.chatglm.chatglm import ChatGLM
from bigdl.llm.transformers import BigdlNativeForCausalLM

model_name = "chatglm2-6b"
model_all_local_path = "C:\\PC_LLM\\checkpoint\\"
if model_name == "chatglm2-6b":
    model = ChatGLM(model_all_local_path + "ggml-chatglm2-6b-q4_0.bin", n_threads=20, n_ctx=4096)
elif model_name == "llama2-13b":
    model = BigdlNativeForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_all_local_path + "bigdl_llm_llama2_13b_q4_0.bin",
        model_family='llama', n_threads=20, n_ctx=4096)
```
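To pick n_threads programmatically rather than hard-coding it, the per-socket physical core count can be derived at runtime. This is a hedged sketch: it assumes psutil is installed (pip install psutil) and that physical cores are split evenly across two sockets; adjust NUM_SOCKETS for your system.
```
# Sketch: derive n_threads for a two-socket Xeon.
# Assumes psutil is installed and cores are split evenly across sockets.
import os
import psutil

NUM_SOCKETS = 2  # assumption: a two-socket SPR system
physical_cores = psutil.cpu_count(logical=False)  # total physical cores across all sockets
n_threads = physical_cores // NUM_SOCKETS         # physical cores in the first socket

# Set OMP_NUM_THREADS before importing torch/bigdl-llm so the OpenMP runtime picks it up.
os.environ["OMP_NUM_THREADS"] = str(n_threads)
print(f"Using n_threads={n_threads}")
```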
### 2.2 Converting FP16 to Transformer INT4 and loading from Python
Transformer INT4 runs somewhat slower on CPU than Native INT4.

Convert the model to Transformer INT4 with a Python script:
```
from bigdl.llm.transformers import AutoModel
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_name = "chatglm2-6b"
model_all_local_path = "C:\\PC_LLM\\checkpoint\\"
model_name_local = model_all_local_path + model_name

if model_name == "chatglm2-6b":
    tokenizer = AutoTokenizer.from_pretrained(model_name_local, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name_local, trust_remote_code=True, load_in_4bit=True)
    model.save_low_bit("D:\\llm-models\\chatglm2-6b-int4\\")
    tokenizer.save_pretrained("D:\\llm-models\\chatglm2-6b-int4\\")
elif model_name == "llama2-13b":
    tokenizer = AutoTokenizer.from_pretrained(model_name_local, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_name_local, trust_remote_code=True, load_in_4bit=True)
    model.save_low_bit("D:\\llm-models\\" + model_name)
    tokenizer.save_pretrained("D:\\llm-models\\" + model_name)
```
Load the Transformer INT4 model from Python:
```
if model_name == "chatglm2-6b":
    model = AutoModel.load_low_bit(model_name_local, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name_local, trust_remote_code=True)
    model = model.eval()
elif model_name == "llama2-13b":
    model = AutoModelForCausalLM.load_low_bit(model_name_local, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name_local, trust_remote_code=True)
    model = model.eval()
```
## 3 Benchmarking the LLM on CPU
Running the CPU benchmark with the Native INT4 models uses all cores, which makes the results directly comparable with the application UI's performance metrics.

On Ubuntu, assuming the first socket of the SPR system has 48 physical cores, prefix each command with numactl -C 0-47 -m 0:
```
$ lscpu
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191

# Therefore, set the parameters like this:
$ export OMP_NUM_THREADS=48
$ numactl -C 0-47 -m 0 llm-cli -t 48 ……
```

```
sudo apt install numactl
conda create -n llm python=3.9
conda activate llm
pip install bigdl-llm[all]
pip install bigdl-nano
source bigdl-nano-init -c
export OMP_NUM_THREADS=48
export TRANSFORMERS_OFFLINE=1
numactl -C 0-47 -m 0 llm-cli -t 48 -x chatglm -m "./checkpoint/bigdl_llm_chatglm_q4_0.bin" -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" --no-mmap -v -n 32

numactl -C 0-47 -m 0 llm-cli -t 48 -x llama -m "bigdl_llm_llama2_13b_q4_0.bin" -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" --no-mmap -n 32
```

On Windows, assuming the first socket of the SPR system has 48 physical cores, prefix the command with start /node 0:
```
conda create -n llm python=3.9
conda activate llm
pip install bigdl-llm[all]
pip install bigdl-nano

start /node 0 llm-cli -t 48 -x chatglm -m "./checkpoint/bigdl_llm_chatglm_q4_0.bin" -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" --no-mmap -v -n 32
```

Parameter note: -n 32 caps the output at 32 tokens.

Extract the performance figures from the command-line output, as in the figure:

Input tokens: 32 tokens

Output tokens: 32 tokens (31 runs = 32 tokens − 1st token)

1st token avg latency (ms) = 1541.56 ms

2nd+ token avg latency (ms/token) = 125.62 ms per token

Figure 1: llm-cli output
![image](https://github.com/KiwiHana/bigdl-llm-tutorial/assets/102839943/5adf144a-5fc5-432f-b476-f26d341fbced)
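The latencies above convert directly into throughput figures, which are often easier to compare across machines. A small sketch of the arithmetic, using the example numbers from Figure 1:
```
# Converting the Figure 1 latencies into throughput (example numbers).
first_token_ms = 1541.56   # 1st token avg latency
next_token_ms = 125.62     # 2nd+ token avg latency per token
output_tokens = 32

decode_tps = 1000.0 / next_token_ms  # steady-state decode speed, ~7.96 tokens/s
total_ms = first_token_ms + (output_tokens - 1) * next_token_ms
end_to_end_tps = output_tokens * 1000.0 / total_ms  # includes the 1st-token cost
print(f"decode: {decode_tps:.2f} tok/s, end-to-end: {end_to_end_tps:.2f} tok/s")
```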
## 4 Streaming the output token by token
### 4.1 (Recommended on CPU): Native INT4 for chatglm2 and llama2
```
from bigdl.llm.ggml.model.chatglm.chatglm import ChatGLM

model_name = "chatglm2-6b"
model_all_local_path = "C:\\PC_LLM\\checkpoint\\"
prompt = "What is AI?"
if model_name == "chatglm2-6b":
    model = ChatGLM(model_all_local_path + "ggml-chatglm2-6b-q4_0.bin", n_threads=20, n_ctx=4096)
    response = ""
    for chunk in model(prompt, temperature=0.95, top_p=0.8, stream=True, max_tokens=512):
        response += chunk['choices'][0]['text']
```
llama2 is streamed the same way, also via a for loop over the chunks.

Parameter notes:

· Temperature (higher values make the output more random), adjustable from 0 to 1

· Top P (higher values increase the diversity of word choices), adjustable from 0 to 1

· Max Length (maximum number of output tokens), adjustable from 0 to 2048; the upper bound is model-dependent. These models support n_ctx up to 8k, so input + output tokens should stay below 8k.

### 4.2 Transformer INT4 stream_chat (chatglm2 only)
```
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import torch

model_name = "chatglm2-6b"
model_all_local_path = "C:\\PC_LLM\\checkpoint\\"
model_name_local = model_all_local_path + model_name
prompt = "What is AI?"
history = []  # start a fresh conversation

model = AutoModel.load_low_bit(model_name_local, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_local, trust_remote_code=True)
model = model.eval()
with torch.inference_mode():
    for response, history in model.stream_chat(tokenizer, prompt, history, max_length=512, top_p=0.9, temperature=0.9):
        print(response)
```
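Because stream_chat yields the updated history alongside each partial response, multi-turn chat (section 5.1) only requires feeding that history back in on the next call. A brief sketch under the same setup as above; the turn texts are placeholders:
```
# Sketch: carrying history across turns with chatglm2's stream_chat
# (reuses the model/tokenizer objects loaded above).
history = []
for turn in ["What is AI?", "Give one everyday example."]:
    with torch.inference_mode():
        for response, history in model.stream_chat(tokenizer, turn, history,
                                                   max_length=512, top_p=0.9, temperature=0.9):
            pass  # 'response' grows incrementally; consume it here for a streaming UI
    print(f"User: {turn}\nBot: {response}\n")
```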
### 4.3 Transformer INT4 TextIteratorStreamer for chatglm2 and llama2
```
import time
from threading import Thread

import torch
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer, TextIteratorStreamer
from benchmark_util import BenchmarkWrapper  # benchmark helper shipped with the bigdl-llm examples

model_name = "chatglm2-6b"
model_all_local_path = "C:\\PC_LLM\\checkpoint\\"
model_name_local = model_all_local_path + model_name
model = AutoModel.load_low_bit(model_name_local, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_local, trust_remote_code=True)
model = model.eval()
prompt = "What is AI?"
with torch.inference_mode():
    model = BenchmarkWrapper(model)
    inputs = tokenizer(prompt, return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    response = ""
    timeStart = time.time()
    # Run generation on a background thread so the main thread can consume the streamer.
    generate_kwargs = dict(**inputs, streamer=streamer, temperature=0.9, top_p=0.9, max_new_tokens=512)
    thread = Thread(target=model.generate, kwargs=generate_kwargs)
    thread.start()

    for new_text in streamer:
        response += new_text
    timeCost = time.time() - timeStart
    token_count_input = len(tokenizer.tokenize(prompt))
```

## 5 Adding multi-turn chat with history
### 5.1 chatglm2 Transformer INT4 stream_chat only
See the code in section 4.2: stream_chat returns the updated history, which is passed back in on the next call.
### 5.2 Adding multi-turn history for Native INT4
```
import time
from bigdl.llm.ggml.model.chatglm.chatglm import ChatGLM

model_name = "chatglm2-6b"
model_all_local_path = "C:\\PC_LLM\\checkpoint\\"
history_round = 0
history = ""
if model_name == "chatglm2-6b":
    model = ChatGLM(model_all_local_path + "ggml-chatglm2-6b-q4_0.bin", n_threads=20, n_ctx=4096)

def predict(input):
    global history_round, model, history
    response = ""
    # Reset the chatbot once the history exceeds 2500 tokens or 5 rounds.
    if len(model.tokenize(history)) > 2500 or history_round >= 5:
        history_round = 0
        history = ""
        print("*********** reset chatbot and history", history)

    if len(history) == 0:
        print("*********** new chat ")
        prompt = input
        history = prompt
        history_round = 1
    else:
        prompt = history + '\n' + input
        history_round += 1
        print("******************* history_round ", history_round)

    timeStart = time.time()
    for chunk in model(prompt, temperature=0.9, top_p=0.9, stream=True, max_tokens=512):
        response += chunk['choices'][0]['text']
    history = prompt + response
    print("******** max_length history", len(model.tokenize(history)))

input = "你好"
predict(input)
input = "请进行丽江三天必游景点旅游规划"
predict(input)
```
### 5.3 For Transformer INT4 TextIteratorStreamer, use the same approach as 5.2
## 6 Building the Web UI with Gradio
Download the code: https://github.com/KiwiHana/LLM_UI_Windows_CPU

![image](https://github.com/KiwiHana/bigdl-llm-tutorial/assets/102839943/5a399c7e-31b4-4337-a6a4-bc6f8bccb93c)
Figure 2: The LLM_UI_Windows_CPU interface


To use all cores, open an Anaconda Powershell Prompt window as administrator and run LLM_demo_v1.0.py or LLM_demo_v2.0.py:
```
git clone https://github.com/KiwiHana/LLM_UI_Windows_CPU.git
cd LLM_UI_Windows_CPU
conda activate llm
python LLM_demo_v1.0.py
```
Note: edit the model storage path in the main function at line 285 of LLM_demo_v1.0.py,

e.g. model_all_local_path = "C:/Users/username/checkpoint/"

· The LLM application UI v1.0 folder should contain:

LLM_demo_v1.0.py

theme3.json

checkpoint

-- bigdl_llm_llama2_13b_q4_0.bin

-- ggml-chatglm2-6b-q4_0.bin


Reference link:

https://github.com/intel-analytics/bigdl-llm-tutorial/tree/main/ch_2_Environment_Setup