You can use IPEX-LLM to load any Hugging Face transformers model for acceleration on Intel GPUs. With IPEX-LLM, PyTorch models (in FP16/BF16/FP32) hosted on Hugging Face can be loaded and optimized automatically on Intel GPUs with low-bit quantization (supported precisions include INT4/NF4/INT5/INT8).
In this tutorial, you will learn how to run LLMs on Intel GPUs with IPEX-LLM optimizations and, based on that, build a stream chatbot. The popular open-source LLM meta-llama/Llama-2-7b-chat-hf is used as an example.
First of all, install IPEX-LLM in your prepared environment. For best practices of environment setup on Intel GPUs, refer to the README in this chapter.
In a terminal, run:
pip install --pre --upgrade ipex-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
Note: If you are using an older version of ipex-llm (specifically, older than 2.5.0b20240104), you need to manually add import intel_extension_for_pytorch as ipex at the beginning of your code.
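For example, on such an older version the top of your script would look roughly like this (a minimal sketch; with 2.5.0b20240104 or newer the extra import is not needed):
# only needed for ipex-llm versions older than 2.5.0b20240104
import intel_extension_for_pytorch as ipex  # add this at the beginning of your code

from ipex_llm.transformers import AutoModelForCausalLM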
It is also required to set oneAPI environment variables for IPEX-LLM on Intel GPUs.
# configure oneAPI environment variables
source /opt/intel/oneapi/setvars.sh
After installation and environment setup, let's move on to the Python scripts of this tutorial.
To download the meta-llama/Llama-2-7b-chat-hf model from Hugging Face, you will need to be granted access by Meta. Please follow the instructions provided here to request access to the model.
After access is granted, download the model with your Hugging Face token:
from huggingface_hub import snapshot_download
model_path = snapshot_download(repo_id='meta-llama/Llama-2-7b-chat-hf',
                               token='hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') # change it to your own Hugging Face access token
Note: The model will by default be downloaded to HF_HOME='~/.cache/huggingface'.
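If you prefer to keep the model somewhere else, you can, for instance, pass a custom cache_dir to snapshot_download (a minimal sketch; the directory below is only a placeholder):
from huggingface_hub import snapshot_download

# download the model into a custom directory instead of the default HF_HOME cache
model_path = snapshot_download(repo_id='meta-llama/Llama-2-7b-chat-hf',
                               token='hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX', # change it to your own Hugging Face access token
                               cache_dir='/path/to/your/model/cache')         # placeholder path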
One common use case is to load a Hugging Face transformers model in low precision, i.e. conduct implicit quantization while loading.
For Llama 2 (7B), you could simply import ipex_llm.transformers.AutoModelForCausalLM instead of transformers.AutoModelForCausalLM, and specify either load_in_4bit=True or the load_in_low_bit parameter accordingly in the from_pretrained function.
For Intel GPUs, once you have the model in low precision, move it to the GPU with to('xpu').
For INT4 Optimizations (with load_in_4bit=True):
from ipex_llm.transformers import AutoModelForCausalLM
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model_in_4bit = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
                                                     load_in_4bit=True)
model_in_4bit_gpu = model_in_4bit.to('xpu')
Note: IPEX-LLM supports AutoModel, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq and AutoModelForSeq2SeqLM.
Note: If you have already downloaded the Llama 2 (7B) model and skipped step 7.1.2.2, you could specify pretrained_model_name_or_path to the model path.
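For example, if the model already lives in a local directory, the call looks roughly like this (a minimal sketch; the path below is only a placeholder):
from ipex_llm.transformers import AutoModelForCausalLM

# load from a local model directory instead of the Hugging Face Hub model id
model_in_4bit = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="/path/to/Llama-2-7b-chat-hf",
                                                     load_in_4bit=True)
model_in_4bit_gpu = model_in_4bit.to('xpu')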
(Optional) For INT8 Optimizations (with load_in_low_bit="sym_int8"):
from ipex_llm.transformers import AutoModelForCausalLM
# When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
model_in_8bit = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
    load_in_low_bit="sym_int8"
)
model_in_8bit_gpu = model_in_8bit.to('xpu')
Note: Currently, load_in_low_bit supports the options 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5' and 'sym_int8', in which 'sym' and 'asym' differentiate between symmetric and asymmetric quantization. The option 'nf4', referring to 4-bit NormalFloat, is also supported, as are the floating-point precisions 'fp4', 'fp8' and 'fp16' and the mixed precisions 'mixed_fp4' and 'mixed_fp8'.
Note: load_in_4bit=True is equivalent to load_in_low_bit='sym_int4'.
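As an illustration, switching the same model from symmetric INT4 to 4-bit NormalFloat only changes the load_in_low_bit value (a sketch; any of the other options listed above works the same way):
from ipex_llm.transformers import AutoModelForCausalLM

# load Llama 2 (7B) with NF4 (4-bit NormalFloat) weights instead of symmetric INT4
model_in_nf4 = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
    load_in_low_bit="nf4"
)
model_in_nf4_gpu = model_in_nf4.to('xpu')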
A tokenizer is also needed for LLM inference. You can use the Hugging Face transformers API to load the tokenizer directly; it can be used seamlessly with models loaded by IPEX-LLM. For Llama 2, the corresponding tokenizer class is LlamaTokenizer.
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf")
Note: If you have already downloaded the Llama 2 (7B) model and skipped step 7.1.2.2, you could specify pretrained_model_name_or_path to the model path.
You can then do model inference with IPEX-LLM optimizations on Intel GPUs in almost the same way as with the official transformers API. The only difference is to call to('xpu') on the token ids. A Q&A dialog template is created for the model to complete.
import torch

with torch.inference_mode():
    prompt = 'Q: What is CPU?\nA:'

    # tokenize the input prompt from string to token ids;
    # with .to('xpu') specifically for inference on Intel GPUs
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

    # predict the next tokens (maximum 32) based on the input token ids
    output = model_in_4bit_gpu.generate(input_ids,
                                        max_new_tokens=32)

    # decode the predicted token ids to output string
    output = output.cpu()
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)

    print('-'*20, 'Output', '-'*20)
    print(output_str)
Note: The initial generation with optimized LLMs on Intel GPUs could be slow. It is therefore advisable to perform a warm-up run before the actual generation.
For the stream chat in the next section, we can treat this generation in section 7.1.6 as the warm-up.
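If you prefer an explicit warm-up instead, a short throwaway generation is enough (a minimal sketch reusing the model and tokenizer loaded above):
import torch

# one short, discarded generation to warm up the optimized model on the Intel GPU
with torch.inference_mode():
    warmup_ids = tokenizer.encode('Q: What is CPU?\nA:', return_tensors="pt").to('xpu')
    _ = model_in_4bit_gpu.generate(warmup_ids, max_new_tokens=4)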
Now, let's build a stream chatbot that runs on Intel GPUs, allowing LLMs to engage in interactive conversations. Chatbot interaction is no magic - it still relies on the prediction and generation of next tokens by LLMs. To make LLMs chat, we need to properly format the prompts into a conversation format, for example:
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant, who always answers as helpfully as possible, while being safe.
<</SYS>>
What is AI? [/INST]
Further, to enable a multi-turn chat experience, you need to append the new dialog input to the previous conversation to make a new prompt for the model, for example:
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant, who always answers as helpfully as possible, while being safe.
<</SYS>>
What is AI? [/INST] AI is a term used to describe the development of computer systems that can perform tasks that typically require human intelligence, such as understanding natural language and recognizing images. </s><s>[INST] Is it dangerous? [/INST]
Here we show a multi-turn chat example with stream capability on IPEX-LLM optimized Llama 2 (7B) model.
First, define the conversation context format for the model to complete:
SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant, who always answers as helpfully as possible, while being safe."

def format_prompt(input_str, chat_history):
    prompt = [f'<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n']
    do_strip = False
    for history_input, history_response in chat_history:
        history_input = history_input.strip() if do_strip else history_input
        do_strip = True
        prompt.append(f'{history_input} [/INST] {history_response.strip()} </s><s>[INST] ')
    input_str = input_str.strip() if do_strip else input_str
    prompt.append(f'{input_str} [/INST]')
    return ''.join(prompt)
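For instance, calling format_prompt with a one-turn history reproduces the shape of the hand-written multi-turn prompt shown above (a quick check you can run to inspect the template; the history strings are just examples):
# inspect the prompt produced for a second-turn question
example_history = [("What is AI?", "AI is a term used to describe the development of computer systems ...")]
print(format_prompt("Is it dangerous?", example_history))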
Next, define the stream_chat function, which continuously adds model outputs to the chat history. This ensures that the conversation context can be properly formatted for the next generation of responses. Here, the response is generated in a streaming (word-by-word) way:
from transformers import TextIteratorStreamer

def stream_chat(model, tokenizer, input_str, chat_history):
    # format conversation context as prompt through chat history
    prompt = format_prompt(input_str, chat_history)
    input_ids = tokenizer([prompt], return_tensors='pt').to('xpu') # specify to('xpu') for Intel GPUs

    streamer = TextIteratorStreamer(tokenizer,
                                    skip_prompt=True, # skip prompt in the generated tokens
                                    skip_special_tokens=True)

    generate_kwargs = dict(
        input_ids,
        streamer=streamer,
        max_new_tokens=128
    )

    # to ensure non-blocking access to the generated text, the generation process should be run in a separate thread
    from threading import Thread
    thread = Thread(target=model.generate, kwargs=generate_kwargs)
    thread.start()

    output_str = []
    print("Response: ", end="")
    for stream_output in streamer:
        output_str.append(stream_output)
        print(stream_output, end="")

    # add model output to the chat history
    chat_history.append((input_str, ''.join(output_str)))
Note: To successfully observe the text streaming behavior in standard output, we need to set the environment variable PYTHONUNBUFFERED=1 to ensure that the standard output streams are sent directly to the terminal without being buffered first.
Note: The Hugging Face transformers streamer classes are still under development and may change in the future.
We can then achieve interactive, multi-turn stream chat between humans and the bot by allowing continuous user input:
chat_history = []

print('-'*20, 'Stream Chat', '-'*20, end="")
while True:
    with torch.inference_mode():
        print("\n", end="")
        user_input = input("Input: ")
        if user_input == "stop": # let's stop the conversation when the user inputs "stop"
            print("Stream Chat with Llama 2 (7B) stopped.")
            break
        stream_chat(model=model_in_4bit_gpu,
                    tokenizer=tokenizer,
                    input_str=user_input,
                    chat_history=chat_history)