forked from intel/ipex-llm
Add vLLM-XPU version's README/examples (intel#9536)

* test
* test
* fix last kv cache
* add xpu readme
* remove numactl for xpu example
* fix link error
* update max_num_batched_tokens logic
* add explanation
* add xpu environment version requirement
* refine gpu memory
* fix
* fix style

Showing 11 changed files with 287 additions and 60 deletions.

@@ -0,0 +1,109 @@

# vLLM continuous batching on Intel GPUs (experimental support)

This example demonstrates how to serve a LLaMA2-7B model using vLLM continuous batching on an Intel GPU (with BigDL-LLM low-bit optimizations).

The code shown in the following example is ported from [vLLM](https://github.com/vllm-project/vllm/tree/v0.2.1.post1).

## Example: Serving LLaMA2-7B using Intel GPU

In this example, we will run the Llama2-7b model on an Arc A770 and provide an `OpenAI-compatible` interface for users.

### 0. Environment

To use Intel GPUs for deep-learning tasks, you should install the XPU driver and the oneAPI Base Toolkit. Please check the requirements [here](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU#requirements).

After installing the toolkit, run the following commands in your environment before starting the vLLM GPU service:

```bash
source /opt/intel/oneapi/setvars.sh
# sycl-ls will list all the compatible Intel GPUs in your environment
sycl-ls

# Example output with one Arc A770:
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```

### 1. Install

To run vLLM continuous batching on Intel GPUs, install the dependencies as follows:

```bash
# First create a conda environment
conda create -n bigdl-vllm python==3.9
conda activate bigdl-vllm
# Install dependencies
pip3 install psutil
pip3 install sentencepiece  # Required for LLaMA tokenizer.
pip3 install numpy
pip3 install "transformers>=4.33.1"  # Required for Code Llama.
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip3 install fastapi
pip3 install "uvicorn[standard]"
pip3 install "pydantic<2"  # Required for OpenAI server.
```

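As an optional sanity check (this assumes the `bigdl-llm[xpu]` install pulled in `intel_extension_for_pytorch`, which registers the `xpu` device with PyTorch), you can confirm that the GPU is visible from Python:

```python
# Optional sanity check: confirm PyTorch can see the Intel GPU (XPU).
# Assumes intel_extension_for_pytorch is installed, which registers the "xpu" device.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

print("XPU available:", torch.xpu.is_available())
if torch.xpu.is_available():
    print("Device name:", torch.xpu.get_device_name(0))
```
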
### 2. Configure recommended environment variables

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

### 3. Offline inference/Service

#### Offline inference

To run offline inference with vLLM for a quick first impression, use the following example:

```bash
#!/bin/bash

# Please first modify the MODEL_PATH in offline_inference.py
python offline_inference.py
```

#### Service

To fully utilize the continuous batching feature of `vLLM`, you can send requests to the service using `curl` or other similar methods. Requests sent to the engine are batched at the token level: queries are executed in the same `forward` step of the LLM and removed as soon as they finish, instead of waiting for all sequences to complete.

```bash
#!/bin/bash
# You may also want to adjust the `--max-num-batched-tokens` argument; it sets the hard limit
# on the batched prompt length the server will accept.
python -m bigdl.llm.vllm.entrypoints.openai.api_server \
        --model /MODEL_PATH/Llama-2-7b-chat-hf/ --port 8000 \
        --load-format 'auto' --device xpu --dtype bfloat16 \
        --max-num-batched-tokens 4096
```

Then you can access the API server as follows:

```bash
curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
                "model": "/MODEL_PATH/Llama-2-7b-chat-hf-bigdl/",
                "prompt": "San Francisco is a",
                "max_tokens": 128,
                "temperature": 0
}' &
```

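If you prefer Python to `curl`, a minimal client sketch using the `requests` library could look like the following (the model path below is a placeholder and should match the `--model` argument used when starting the server):

```python
# Minimal client for the OpenAI-compatible /v1/completions endpoint.
# The model path below is a placeholder; use the same path passed to --model.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": "/MODEL_PATH/Llama-2-7b-chat-hf/",
        "prompt": "San Francisco is a",
        "max_tokens": 128,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["text"])
```
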
### 4. (Optional) Add a new model

Currently, only LLaMA-family models (including `llama`, `vicuna`, `llama-2`, etc.) are supported. To use another model, you may need to add some adaptations.

#### 4.1 Add model code

Create or clone the PyTorch model code in `BigDL/python/llm/src/bigdl/llm/vllm/model_executor/models`.

#### 4.2 Rewrite the forward methods

Referring to `BigDL/python/llm/src/bigdl/llm/vllm/model_executor/models/bigdl_llama.py`, it is necessary to maintain a `kv_cache`, which is a nested list of dictionaries mapping each `req_id` to a three-dimensional tensor **(the structure may vary across models)**. Before the model's actual `forward` method, prepare a `past_key_values` according to the current `req_id`; afterwards, update the `kv_cache` with `output.past_key_values`. The cache entries are cleared when the request is finished. A rough sketch of this bookkeeping is shown below.

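As a rough illustration only (the helper names `prepare_past_key_values`, `update_kv_cache`, and `finish_request`, as well as the exact cache layout, are hypothetical rather than BigDL's actual API; consult `bigdl_llama.py` for the real structure), the bookkeeping pattern could look like this:

```python
# Hypothetical sketch of the kv_cache bookkeeping described above.
# Helper names and the exact cache layout are illustrative only;
# see bigdl_llama.py for the real implementation.
from typing import Dict, List

import torch

NUM_LAYERS = 32  # e.g. Llama-2-7B; purely illustrative

# kv_cache[layer][0] holds keys, kv_cache[layer][1] holds values,
# each as a dict mapping req_id -> tensor.
kv_cache: List[List[Dict[int, torch.Tensor]]] = [[{}, {}] for _ in range(NUM_LAYERS)]


def prepare_past_key_values(req_ids: List[int]):
    """Gather cached key/value tensors for the requests in the current batch."""
    past_key_values = []
    for layer_cache in kv_cache:
        keys = torch.stack([layer_cache[0][r] for r in req_ids])
        values = torch.stack([layer_cache[1][r] for r in req_ids])
        past_key_values.append((keys, values))
    return past_key_values


def update_kv_cache(req_ids: List[int], output_past_key_values) -> None:
    """Write the new key/value tensors from the forward pass back into the cache."""
    for layer_idx, (keys, values) in enumerate(output_past_key_values):
        for batch_idx, r in enumerate(req_ids):
            kv_cache[layer_idx][0][r] = keys[batch_idx]
            kv_cache[layer_idx][1][r] = values[batch_idx]


def finish_request(req_id: int) -> None:
    """Drop the cached tensors once a request has finished."""
    for layer_cache in kv_cache:
        layer_cache[0].pop(req_id, None)
        layer_cache[1].pop(req_id, None)
```
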
#### 4.3 Register new model

Finally, register your `*ForCausalLM` class in the `_MODEL_REGISTRY` in `BigDL/python/llm/src/bigdl/llm/vllm/model_executor/model_loader.py`, as sketched below.
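
The registration itself is typically a one-line addition. A hedged sketch follows; the module path `bigdl_mymodel` and the class `BigDLMyModelForCausalLM` are made-up placeholder names, and the exact keys and shape of `_MODEL_REGISTRY` should be checked against the real `model_loader.py`:

```python
# Hypothetical sketch of registering a new model in model_loader.py.
# BigDLMyModelForCausalLM and its module path are placeholder names.
from bigdl.llm.vllm.model_executor.models.bigdl_mymodel import BigDLMyModelForCausalLM  # hypothetical

_MODEL_REGISTRY = {
    # ... existing entries for the LLaMA family ...
    "MyModelForCausalLM": BigDLMyModelForCausalLM,  # your new model
}
```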

@@ -0,0 +1,57 @@

#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Some parts of this file are adapted from
# https://github.com/vllm-project/vllm/blob/v0.2.1.post1/examples/offline_inference.py
# which is licensed under Apache License 2.0
#
# Copyright 2023 The vLLM team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from bigdl.llm.vllm.entrypoints.llm import LLM
from bigdl.llm.vllm.sampling_params import SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
# llm = LLM(model="facebook/opt-125m")
llm = LLM(model="YOUR_MODEL_PATH", dtype="bfloat16", device="xpu")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")