Commit: [UPDATE] add prefix caching document into vllm_docker_quickstart.md (#12173)

* [ADD] rewrite new vllm docker quick start
* [ADD] lora adapter doc finished
* [ADD] multi lora adapter test successfully
* [ADD] add ipex-llm quantization doc
* [Merge] rebase main
* [REMOVE] rm tmp file
* [Merge] rebase main
* [ADD] add prefix caching experiment and result
* [REMOVE] rm cpu offloading chapter
Showing 1 changed file with 102 additions and 4 deletions.
@@ -65,7 +65,7 @@ We have included multiple vLLM-related files in `/llm/`:
|parameters|explanation|
|:---|:---|
|`model="YOUR_MODEL"`| the model path inside the docker container, for example `"/llm/models/Llama-2-7b-chat-hf"`|
|`load_in_low_bit="fp8"`| model quantization precision; accepted values are `'sym_int4'`, `'asym_int4'`, `'fp6'`, `'fp8'`, `'fp8_e4m3'`, `'fp8_e5m2'`, `'fp16'`. `'sym_int4'` means symmetric int4, `'asym_int4'` means asymmetric int4, etc. The corresponding low-bit optimizations are applied to the model. Default is `'fp8'`, which is the same as `'fp8_e5m2'`|
|`tensor_parallel_size=1`| number of tensor parallel replicas, default is `1`|
|`pipeline_parallel_size=1`| number of pipeline stages, default is `1`|
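To make the difference between the `'sym_int4'` and `'asym_int4'` options concrete, here is a small conceptual toy in plain Python. This is not the ipex-llm implementation; it only illustrates why an asymmetric scheme (scale plus zero point) can reconstruct skewed weight distributions more accurately than a symmetric one:

```python
# Conceptual toy: symmetric vs. asymmetric int4 quantization.
# NOT the ipex-llm kernels -- just an illustration of the two modes
# selectable via `load_in_low_bit`.

def sym_int4_quantize(values):
    # Symmetric: a single scale, zero maps to zero, int range [-8, 7].
    scale = max(abs(v) for v in values) / 7.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    dequant = [x * scale for x in q]
    return q, dequant

def asym_int4_quantize(values):
    # Asymmetric: scale plus zero point, int range [0, 15].
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15.0
    zero_point = round(-lo / scale)
    q = [max(0, min(15, round(v / scale) + zero_point)) for v in values]
    dequant = [(x - zero_point) * scale for x in q]
    return q, dequant

# All-positive values: poorly centered for the symmetric scheme.
weights = [0.1, 0.5, 1.0, 1.5]
_, sym = sym_int4_quantize(weights)
_, asym = asym_int4_quantize(weights)
sym_err = sum((a - b) ** 2 for a, b in zip(weights, sym))
asym_err = sum((a - b) ** 2 for a, b in zip(weights, asym))
print(asym_err < sym_err)  # → True
```

For weights skewed away from zero, the asymmetric variant spends its 16 levels on the actual value range rather than a zero-centered one, at the cost of storing an extra zero point per group.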
@@ -382,7 +382,107 @@ curl http://localhost:8000/v1/chat/completions \
{"id":"chat-0c8ea64a2f8e42d9a8f352c160972455","object":"chat.completion","created":1728373105,"model":"MiniCPM-V-2_6","choices":[{"index":0,"message":{"role":"assistant","content":"This image shows a young child, likely a girl judging by the clothing and hairstyle. She wears a dress with red and white stripes, a visible white bow, and a white headpiece with red accents. In her right hand the child holds a white teddy bear; the bear wears a pink ruffled dress and has a red heart on its left cheek. The background is blurred but suggests a natural outdoor setting, possibly a garden or courtyard, with red flowers and a stone wall. Sunlight illuminates the scene, suggesting midday or afternoon. The overall mood is joyful and innocent.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":225,"total_tokens":353,"completion_tokens":128}}
```
#### Prefix Caching

Automatic Prefix Caching (APC for short) caches the KV cache of existing queries, so that a new query sharing the same prefix as an existing query can directly reuse that KV cache and skip recomputing the shared part.
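The caching idea above can be sketched as a toy memoization keyed by the shared prefix. This is a conceptual illustration only — vLLM actually caches per-block KV tensors, not strings:

```python
# Toy illustration of automatic prefix caching: an expensive "encode"
# step is memoized per prefix, so a second query that shares the prefix
# skips the heavy computation. Conceptual only -- vLLM caches KV blocks.

prefix_cache = {}

def encode(text):
    # Stand-in for the expensive prefill computation.
    return sum(ord(c) for c in text)

def run_query(prefix, question):
    if prefix not in prefix_cache:      # cache miss: full prefill
        prefix_cache[prefix] = encode(prefix)
        hit = False
    else:                               # cache hit: reuse cached prefix
        hit = True
    # Only the new suffix needs to be encoded on a hit.
    return prefix_cache[prefix] + encode(question), hit

table = "| ID | Name | Age | ... |"  # long shared prompt prefix
_, first_hit = run_query(table, "what is the age of John Doe?")
_, second_hit = run_query(table, "what is the age of Zack Blue?")
print(first_hit, second_hit)  # → False True
```

The longer the shared prefix relative to the new suffix, the larger the saving — which is why the long-table example below benefits so much.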

1. Set `enable_prefix_caching=True` in the vLLM engine to enable APC. Below is an example Python script showing the time reduction from APC:

```python
import time
from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

# A prompt containing a large markdown table. The table is randomly generated by GPT-4.
LONG_PROMPT = "You are a helpful assistant that recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n" + """
| ID | Name | Age | Occupation | Country | Email | Phone Number | Address |
|-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
| 1 | John Doe | 29 | Engineer | USA | [email protected] | 555-1234 | 123 Elm St, Springfield, IL |
| 2 | Jane Smith | 34 | Doctor | Canada | [email protected] | 555-5678 | 456 Oak St, Toronto, ON |
| 3 | Alice Johnson | 27 | Teacher | UK | [email protected] | 555-8765 | 789 Pine St, London, UK |
| 4 | Bob Brown | 45 | Artist | Australia | [email protected] | 555-4321 | 321 Maple St, Sydney, NSW |
| 5 | Carol White | 31 | Scientist | New Zealand | [email protected] | 555-6789 | 654 Birch St, Wellington, NZ |
| 6 | Dave Green | 28 | Lawyer | Ireland | [email protected] | 555-3456 | 987 Cedar St, Dublin, IE |
| 7 | Emma Black | 40 | Musician | USA | [email protected] | 555-1111 | 246 Ash St, New York, NY |
| 8 | Frank Blue | 37 | Chef | Canada | [email protected] | 555-2222 | 135 Spruce St, Vancouver, BC |
| 9 | Grace Yellow | 50 | Engineer | UK | [email protected] | 555-3333 | 864 Fir St, Manchester, UK |
| 10 | Henry Violet | 32 | Artist | Australia | [email protected] | 555-4444 | 753 Willow St, Melbourne, VIC|
| 11 | Irene Orange | 26 | Scientist | New Zealand | [email protected] | 555-5555 | 912 Poplar St, Auckland, NZ |
| 12 | Jack Indigo | 38 | Teacher | Ireland | [email protected] | 555-6666 | 159 Elm St, Cork, IE |
| 13 | Karen Red | 41 | Lawyer | USA | [email protected] | 555-7777 | 357 Cedar St, Boston, MA |
| 14 | Leo Brown | 30 | Chef | Canada | [email protected] | 555-8888 | 246 Oak St, Calgary, AB |
| 15 | Mia Green | 33 | Musician | UK | [email protected] | 555-9999 | 975 Pine St, Edinburgh, UK |
| 16 | Noah Yellow | 29 | Doctor | Australia | [email protected] | 555-0000 | 864 Birch St, Brisbane, QLD |
| 17 | Olivia Blue | 35 | Engineer | New Zealand | [email protected] | 555-1212 | 753 Maple St, Hamilton, NZ |
| 18 | Peter Black | 42 | Artist | Ireland | [email protected] | 555-3434 | 912 Fir St, Limerick, IE |
| 19 | Quinn White | 28 | Scientist | USA | [email protected] | 555-5656 | 159 Willow St, Seattle, WA |
| 20 | Rachel Red | 31 | Teacher | Canada | [email protected] | 555-7878 | 357 Poplar St, Ottawa, ON |
| 21 | Steve Green | 44 | Lawyer | UK | [email protected] | 555-9090 | 753 Elm St, Birmingham, UK |
| 22 | Tina Blue | 36 | Musician | Australia | [email protected] | 555-1213 | 864 Cedar St, Perth, WA |
| 23 | Umar Black | 39 | Chef | New Zealand | [email protected] | 555-3435 | 975 Spruce St, Christchurch, NZ|
| 24 | Victor Yellow | 43 | Engineer | Ireland | [email protected] | 555-5657 | 246 Willow St, Galway, IE |
| 25 | Wendy Orange | 27 | Artist | USA | [email protected] | 555-7879 | 135 Elm St, Denver, CO |
| 26 | Xavier Green | 34 | Scientist | Canada | [email protected] | 555-9091 | 357 Oak St, Montreal, QC |
| 27 | Yara Red | 41 | Teacher | UK | [email protected] | 555-1214 | 975 Pine St, Leeds, UK |
| 28 | Zack Blue | 30 | Lawyer | Australia | [email protected] | 555-3436 | 135 Birch St, Adelaide, SA |
| 29 | Amy White | 33 | Musician | New Zealand | [email protected] | 555-5658 | 159 Maple St, Wellington, NZ |
| 30 | Ben Black | 38 | Chef | Ireland | [email protected] | 555-7870 | 246 Fir St, Waterford, IE |
"""

def get_generation_time(llm, sampling_params, prompts):
    # time the generation
    start_time = time.time()
    output = llm.generate(prompts, sampling_params=sampling_params)
    end_time = time.time()
    # print the output and generation time
    print(f"Output: {output[0].outputs[0].text}")
    print(f"Generation time: {end_time - start_time} seconds.")

# set enable_prefix_caching=True to enable APC
llm = LLM(model='/llm/models/Llama-2-7b-chat-hf',
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          load_in_low_bit="fp8",
          tensor_parallel_size=1,
          max_model_len=2000,
          max_num_batched_tokens=2000,
          enable_prefix_caching=True)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Querying the age of John Doe
get_generation_time(
    llm,
    sampling_params,
    LONG_PROMPT + "Question: what is the age of John Doe? Your answer: The age of John Doe is ",
)

# Querying the age of Zack Blue
# This query will be faster since vLLM avoids computing the KV cache of LONG_PROMPT again.
get_generation_time(
    llm,
    sampling_params,
    LONG_PROMPT + "Question: what is the age of Zack Blue? Your answer: The age of Zack Blue is ",
)
```

2. The expected output is shown below. APC greatly reduces the generation time of the second question about the same table:

```bash
INFO 10-09 15:43:21 block_manager_v1.py:247] Automatic prefix caching is enabled.
Processed prompts: 100%|█████████████████████████████████████████████████| 1/1 [00:21<00:00, 21.97s/it, est. speed input: 84.57 toks/s, output: 0.73 toks/s]
Output: 29.
Question: What is the occupation of Jane Smith? Your answer
Generation time: 21.972806453704834 seconds.
Processed prompts: 100%|██████████████████████████████████████████████| 1/1 [00:00<00:00, 1.04it/s, est. speed input: 1929.67 toks/s, output: 16.63 toks/s]
Output: 30.
Generation time: 0.9657604694366455 seconds.
```
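The size of the saving can be read straight off the two timings in the log. A minimal check, using the illustrative values printed above:

```python
# Speedup of the prefix-cached second query, computed from the
# sample timings in the log above (values are illustrative).
first_run = 21.972806453704834   # seconds, cache cold
second_run = 0.9657604694366455  # seconds, prefix KV cache reused
speedup = first_run / second_run
print(f"APC speedup: {speedup:.1f}x")  # roughly 22.8x
```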

#### LoRA Adapter

@@ -466,8 +566,6 @@ python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
```

### Validated Models List

| models (fp8) | gpus |