[UPDATE] add prefix caching document into vllm_docker_quickstart.md (#12173)

* [ADD] rewrite new vllm docker quick start

* [ADD] lora adapter doc finished

* [ADD] multi lora adapter test successfully

* [ADD] add ipex-llm quantization doc

* [Merge] rebase main

* [REMOVE] rm tmp file

* [Merge] rebase main

* [ADD] add prefix caching experiment and result

* [REMOVE] rm cpu offloading chapter
ACupofAir authored Oct 11, 2024
1 parent ddcdf47 commit 6ffaec6
Showing 1 changed file with 102 additions and 4 deletions.
106 changes: 102 additions & 4 deletions docs/mddocs/DockerGuides/vllm_docker_quickstart.md
@@ -65,7 +65,7 @@ We have included multiple vLLM-related files in `/llm/`:

|parameters|explanation|
|:---|:---|
|`model="YOUR_MODEL"`| the model path in docker, for example "/llm/models/Llama-2-7b-chat-hf"|
|`model="YOUR_MODEL"`| the model path in docker, for example `"/llm/models/Llama-2-7b-chat-hf"`|
|`load_in_low_bit="fp8"`| model quantization precision; accepted values are ``'sym_int4'``, ``'asym_int4'``, ``'fp6'``, ``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``, and ``'fp16'``. ``'sym_int4'`` means symmetric int4, ``'asym_int4'`` means asymmetric int4, and so on. The corresponding low-bit optimizations are applied to the model. The default is ``'fp8'``, which is the same as ``'fp8_e5m2'``|
|`tensor_parallel_size=1`| number of tensor parallel replicas, default is `1`|
|`pipeline_parallel_size=1`| number of pipeline stages, default is `1`|
@@ -382,7 +382,107 @@ curl http://localhost:8000/v1/chat/completions \
{"id":"chat-0c8ea64a2f8e42d9a8f352c160972455","object":"chat.completion","created":1728373105,"model":"MiniCPM-V-2_6","choices":[{"index":0,"message":{"role":"assistant","content":"这幅图片展示了一个小孩,可能是女孩,根据服装和发型来判断。她穿着一件有红色和白色条纹的连衣裙,一个可见的白色蝴蝶结,以及一个白色的 头饰,上面有红色的点缀。孩子右手拿着一个白色泰迪熊,泰迪熊穿着一个粉色的裙子,带有褶边,它的左脸颊上有一个红色的心形图案。背景模糊,但显示出一个自然户外的环境,可能是一个花园或庭院,有红花和石头墙。阳光照亮了整个场景,暗示这可能是正午或下午。整体氛围是欢乐和天真。","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":225,"total_tokens":353,"completion_tokens":128}}
```

#### Prefix Caching[todo]
#### Prefix Caching

Automatic Prefix Caching (APC for short) caches the KV cache of existing queries, so that a new query sharing a prefix with one of them can reuse that KV cache directly and skip the computation of the shared part.
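
The reuse idea can be illustrated with a toy sketch. This is not vLLM's implementation: `ToyPrefixCache`, `BLOCK_SIZE`, and the string placeholders standing in for KV data are hypothetical names chosen for illustration, and real engines hash blocks rather than storing whole prefixes as keys.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per cache block (illustrative value, not a vLLM default)


class ToyPrefixCache:
    """Toy model of prefix caching keyed by full token-prefix blocks."""

    def __init__(self) -> None:
        # key: the whole token prefix up to the end of a block
        # value: a placeholder string standing in for that block's KV data
        self.blocks: Dict[Tuple[int, ...], str] = {}

    def process(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (reused_tokens, computed_tokens) for one query."""
        reused = 0
        still_hitting = True
        n_full = len(tokens) // BLOCK_SIZE * BLOCK_SIZE
        for end in range(BLOCK_SIZE, n_full + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])
            if still_hitting and key in self.blocks:
                reused = end  # this block's KV data is already cached
            else:
                still_hitting = False
                self.blocks[key] = f"kv[{end - BLOCK_SIZE}:{end}]"
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
shared_prefix = list(range(12))  # stands in for the tokens of a long shared prompt
first = cache.process(shared_prefix + [100, 101])
second = cache.process(shared_prefix + [200, 201, 202])
print(first, second)
```

In this sketch the first query computes every token, while the second only computes the tokens that follow the shared, block-aligned prefix.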

1. Set `enable_prefix_caching=True` in the vLLM engine to enable APC. The following example Python script shows how much time APC saves:

```python
import time
from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM
# A prompt containing a large markdown table. The table is randomly generated by GPT-4.
LONG_PROMPT = "You are a helpful assistant in recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n" + """
| ID | Name | Age | Occupation | Country | Email | Phone Number | Address |
|-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
| 1 | John Doe | 29 | Engineer | USA | [email protected] | 555-1234 | 123 Elm St, Springfield, IL |
| 2 | Jane Smith | 34 | Doctor | Canada | [email protected] | 555-5678 | 456 Oak St, Toronto, ON |
| 3 | Alice Johnson | 27 | Teacher | UK | [email protected] | 555-8765 | 789 Pine St, London, UK |
| 4 | Bob Brown | 45 | Artist | Australia | [email protected] | 555-4321 | 321 Maple St, Sydney, NSW |
| 5 | Carol White | 31 | Scientist | New Zealand | [email protected] | 555-6789 | 654 Birch St, Wellington, NZ |
| 6 | Dave Green | 28 | Lawyer | Ireland | [email protected] | 555-3456 | 987 Cedar St, Dublin, IE |
| 7 | Emma Black | 40 | Musician | USA | [email protected] | 555-1111 | 246 Ash St, New York, NY |
| 8 | Frank Blue | 37 | Chef | Canada | [email protected] | 555-2222 | 135 Spruce St, Vancouver, BC |
| 9 | Grace Yellow | 50 | Engineer | UK | [email protected] | 555-3333 | 864 Fir St, Manchester, UK |
| 10 | Henry Violet | 32 | Artist | Australia | [email protected] | 555-4444 | 753 Willow St, Melbourne, VIC|
| 11 | Irene Orange | 26 | Scientist | New Zealand | [email protected] | 555-5555 | 912 Poplar St, Auckland, NZ |
| 12 | Jack Indigo | 38 | Teacher | Ireland | [email protected] | 555-6666 | 159 Elm St, Cork, IE |
| 13 | Karen Red | 41 | Lawyer | USA | [email protected] | 555-7777 | 357 Cedar St, Boston, MA |
| 14 | Leo Brown | 30 | Chef | Canada | [email protected] | 555-8888 | 246 Oak St, Calgary, AB |
| 15 | Mia Green | 33 | Musician | UK | [email protected] | 555-9999 | 975 Pine St, Edinburgh, UK |
| 16 | Noah Yellow | 29 | Doctor | Australia | [email protected] | 555-0000 | 864 Birch St, Brisbane, QLD |
| 17 | Olivia Blue | 35 | Engineer | New Zealand | [email protected] | 555-1212 | 753 Maple St, Hamilton, NZ |
| 18 | Peter Black | 42 | Artist | Ireland | [email protected] | 555-3434 | 912 Fir St, Limerick, IE |
| 19 | Quinn White | 28 | Scientist | USA | [email protected] | 555-5656 | 159 Willow St, Seattle, WA |
| 20 | Rachel Red | 31 | Teacher | Canada | [email protected] | 555-7878 | 357 Poplar St, Ottawa, ON |
| 21 | Steve Green | 44 | Lawyer | UK | [email protected] | 555-9090 | 753 Elm St, Birmingham, UK |
| 22 | Tina Blue | 36 | Musician | Australia | [email protected] | 555-1213 | 864 Cedar St, Perth, WA |
| 23 | Umar Black | 39 | Chef | New Zealand | [email protected] | 555-3435 | 975 Spruce St, Christchurch, NZ|
| 24 | Victor Yellow | 43 | Engineer | Ireland | [email protected] | 555-5657 | 246 Willow St, Galway, IE |
| 25 | Wendy Orange | 27 | Artist | USA | [email protected] | 555-7879 | 135 Elm St, Denver, CO |
| 26 | Xavier Green | 34 | Scientist | Canada | [email protected] | 555-9091 | 357 Oak St, Montreal, QC |
| 27 | Yara Red | 41 | Teacher | UK | [email protected] | 555-1214 | 975 Pine St, Leeds, UK |
| 28 | Zack Blue | 30 | Lawyer | Australia | [email protected] | 555-3436 | 135 Birch St, Adelaide, SA |
| 29 | Amy White | 33 | Musician | New Zealand | [email protected] | 555-5658 | 159 Maple St, Wellington, NZ |
| 30 | Ben Black | 38 | Chef | Ireland | [email protected] | 555-7870 | 246 Fir St, Waterford, IE |
"""


def get_generation_time(llm, sampling_params, prompts):
    # time the generation
    start_time = time.time()
    output = llm.generate(prompts, sampling_params=sampling_params)
    end_time = time.time()
    # print the output and generation time
    print(f"Output: {output[0].outputs[0].text}")
    print(f"Generation time: {end_time - start_time} seconds.")


# set enable_prefix_caching=True to enable APC
llm = LLM(model='/llm/models/Llama-2-7b-chat-hf',
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          load_in_low_bit="fp8",
          tensor_parallel_size=1,
          max_model_len=2000,
          max_num_batched_tokens=2000,
          enable_prefix_caching=True)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Querying the age of John Doe
get_generation_time(
    llm,
    sampling_params,
    LONG_PROMPT + "Question: what is the age of John Doe? Your answer: The age of John Doe is ",
)

# Querying the age of Zack Blue
# This query will be faster since vllm avoids computing the KV cache of LONG_PROMPT again.
get_generation_time(
    llm,
    sampling_params,
    LONG_PROMPT + "Question: what is the age of Zack Blue? Your answer: The age of Zack Blue is ",
)
```

2. The expected output is shown below. APC greatly reduces the generation time of the second question, which refers to the same table.

```bash
INFO 10-09 15:43:21 block_manager_v1.py:247] Automatic prefix caching is enabled.
Processed prompts: 100%|█████████████████████████████████████████████████| 1/1 [00:21<00:00, 21.97s/it, est. speed input: 84.57 toks/s, output: 0.73 toks/s]
Output: 29.
Question: What is the occupation of Jane Smith? Your answer
Generation time: 21.972806453704834 seconds.
Processed prompts: 100%|██████████████████████████████████████████████| 1/1 [00:00<00:00, 1.04it/s, est. speed input: 1929.67 toks/s, output: 16.63 toks/s]
Output: 30.
Generation time: 0.9657604694366455 seconds.
```
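
As a quick check on the log above, the two reported generation times can be compared directly. This is only arithmetic on the numbers from this single run; the variable names are illustrative, and the first measurement may include one-time costs beyond prompt processing, so the ratio should be read as indicative rather than as a general benchmark.

```python
# Generation times copied from the log above (seconds).
first_query_s = 21.972806453704834   # first query on LONG_PROMPT
second_query_s = 0.9657604694366455  # second query sharing the same prefix

speedup = first_query_s / second_query_s
saved_s = first_query_s - second_query_s
print(f"Speedup: {speedup:.1f}x, time saved: {saved_s:.2f} s")
```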

#### LoRA Adapter

@@ -466,8 +566,6 @@ python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
```
#### Cpu Offloading[todo]
### Validated Models List
| models (fp8) | gpus |