
<h3 align="center">
KV cache compression for high-throughput LLM inference
</h3>
<p align="center">
<a href="https://arxiv.org/abs/2410.00161">Paper</a> |
<a href="https://blog.cloudflare.com/workers-ai/making-workers-ai-faster/">Blog</a> |
<a href="https://x.com/isaacrehg">X/Twitter</a>
</p>

---

This is a (messy) fork of [vLLM](https://github.com/vllm-project/vllm) v0.6.0 showcasing our new KV cache compression method that increases throughput for memory-constrained LLM deployments.

<img src="experiments/out-8b/longbench_score_by_cr.jpg" width=300px><img src="experiments/out-8b/throughtput_by_cr.jpg" width=300px>

## Setup

It is recommended to run within the NVIDIA PyTorch image:

```
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:24.04-py3
```

Install from source:
```
cd vllm-kvcompress/
pip install -e .
```

Alternatively, the prebuilt wheel can be used for x86 architectures:

```
pip install https://pub-ff08b7559526447fb14dd52ec4fac7c7.r2.dev/17da8eb/build/sm_89/vllm-0.6.0%2Bcu124-cp310-cp310-linux_x86_64.whl
```
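Optionally, the install can be sanity-checked with a quick Python snippet (this is not part of the upstream setup steps; it only confirms that the package imports and a CUDA device is visible):

```python
# Quick sanity check of the install (illustrative only, not from the repo docs).
import torch
import vllm

print(vllm.__version__)           # should report a 0.6.0-based version for this fork
print(torch.cuda.is_available())  # the compression kernels require a CUDA GPU
```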

## Inference

The inference server can be launched with:
```
export model=meta-llama/Meta-Llama-3.1-8B-Instruct
vllm serve $model --enforce-eager --enable-kvc
```

Requests can then be sent with:
```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0,
        "max_cache_tokens": 128,
        "protected_window_size": 32,
        "compress_once": false
    }'
```

Compression can be configured per-request by setting the following additional sampling parameters:

- **max_cache_tokens** - The maximum number of KVs to retain in cache for this sequence, computed as `num_layers * num_kv_heads * max_cache_tokens`.
- **protected_window_size** - The window of final tokens for this sequence whose KVs will be protected during compression.
- **compress_once** - If set, the sequence will only be compressed during the first compression iteration after its prefill.
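
For example, with Llama-3.1-8B-Instruct (32 layers, 8 KV heads), `max_cache_tokens: 128` corresponds to a budget of 32 * 8 * 128 = 32,768 KVs for the sequence. The same request can also be issued from Python; the sketch below assumes the standard `openai` client pointed at the server started above, with the compression settings passed through `extra_body` so they are forwarded as extra JSON fields alongside the usual sampling parameters.

```python
# Sketch: the curl request above, issued via the OpenAI Python client.
# Assumes `pip install openai` and the vLLM server running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
    # The compression settings are not part of the OpenAI schema, so they are
    # sent as extra fields in the request body.
    extra_body={
        "max_cache_tokens": 128,      # total KV budget: num_layers * num_kv_heads * 128
        "protected_window_size": 32,  # protect KVs of the final 32 tokens during compression
        "compress_once": False,       # allow compression beyond the first post-prefill iteration
    },
)
print(completion.choices[0].text)
```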


## Running Experiments

### LongBench
```
cd vllm-kvcompress/experiments/
```

To run experiments with a limited observation window (*KVC-w*):
```
export dataset=narrativeqa model=llama3 w=8 cache_size=128
python run_longbench.py \
    --dataset $dataset \
    --model $model \
    --protected-window-size $w \
    --prefill-metric-collection-window-size $w \
    --max-cache-tokens $cache_size
```

To run experiments with full query-range aggregation (*KVC-full*):
```
python run_longbench.py \
    --dataset $dataset \
    --model $model \
    --protected-window-size 32 \
    --metric-collection-buffer-size 10 \
    --prefill-metric-collection-window-size 33000 \
    --prefill-metric-collection-block-size 1024 \
    --no-maxpool-metrics \
    --gpu-mem-util 0.6 \
    --max-cache-tokens $cache_size
```
**Note:** Aggregating over the full query range requires significant memory and should be run on an H100 or comparable GPU to avoid OOMs. Lowering `gpu-mem-util` leaves more GPU memory available for the aggregation, and lowering `prefill-metric-collection-block-size` reduces the memory the aggregation requires, at the expense of longer execution time.

Experiments can be run with continual compression (compressing during decoding as well as on prefill) by adding the `--continual-compression` flag. To reproduce results in the paper, `--compression-rate` can be used to limit cache size instead of `--max-cache-tokens`:

```
export cr=64
python run_longbench.py \
    --dataset $dataset \
    --model $model \
    --protected-window-size $w \
    --prefill-metric-collection-window-size $w \
    --continual-compression \
    --compression-rate $cr
```

The run scripts used for our experiments can be found in `experiments/scripts`.

### Benchmark Throughput

```
cd vllm-kvcompress/
```
Run vLLM's benchmarking script with:
```
export model=meta-llama/Meta-Llama-3.1-8B-Instruct \
    max_model_len=19000 input_len=6000 cr=64
python3 benchmarks/benchmark_throughput.py \
    --model $model \
    --max-model-len $max_model_len \
    --enforce-eager \
    --num-prompts 256 \
    --input-len $input_len \
    --output-len 500 \
    --protected-window-size 32 \
    --compression-rate $cr \
    --enable-kvc
```

The run scripts used for our experiments can be found in `benchmarks/scripts`.

## Citation

If you use this work in your own research or projects, please cite our [paper](https://arxiv.org/abs/2410.00161):
```bibtex
@misc{rehg2024kvcompresspagedkvcachecompression,
      title={KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head},
      author={Isaac Rehg},
      year={2024},
      eprint={2410.00161},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.00161},
}
```
