From 278d9fed70249212e12a5210244847bdeeb29bfe Mon Sep 17 00:00:00 2001
From: Isaac Rehg
Date: Wed, 2 Oct 2024 10:49:25 -0500
Subject: [PATCH] update README

---
 README.md | 156 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 138 insertions(+), 18 deletions(-)

diff --git a/README.md b/README.md
index 5c7741ef..d63bbd18 100644
--- a/README.md
+++ b/README.md
@@ -5,34 +5,154 @@
+

-KV cache compression on our favorite inference server
+KV cache compression for high-throughput LLM inference

+

+Paper |
+Blog |
+X/Twitter
+

 ---
-This is a (messy) fork of vLLM v0.6.0 showcasing our new KV cache compression method that can achieve up to 5.18x throughput for single-instance deployments. Stay tuned for updates!
+This is a (messy) fork of [vLLM](https://github.com/vllm-project/vllm) v0.6.0 showcasing our new KV cache compression method that increases throughput for memory-constrained LLM deployments.
-
+
+
+## Setup
+
+It is recommended to run within the NVIDIA PyTorch image:
+
+```
+docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:24.04-py3
+```
+
+Install from source:
+```
+cd vllm-kvcompress/
+pip install -e .
+```
+
+Alternatively, a prebuilt wheel can be used for x86 architectures:
+
+```
+pip install https://pub-ff08b7559526447fb14dd52ec4fac7c7.r2.dev/17da8eb/build/sm_89/vllm-0.6.0%2Bcu124-cp310-cp310-linux_x86_64.whl
+```
+
+## Inference
+
+The inference server can be launched with:
+```
+export model=meta-llama/Meta-Llama-3.1-8B-Instruct
+vllm serve $model --enforce-eager --enable-kvc
+```
+
+Requests can then be sent with:
+```
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+        "prompt": "San Francisco is a",
+        "max_tokens": 7,
+        "temperature": 0,
+        "max_cache_tokens": 128,
+        "protected_window_size": 32,
+        "compress_once": false
+    }'
+```
+
+Compression can be configured per-request by setting the following additional sampling parameters:
+
+ - **max_cache_tokens** - Caps the number of KVs retained in cache for this sequence; the total KV budget is computed as `num_layers * num_kv_heads * max_cache_tokens`.
+ - **protected_window_size** - The window of final tokens in this sequence whose KVs are protected during compression.
+ - **compress_once** - If set, the sequence is only compressed during the first compression iteration after its prefill.
+
+
+## Running Experiments
+
+### LongBench
+```
+cd vllm-kvcompress/experiments/
+```
+
+To run experiments with a limited observation window (*KVC-w*):
+```
+export dataset=narrativeqa model=llama3 w=8 cache_size=128
+python run_longbench.py \
+    --dataset $dataset \
+    --model $model \
+    --protected-window-size $w \
+    --prefill-metric-collection-window-size $w \
+    --max-cache-tokens $cache_size
+```
+
+To run experiments with full query-range aggregation (*KVC-full*):
+```
+python run_longbench.py \
+    --dataset $dataset \
+    --model $model \
+    --protected-window-size 32 \
+    --metric-collection-buffer-size 10 \
+    --prefill-metric-collection-window-size 33000 \
+    --prefill-metric-collection-block-size 1024 \
+    --no-maxpool-metrics \
+    --gpu-mem-util 0.6 \
+    --max-cache-tokens $cache_size
+```
+**Note:** Aggregating over the full query range requires significant memory and should be run on an H100 or comparable GPU to avoid OOMs. Lowering `gpu-mem-util` leaves more GPU memory free for the aggregation, and lowering `prefill-metric-collection-block-size` reduces the memory the aggregation requires, at the expense of longer execution time.
+
+Experiments can be run with continual compression (compressing during decoding as well as at prefill) by adding the `--continual-compression` flag. To reproduce results from the paper, `--compression-rate` can be used to limit cache size instead of `--max-cache-tokens`:
+
+```
+export cr=64
+python run_longbench.py \
+    --dataset $dataset \
+    --model $model \
+    --protected-window-size $w \
+    --prefill-metric-collection-window-size $w \
+    --continual-compression \
+    --compression-rate $cr
+```
+
+Run scripts used for our experiments can be found in `experiments/scripts`.
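+
+For reference, a sweep over multiple datasets and cache budgets can be scripted along the following lines. This is only a minimal sketch: the dataset names and cache sizes below are illustrative and may differ from the configurations in `experiments/scripts`.
+
+```
+#!/bin/bash
+# Illustrative sweep of the KVC-w setting over a few LongBench datasets
+# and cache budgets (--max-cache-tokens); adjust the lists as needed.
+model=llama3
+w=8
+for dataset in narrativeqa qasper triviaqa; do
+    for cache_size in 128 256 512; do
+        python run_longbench.py \
+            --dataset $dataset \
+            --model $model \
+            --protected-window-size $w \
+            --prefill-metric-collection-window-size $w \
+            --max-cache-tokens $cache_size
+    done
+done
+```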
+
+### Benchmark Throughput
+
+```
+cd vllm-kvcompress/
+```
+Run vLLM's benchmarking script with:
+```
+export model=meta-llama/Meta-Llama-3.1-8B-Instruct \
+    max_model_len=19000 input_len=6000 cr=64
+python3 benchmarks/benchmark_throughput.py \
+    --model $model \
+    --max-model-len $max_model_len \
+    --enforce-eager \
+    --num-prompts 256 \
+    --input-len $input_len \
+    --output-len 500 \
+    --protected-window-size 32 \
+    --compression-rate $cr \
+    --enable-kvc
+```
+Run scripts used for our experiments can be found in `benchmarks/scripts`.
 
 ## Citation
 
-If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
+If you use this work in your own research or projects, please cite our [paper](https://arxiv.org/abs/2410.00161):
 
 ```bibtex
-@inproceedings{kwon2023efficient,
-  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
-  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
-  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
-  year={2023}
+@misc{rehg2024kvcompresspagedkvcachecompression,
+  title={KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head},
+  author={Isaac Rehg},
+  year={2024},
+  eprint={2410.00161},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2410.00161},
 }
 ```