diff --git a/README.md b/README.md
index 8c75feb9..09902f69 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,8 @@ Efficient and easy multi-instance LLM serving
 
 ## 🔥 Latest News
 
-- [2024.7] We officially released the first version of Llumnix!
+- [2024.11] Llumnix v0.1.0 launched!
+- [2024.7] We officially released the first version of Llumnix.
 - [2024.6] We released our OSDI '24 [research paper](https://arxiv.org/abs/2406.03243) on arxiv.
 
 ## 🚀 Why Llumnix
@@ -22,14 +23,16 @@ Llumnix provides optimized multi-instance serving performance in terms of:
 
 - *Low latency*
   - **Reduced time-to-first-token** (TTFT) and queuing delays with less memory fragmentation
   - **Reduced time-between-tokens** (TBT) and preemption stalls with better load balancing
-- *High throughput* with integration with state-of-the-art inference engines
+- *High throughput*
+  - Integration with state-of-the-art inference engines
+  - Support for techniques like prefill-decoding disaggregation
 
 Llumnix achieves this with:
 
 - Dynamic, fine-grained, KV-cache-aware scheduling
 - Continuous **rescheduling** across instances
   - Enabled by a KV cache migration mechanism with near-zero overhead
-  - Exploited for continuous load balancing and de-fragmentation
+  - Exploited for continuous load balancing, de-fragmentation, and prefill-decoding disaggregation
 
 Llumnix is easy to use with:
 
@@ -61,17 +64,17 @@ Visit our [documentation](./docs/) to get started:
 - [Prefill-decoding Disaggregation](./docs/Prefill-decoding_Disaggregation.md)
 
 ## Performance
-We evaluate the performance of the KV-cache-aware load-balancing scheduler and migration mechanism of Llumnix with 16 Llama2-7B/Qwen1.5-7B instances, each using an A10 GPU (24GB).
+We evaluate the performance of the KV-cache-aware load-balancing scheduler and migration mechanism of Llumnix with 16 Qwen2.5-7B instances (each using an A10-24GB GPU) and 16 Llama2-13B instances (each using an A800-80GB GPU).
 
 We use Poisson distributions with different request rates to generate request arrivals. For the input/output lengths of requests, we use ShareGPT dataset.