
v0.7.0

github-actions released this 27 Jan 05:50 · commit 5204ff5

Highlights

  • vLLM's V1 engine is ready for testing! This is a rewritten engine designed for performance and architectural simplicity. You can turn it on by setting the environment variable VLLM_USE_V1=1. See our blog for more details. (44 commits)
  • New methods (LLM.sleep, LLM.wake_up, LLM.collective_rpc, LLM.reset_prefix_cache) in vLLM for post-training frameworks! (#12361, #12084, #12284)
  • torch.compile is now fully integrated in vLLM and enabled by default in V1. You can turn it on via the -O3 engine parameter. (#11614, #12243, #12043, #12191, #11677, #12182, #12246)
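As a minimal sketch of the V1 opt-in above (the model name and sleep level in the commented lines are illustrative placeholders, and an installed vllm package plus a GPU are assumed), enabling the new engine and using the post-training hooks might look like:

```python
import os

# VLLM_USE_V1 must be set before vllm is imported; the engine
# version is selected when the engine is constructed.
os.environ["VLLM_USE_V1"] = "1"

# With vllm installed (GPU assumed), the new post-training methods
# from this release would then be used roughly as follows:
#
#   from vllm import LLM
#   llm = LLM(model="facebook/opt-125m")   # placeholder model
#   llm.sleep(level=1)        # free GPU memory between training steps
#   llm.wake_up()             # restore the engine for the next rollout
#   llm.reset_prefix_cache()  # drop cached prefixes after a weight update
```

The sleep/wake pair is aimed at RLHF-style loops where the same GPUs alternate between training and generation.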

This release features

  • 400 commits from 132 contributors, including 57 new contributors.
    • 28 CI and build enhancements, including testing for nightly torch (#12270) and inclusion of genai-perf for benchmarking (#10704).
    • 58 documentation enhancements, including reorganized documentation structure (#11645, #11755, #11766, #11843, #11896).
    • More than 161 bug fixes and miscellaneous enhancements.

Features

Models

Hardwares

Features

  • Distributed:
    • Support torchrun and SPMD-style offline inference (#12071)
    • New collective_rpc abstraction (#12151, #11256)
  • API Server: Jina- and Cohere-compatible Rerank API (#12376)
  • Kernels:
    • Flash Attention 3 Support (#12093)
    • Fusion of Punica prefill kernels (#11234)
    • For DeepSeek V3: optimized moe_align_block_size for CUDA graphs and large num_experts (#12222)
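A hedged sketch of what a request to the new rerank endpoint might carry; the route, model name, and documents below are illustrative assumptions, not taken from the release notes:

```python
import json

# Jina-style rerank request body (all values are placeholders).
payload = {
    "model": "BAAI/bge-reranker-base",
    "query": "how do I enable the V1 engine?",
    "documents": [
        "Set VLLM_USE_V1=1 before starting vLLM.",
        "vLLM supports tensor parallelism.",
    ],
    "top_n": 1,
}
body = json.dumps(payload)

# Against a running OpenAI-compatible vLLM server, this body would be
# POSTed to the rerank route, e.g. with requests:
#   requests.post("http://localhost:8000/rerank", json=payload)
```

The response ranks the documents by relevance to the query, which lets existing Jina or Cohere rerank clients point at a vLLM server unchanged.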

Others

  • Benchmark: new script for CPU offloading (#11533)
  • Security: Set weights_only=True when using torch.load() (#12366)
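The reason for weights_only=True: torch.load() is built on pickle, and unpickling attacker-controlled data can execute arbitrary code. A minimal illustration of the hazard using plain pickle (the Gadget class and checkpoint path are hypothetical, for demonstration only):

```python
import pickle

class Gadget:
    # __reduce__ lets a pickled object name a callable to invoke on load;
    # a malicious checkpoint can abuse this to run arbitrary code.
    def __reduce__(self):
        return (eval, ("6 * 7",))

# Unpickling does not return a Gadget -- it calls eval("6 * 7"):
result = pickle.loads(pickle.dumps(Gadget()))
print(result)  # 42, proof that attacker-chosen code ran during loading

# weights_only=True restricts torch.load() to tensors and primitive
# containers, rejecting such callables:
#   state = torch.load("checkpoint.pt", weights_only=True)
```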

What's Changed

New Contributors

Full Changelog: v0.6.6...v0.7.0