
v0.7.0

github-actions released this 27 Jan 05:50 · commit 5204ff5

Highlights

  • vLLM's V1 engine is ready for testing! This is a rewritten engine designed for performance and architectural simplicity. You can turn it on by setting the environment variable VLLM_USE_V1=1. See our blog for more details. (44 commits)
  • New methods (LLM.sleep, LLM.wake_up, LLM.collective_rpc, LLM.reset_prefix_cache) in vLLM for post-training frameworks! (#12361, #12084, #12284)
  • torch.compile is now fully integrated in vLLM and enabled by default in V1. You can turn it on via the -O3 engine parameter. (#11614, #12243, #12043, #12191, #11677, #12182, #12246)
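As a minimal sketch of the V1 opt-in above (the model name and sleep level in the commented lines are illustrative placeholders, and an installed vllm package plus a GPU are assumed), enabling the new engine and using the post-training hooks might look like:

```python
import os

# VLLM_USE_V1 must be set before vllm is imported; the engine
# version is selected when the engine is constructed.
os.environ["VLLM_USE_V1"] = "1"

# With vllm installed (GPU assumed), the new post-training methods
# from this release would then be used roughly as follows:
#
#   from vllm import LLM
#   llm = LLM(model="facebook/opt-125m")   # placeholder model
#   llm.sleep(level=1)        # free GPU memory between training steps
#   llm.wake_up()             # restore the engine for the next rollout
#   llm.reset_prefix_cache()  # drop cached prefixes after a weight update
```

The sleep/wake pair is aimed at RLHF-style loops where the same GPUs alternate between training and generation.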

This release features

  • 400 commits from 132 contributors, including 57 new contributors.
    • 28 CI and build enhancements, including testing for nightly torch (#12270) and inclusion of genai-perf for benchmarking (#10704).
    • 58 documentation enhancements, including reorganized documentation structure (#11645, #11755, #11766, #11843, #11896).
    • More than 161 bug fixes and miscellaneous enhancements.

Features

Models

Hardwares

Features

  • Distributed:
    • Support torchrun and SPMD-style offline inference (#12071)
    • New collective_rpc abstraction (#12151, #11256)
  • API Server: Jina- and Cohere-compatible Rerank API (#12376)
  • Kernels:
    • Flash Attention 3 Support (#12093)
    • Fusion of Punica prefill kernels (#11234)
    • For DeepSeek V3: optimized moe_align_block_size for CUDA graphs and large num_experts (#12222)
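A hedged sketch of what a request to the new rerank endpoint might carry; the route, model name, and documents below are illustrative assumptions, not taken from the release notes:

```python
import json

# Jina-style rerank request body (all values are placeholders).
payload = {
    "model": "BAAI/bge-reranker-base",
    "query": "how do I enable the V1 engine?",
    "documents": [
        "Set VLLM_USE_V1=1 before starting vLLM.",
        "vLLM supports tensor parallelism.",
    ],
    "top_n": 1,
}
body = json.dumps(payload)

# Against a running OpenAI-compatible vLLM server, this body would be
# POSTed to the rerank route, e.g. with requests:
#   requests.post("http://localhost:8000/rerank", json=payload)
```

The response ranks the documents by relevance to the query, which lets existing Jina or Cohere rerank clients point at a vLLM server unchanged.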

Others

  • Benchmark: new script for CPU offloading (#11533)
  • Security: Set weights_only=True when using torch.load() (#12366)
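The reason for weights_only=True: torch.load() is built on pickle, and unpickling attacker-controlled data can execute arbitrary code. A minimal illustration of the hazard using plain pickle (the Gadget class and checkpoint path are hypothetical, for demonstration only):

```python
import pickle

class Gadget:
    # __reduce__ lets a pickled object name a callable to invoke on load;
    # a malicious checkpoint can abuse this to run arbitrary code.
    def __reduce__(self):
        return (eval, ("6 * 7",))

# Unpickling does not return a Gadget -- it calls eval("6 * 7"):
result = pickle.loads(pickle.dumps(Gadget()))
print(result)  # 42, proof that attacker-chosen code ran during loading

# weights_only=True restricts torch.load() to tensors and primitive
# containers, rejecting such callables:
#   state = torch.load("checkpoint.pt", weights_only=True)
```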

What's Changed

New Contributors

Full Changelog: v0.6.6...v0.7.0