Sync with upstream @ v0.6.5 #265

Merged
merged 398 commits into from
Dec 19, 2024
9a88f89
custom allreduce + torch.compile (#10121)
SageMoore Nov 26, 2024
9406353
[Misc] Remove outdated init protocols (#10655)
DarkLight1337 Nov 26, 2024
334d64d
[ci] add vllm_test_utils (#10659)
youkaichao Nov 26, 2024
1f6584e
[V1] Enable profile for LLMEngine (#10665)
jikunshang Nov 26, 2024
db66e01
[Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
andoorve Nov 26, 2024
f5792c7
[Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson (#9735)
conroy-cheers Nov 26, 2024
9a99273
[Bugfix] Fix using `-O[0,3]` with LLM entrypoint (#10677)
mgoin Nov 26, 2024
7576cd3
[Bugfix] Check bnb_4bit_quant_storage for bitsandbytes (#10642)
mgoin Nov 26, 2024
2f0a0a1
[V1] Refactor model executable interface for multimodal models (#10570)
ywang96 Nov 26, 2024
0a71900
Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
xuechendi Nov 27, 2024
0a4d968
[V1] Update interface for idefics3 (#10680)
ywang96 Nov 27, 2024
1bf905d
[Bugfix][SpecDecode] apply sampling parameters to target probabilitie…
jeongin601 Nov 27, 2024
cfb3bf2
[bugfix] fix the default value of llm_int8_threshold in BitsAndBytesC…
yansh97 Nov 27, 2024
e85250b
[Hardware][Gaudi]add get_name method for HPUAttentionBackend (#10667)
jikunshang Nov 27, 2024
15cc2a9
[Misc]Further reduce BNB static variable (#10597)
jeejeelee Nov 27, 2024
e225110
[Kernel] Remove if-else with identical branches in marlin 2:4 (#10687)
tlrmchlsmth Nov 27, 2024
1209261
[Model] Support telechat2 (#10311)
shunxing12345 Nov 27, 2024
418cb3b
[Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault (#10700)
bigPYJ1151 Nov 27, 2024
9e0a147
[V1] Update interface for mistral-format Pixtral (#10703)
ywang96 Nov 27, 2024
308cc5e
[ci] fix slow tests (#10698)
youkaichao Nov 27, 2024
c411def
[torch.compile] fix shape specialization (#10722)
youkaichao Nov 27, 2024
b98c62b
[Bugfix] Fix GGUF inference with FP16 unquantized checkpoint (#10675)
Isotr0py Nov 27, 2024
197b448
[Bugfix][Mamba] Fix Multistep on Mamba-like models (#10705)
mzusman Nov 27, 2024
9b4b150
[Bugfix] Ignore `lm_head` when loading embedding models (#10719)
DarkLight1337 Nov 27, 2024
395b1c7
[Frontend] don't block event loop in tokenization (preprocess) in Ope…
tomeras91 Nov 27, 2024
cb4e1c3
[misc] upgrade filelock version (#10731)
youkaichao Nov 28, 2024
70dc14f
[Model] support bitsandbytes quantization with minicpm3 model (#10682)
zixuanzhang226 Nov 28, 2024
278be67
[Doc] Update model in arch_overview.rst to match comment (#10701)
spacewander Nov 28, 2024
d9b4b3f
[Bug][CLI] Allow users to disable prefix caching explicitly (#10724)
rickyyx Nov 28, 2024
a79b122
[V1] Do not allocate beyond the max_model_len (#10730)
WoosukKwon Nov 28, 2024
9a8bff0
[Kernel] Update vllm-flash-attn version (#10736)
WoosukKwon Nov 28, 2024
3ed5e73
[TPU] Update requirements-tpu (#10726)
richardsliu Nov 28, 2024
5fc5ce0
[Model] Added GLM-4 series hf format model support vllm==0.6.4 (#10561)
sixsixcoder Nov 28, 2024
8c1e77f
[Kernel] Update vllm-flash-attn version to reduce CPU overheads (#10742)
WoosukKwon Nov 28, 2024
98f47f2
[V1] Optimize the CPU overheads in FlashAttention custom op (#10733)
WoosukKwon Nov 28, 2024
c83919c
[Model] Add Internlm2 LoRA support (#5064)
Isotr0py Nov 28, 2024
fa6ecb9
[Model] Clean up MiniCPMV (#10751)
DarkLight1337 Nov 29, 2024
c82b432
[Misc] typo find in sampling_metadata.py (#10740)
noooop Nov 29, 2024
3132aac
[Bugfix] Fix Idefics3 bug (#10778)
jeejeelee Nov 29, 2024
661175b
[platform] Add verify_quantization in platform. (#10757)
wangxiyuan Nov 29, 2024
40bc242
[Bugfix] Fix OpenVino/Neuron `driver_worker` init (#10779)
NickLucche Nov 30, 2024
16ee07f
[Model] Refactor Molmo weights loading to use AutoWeightsLoader (#10771)
Isotr0py Nov 30, 2024
e7cfc4e
[Interleaved ATTN] Support for Mistral-8B (#10591)
patrickvonplaten Nov 30, 2024
7e4bbda
[doc] format fix (#10789)
wangxiyuan Nov 30, 2024
1337071
[Model] Replace embedding models with pooling adapter (#10769)
DarkLight1337 Dec 1, 2024
f877a7d
[Misc] Improve type annotations for `support_torch_compile` (#10763)
DarkLight1337 Dec 1, 2024
d2f058e
[Misc] Rename embedding classes to pooling (#10801)
DarkLight1337 Dec 1, 2024
169a0ff
[doc] add warning about comparing hf and vllm outputs (#10805)
youkaichao Dec 1, 2024
c11f172
[Misc] Adding `MMMU-Pro` vision dataset to serving benchmark (#10804)
ywang96 Dec 1, 2024
0590ec3
[Core] Implement disagg prefill by StatelessProcessGroup (#10502)
KuntaiDu Dec 2, 2024
b18c9bb
[Model] Add BNB support to Llava and Pixtral-HF (#10795)
Isotr0py Dec 2, 2024
b795477
[core] Avoid metrics log noise when idle - include speculative decodi…
cduk Dec 2, 2024
073a4bd
[Kernel] Use `out` arg in flash_attn_varlen_func (#10811)
WoosukKwon Dec 2, 2024
e25810a
Fill TorchSDPAAttentionMetadata seq_lens_field for prefill (#10799)
maxdebayser Dec 2, 2024
63a1641
[misc] remove xverse modeling file (#10814)
youkaichao Dec 2, 2024
995a148
[doc]Update config docstring (#10732)
wangxiyuan Dec 2, 2024
ef31eab
[Model]: add some tests for aria model (#10770)
xffxff Dec 2, 2024
e95f275
[CI/Build] Update `mistral_common` version for tests and docs (#10825)
DarkLight1337 Dec 2, 2024
a4c4daf
[misc] use out argument for flash attention (#10822)
youkaichao Dec 2, 2024
b45f0d7
[Misc][LoRA] Move the implementation of lora bias to punica.py (#10829)
jeejeelee Dec 2, 2024
519cc6c
[Misc][XPU] Avoid torch compile for XPU platform (#10747)
yma11 Dec 2, 2024
9b14d97
Fix openvino on GPU (#10793)
janimo Dec 2, 2024
4c05edb
[Model] Add TP and BNB quantization support to LlavaMultiModalProject…
Isotr0py Dec 2, 2024
4433195
[Bugfix] Prevent benchmark_throughput.py from using duplicated random…
mgoin Dec 3, 2024
d746268
[Model] support bitsandbytes quantization with minicpm model (#10842)
zixuanzhang226 Dec 3, 2024
a4cf256
[Bugfix] Fix QKVParallelLinearWithShardedLora bias bug (#10844)
jeejeelee Dec 3, 2024
21fe7b4
[core][distributed] add pynccl broadcast (#10843)
youkaichao Dec 3, 2024
dc5ce86
[torch.compile] remove compilation_context and simplify code (#10838)
youkaichao Dec 3, 2024
ef51831
[Doc] Add github links for source code references (#10672)
russellb Dec 3, 2024
3257d44
[Misc] Remove deprecated names (#10817)
DarkLight1337 Dec 3, 2024
9323a31
[Core][Performance] Add XGrammar support for guided decoding and set …
aarnphm Dec 3, 2024
f6084f6
[Speculative Decoding] Move indices to device before filtering output…
zhengy001 Dec 3, 2024
3bc94ca
[V1] VLM - Run the mm_mapper preprocessor in the frontend process (#1…
alexm-neuralmagic Dec 3, 2024
2f2cdc7
[MISC][XPU] quick fix for XPU CI (#10859)
yma11 Dec 3, 2024
7090c27
[Bugfix] Only require XGrammar on x86 (#10865)
mgoin Dec 3, 2024
7c32b68
[Frontend] correctly record prefill and decode time metrics (#10853)
tomeras91 Dec 3, 2024
a061fe6
[Build][Bugfix] Using the correct type hint (#10866)
gshtras Dec 3, 2024
381ac93
[Benchmark] Benchmark structured output with datasets (#10557)
xuechendi Dec 4, 2024
d2bd88b
[CI/Build] Replace mean with torch.all in test_pynccl.py (#10876)
tlrmchlsmth Dec 4, 2024
b5b647b
Drop ROCm load format check (#10767)
wangxiyuan Dec 4, 2024
fa2dea6
[ci/build] Change queue name for Release jobs (#10875)
khluu Dec 4, 2024
c9ca4fc
[ci/build] Job to build and push release image (#10877)
khluu Dec 4, 2024
8db957e
[bugfix] fixed parameter “n” when set parameter “bestof” > 1 (#10854)
o2363286 Dec 4, 2024
c92acb9
[ci/build] Update vLLM postmerge ECR repo (#10887)
khluu Dec 4, 2024
01d079f
[LoRA] Change lora_tokenizers capacity (#10796)
xyang16 Dec 4, 2024
10398b4
[Model] Consolidate ViTs attention implementation without mask (#10893)
Isotr0py Dec 4, 2024
82eb5ea
Benchmark serving structured output (#10880)
xuechendi Dec 4, 2024
e4c34c2
[CI/Build] improve python-only dev setup (#9621)
dtrifiro Dec 4, 2024
2a56e12
[V1] Fix when max_model_len is not divisible by block_size (#10903)
WoosukKwon Dec 5, 2024
7883c2b
[benchmark] Make H100 benchmark optional (#10908)
khluu Dec 5, 2024
8d370e9
[Bugfix] Fallback to outlines for complex json schemas (#10899)
mgoin Dec 5, 2024
aa39a8e
[Doc] Create a new "Usage" section (#10827)
DarkLight1337 Dec 5, 2024
1f958a7
[Bugfix] Fix BNB loader target_modules (#10720)
jeejeelee Dec 5, 2024
39c89e7
[Misc] Update llama 3.2 template to support system prompt with images…
tjohnson31415 Dec 5, 2024
571da8f
[Misc][LoRA] Clean up the function interface of Punica (#10917)
jeejeelee Dec 5, 2024
998eeaf
[CI/Build] Bump test transformers version (#10106)
Isotr0py Dec 5, 2024
a430652
[Misc][Gaudi] Avoid torch.compile and enable lazy collectives (#10897)
kzawora-intel Dec 5, 2024
9743d64
[ci][build] add tests for python only compilation (#10915)
youkaichao Dec 5, 2024
db87eb6
[torch.compile] use size tuning for specific sizes (#10933)
youkaichao Dec 6, 2024
b031a45
[torch.compile] add logging for compilation time (#10941)
youkaichao Dec 6, 2024
222f5b0
[CI/Build] Fix broken multimodal test (#10950)
DarkLight1337 Dec 6, 2024
a1887f2
[torch.compile] fix deprecated code (#10948)
youkaichao Dec 6, 2024
8b59631
[Core] Support Lark grammars for XGrammar (#10870)
mgoin Dec 6, 2024
7406274
[Doc] add KubeAI to serving integrations (#10837)
samos123 Dec 6, 2024
c05cfb6
[misc] fix typo (#10960)
youkaichao Dec 6, 2024
dcdc3fa
[ci] fix broken tests (#10956)
youkaichao Dec 6, 2024
69d357b
[Core] Cleanup startup logging a bit (#10961)
russellb Dec 7, 2024
acf092d
[Bugfix] Fix test-pipeline.yaml (#10973)
jeejeelee Dec 7, 2024
955fa95
[3/N] Support and implement merged input processor for LLaVA model (#…
DarkLight1337 Dec 7, 2024
f13cf9a
[Build] Fix for the Wswitch-bool clang warning (#10060)
gshtras Dec 7, 2024
b26b4cd
[Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora imple…
Isotr0py Dec 7, 2024
bf0e382
[Model] Composite weight loading for multimodal Qwen2 (#10944)
DarkLight1337 Dec 7, 2024
1c768fe
[Doc] Explicitly state that InternVL 2.5 is supported (#10978)
DarkLight1337 Dec 7, 2024
39e227c
[Model] Update multi-modal processor to support Mantis(LLaVA) model (…
DarkLight1337 Dec 7, 2024
c889d58
[Doc] Explicitly state that PP isn't compatible with speculative deco…
DarkLight1337 Dec 7, 2024
78029b3
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when con…
xffxff Dec 7, 2024
1b62745
[core][executor] simplify instance id (#10976)
youkaichao Dec 7, 2024
7be15d9
[core][misc] remove use_dummy driver for _run_workers (#10920)
youkaichao Dec 7, 2024
fd57d2b
[torch.compile] allow candidate compile sizes (#10984)
youkaichao Dec 8, 2024
a11f326
[V1] Initial support of multimodal models for V1 re-arch (#10699)
ywang96 Dec 8, 2024
43b05fa
[torch.compile][misc] fix comments (#10993)
youkaichao Dec 8, 2024
46004e8
[misc] clean up and unify logging (#10999)
youkaichao Dec 9, 2024
af7c4a9
[Doc][V1] Add V1 support column for multimodal models (#10998)
ywang96 Dec 9, 2024
d1c2e15
[torch.compile] add dynamo time tracking (#11005)
youkaichao Dec 9, 2024
c690357
[V1] Fix Detokenizer loading in `AsyncLLM` (#10997)
ywang96 Dec 9, 2024
e691b26
[Core] Require xgrammar >= 0.1.6 (#11021)
russellb Dec 9, 2024
aea2fc3
[Platform] Move `async output` check to platform (#10768)
wangxiyuan Dec 9, 2024
25b79d9
[V1] Input Batch Relocation (#10962)
varun-sundar-rabindranath Dec 9, 2024
edc4fa3
[ci/build] Recompile CI dependencies list with Python 3.12 (#11013)
khluu Dec 9, 2024
3b61cb4
[V1] Further reduce CPU overheads in flash-attn (#10989)
WoosukKwon Dec 9, 2024
ca87149
[Misc][LoRA] Abstract PunicaWrapper (#10955)
jeejeelee Dec 9, 2024
a811dd6
[Model] merged input processor for Phi-3-Vision models (#10977)
Isotr0py Dec 9, 2024
cbcbdb1
[Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version (#11028)
kzawora-intel Dec 9, 2024
1a2f8fb
[v1] fix use compile sizes (#11000)
youkaichao Dec 9, 2024
9c6459e
[Neuron] Upgrade neuron to 2.20.2 (#11016)
xendo Dec 9, 2024
b63ba84
[ROCm][bugfix] scpecilative decoding worker class (#11035)
gshtras Dec 9, 2024
5ed5d5f
Build tpu image in release pipeline (#10936)
richardsliu Dec 9, 2024
6faec54
[V1] Do not store `None` in self.generators (#11038)
WoosukKwon Dec 9, 2024
6d52528
[Docs] Add dedicated tool calling page to docs (#10554)
mgoin Dec 10, 2024
d1f6d1c
[Model] Add has_weight to RMSNorm and re-enable weights loading track…
Isotr0py Dec 10, 2024
391d7b2
[Bugfix] Fix usage of `deprecated` decorator (#11025)
DarkLight1337 Dec 10, 2024
980ad39
[Frontend] Use request id from header (#10968)
joerunde Dec 10, 2024
bc192a2
[Pixtral] Improve loading (#11040)
patrickvonplaten Dec 10, 2024
28b3a1c
[V1] Multiprocessing Tensor Parallel Support for v1 (#9856)
tlrmchlsmth Dec 10, 2024
ebf7780
monitor metrics of tokens per step using cudagraph batchsizes (#11031)
youkaichao Dec 10, 2024
e35879c
[Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig o…
sjuxax Dec 10, 2024
bfd6104
Update README.md (#11034)
dmoliveira Dec 10, 2024
82c73fd
[Bugfix] cuda error running llama 3.2 (#11047)
GeneDer Dec 10, 2024
fe2e10c
Add example of helm chart for vllm deployment on k8s (#9199)
mfournioux Dec 10, 2024
beb16b2
[Bugfix] Handle <|tool_call|> token in granite tool parser (#11039)
tjohnson31415 Dec 10, 2024
d05f886
[Misc][LoRA] Add PEFTHelper for LoRA (#11003)
jeejeelee Dec 10, 2024
9b9cef3
[Bugfix] Backport request id validation to v0 (#11036)
joerunde Dec 10, 2024
250ee65
[BUG] Remove token param #10921 (#11022)
flaviabeo Dec 10, 2024
e739194
[Core] Update to outlines >= 0.1.8 (#10576)
russellb Dec 10, 2024
75f89dc
[torch.compile] add a flag to track batchsize statistics (#11059)
youkaichao Dec 10, 2024
134810b
[V1][Bugfix] Always set enable_chunked_prefill = True for V1 (#11061)
WoosukKwon Dec 10, 2024
9a93973
[Bugfix] Fix Mamba multistep (#11071)
tlrmchlsmth Dec 11, 2024
d5c5154
[Misc] LoRA + Chunked Prefill (#9057)
aurickq Dec 11, 2024
ffa48c9
[Model] PP support for Mamba-like models (#10992)
mzusman Dec 11, 2024
e39400a
Fix streaming for granite tool call when <|tool_call|> is present (#1…
maxdebayser Dec 11, 2024
2e33fe4
[CI/Build] Check transformers v4.47 (#10991)
DarkLight1337 Dec 11, 2024
3fb4b4f
[ci/build] Fix AMD CI dependencies (#11087)
khluu Dec 11, 2024
9974fca
[ci/build] Fix entrypoints test and pin outlines version (#11088)
khluu Dec 11, 2024
61b1d2f
[Core] v1: Use atexit to handle engine core client shutdown (#11076)
russellb Dec 11, 2024
2e32f5d
[Bugfix] Fix Idefics3 fails during multi-image inference (#11080)
B-201 Dec 11, 2024
40766ca
[Bugfix]: Clamp `-inf` logprob values in prompt_logprobs (#11073)
rafvasq Dec 11, 2024
8f10d5e
[Misc] Split up pooling tasks (#10820)
DarkLight1337 Dec 11, 2024
cad5c0a
[Doc] Update docs to refer to pooling models (#11093)
DarkLight1337 Dec 11, 2024
b2f7754
[CI/Build] Enable prefix caching test for AMD (#11098)
hissu-hyvarinen Dec 11, 2024
fd22220
[Doc] Installed version of llmcompressor for int8/fp8 quantization (#…
bingps Dec 11, 2024
91642db
[torch.compile] use depyf to dump torch.compile internals (#10972)
youkaichao Dec 11, 2024
d643c2a
[V1] Use input_ids as input for text-only models (#11032)
WoosukKwon Dec 11, 2024
66aaa77
[torch.compile] remove graph logging in ci (#11110)
youkaichao Dec 11, 2024
72ff3a9
[core] Bump ray to use _overlap_gpu_communication in compiled graph t…
ruisearch42 Dec 11, 2024
d1e21a9
[CI/Build] Split up VLM tests (#11083)
DarkLight1337 Dec 11, 2024
452a723
[V1][Core] Remove should_shutdown to simplify core process terminatio…
tlrmchlsmth Dec 11, 2024
4e11683
[V1] VLM preprocessor hashing (#11020)
alexm-neuralmagic Dec 12, 2024
7439a8b
[Bugfix] Multiple fixes to tool streaming with hermes and mistral (#1…
cedonley Dec 12, 2024
8fb26da
[Docs] Add media kit (#11121)
simon-mo Dec 12, 2024
24a36d6
Update link to LlamaStack remote vLLM guide in serving_with_llamastac…
terrytangyuan Dec 12, 2024
ccede2b
[Core] cleanup zmq ipc sockets on exit (#11115)
russellb Dec 12, 2024
1da8f0e
[Model] Add support for embedding model GritLM (#10816)
pooyadavoodi Dec 12, 2024
f092153
[V1] Use more persistent buffers to optimize input preparation overhe…
WoosukKwon Dec 12, 2024
8195824
[Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) (#1…
SanjuCSudhakaran Dec 12, 2024
62de37a
[core][distributed] initialization from StatelessProcessGroup (#10986)
youkaichao Dec 12, 2024
85362f0
[Misc][LoRA] Ensure Lora Adapter requests return adapter name (#11094)
Jeffwan Dec 12, 2024
4816d20
[V1] Fix torch profiling for offline inference (#11125)
ywang96 Dec 12, 2024
d4d5291
fix(docs): typo in helm install instructions (#11141)
ramonziai Dec 12, 2024
5d71257
[Bugfix] Quick fix to make Pixtral-HF load correctly again after 39e2…
sjuxax Dec 12, 2024
2c97eca
[Misc] Validate grammar and fail early (#11119)
comaniac Dec 12, 2024
9f3974a
Fix logging of the vLLM Config (#11143)
JArnoldAMD Dec 12, 2024
db6c264
[Bugfix] Fix value unpack error of simple connector for KVCache trans…
ShangmingCai Dec 12, 2024
78ed8f5
[Misc][V1] Fix type in v1 prefix caching (#11151)
comaniac Dec 13, 2024
30870b4
[torch.compile] Dynamic fp8 + rms_norm fusion (#10906)
ProExpertProg Dec 13, 2024
1efce68
[Bugfix] Use runner_type instead of task in GritLM (#11144)
pooyadavoodi Dec 13, 2024
3989a79
[Bugfix] Update starcoder2 to remap k/v scale names for kv_cache quan…
dsikka Dec 13, 2024
00c1bde
[ROCm][AMD] Disable auto enabling chunked prefill on ROCm (#11146)
gshtras Dec 13, 2024
34f1a80
[Bugfix][V1] Fix 'NoneType' object has no attribute 'hash_value' (#11…
comaniac Dec 13, 2024
be39e3c
[core] clean up cudagraph batchsize padding logic (#10996)
youkaichao Dec 13, 2024
7cd7409
PaliGemma 2 support (#11142)
janimo Dec 13, 2024
f93bf2b
[Bugfix][CI][CPU] add missing datasets package to requirements-cpu.tx…
bigPYJ1151 Dec 13, 2024
eeec9e3
[Frontend] Separate pooling APIs in offline inference (#11129)
DarkLight1337 Dec 13, 2024
969da7d
[V1][VLM] Fix edge case bug for InternVL2 (#11165)
ywang96 Dec 13, 2024
d1fa714
[Refactor]A simple device-related refactor (#11163)
noemotiovon Dec 13, 2024
c31d4a5
[Core] support LoRA and prompt adapter in content-based hashing for B…
llsj14 Dec 13, 2024
5b0ed83
[Bugfix] using len(tokenizer) instead of tokenizer.vocab_size in Allo…
zhangjf-nlp Dec 13, 2024
238c0d9
[Misc] Add tokenizer_mode param to benchmark_serving.py (#11174)
alexm-neuralmagic Dec 13, 2024
0920ab9
[Doc] Reorganize online pooling APIs (#11172)
DarkLight1337 Dec 13, 2024
0a56bcc
[Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend (#11169)
janimo Dec 13, 2024
0d8451c
[Distributed] Allow the placement group more time to wait for resourc…
Jeffwan Dec 13, 2024
4863e5f
[Core] V1: Use multiprocessing by default (#11074)
russellb Dec 14, 2024
4b5b8a6
[V1][Bugfix] Fix EngineCoreProc profile (#11185)
tlrmchlsmth Dec 14, 2024
9855aea
[Bugfix][V1] Re-compute an entire block when fully cache hit (#11186)
comaniac Dec 14, 2024
24a3d12
update compressed-tensors to latest version (#11183)
dhuangnm Dec 14, 2024
4825926
[Core] Update outlines and increase its threadpool size (#11140)
russellb Dec 14, 2024
ea7bd68
[V1][Bugfix] Fix V1 TP trust-remote-code (#11182)
tlrmchlsmth Dec 14, 2024
3cb5769
[Misc] Minor improvements to the readability of PunicaWrapperBase (#1…
jeejeelee Dec 14, 2024
9c3dadd
[Frontend] Add `logits_processors` as an extra completion argument (#…
bradhilton Dec 14, 2024
93abf23
[VLM] Fully dynamic prompt replacement in merged input processor (#11…
DarkLight1337 Dec 14, 2024
6d917d0
Enable mypy checking on V1 code (#11105)
markmc Dec 14, 2024
8869368
[Performance][Core] Optimize the performance of evictor v1 and v2 by …
llsj14 Dec 14, 2024
15859f2
[[Misc]Upgrade bitsandbytes to the latest version 0.45.0 (#11201)
jeejeelee Dec 15, 2024
a1c0205
[torch.compile] allow tracking forward time (#11081)
youkaichao Dec 15, 2024
b10609e
[Misc] Clean up multi-modal processor (#11207)
DarkLight1337 Dec 15, 2024
96d673e
[Bugfix] Fix error handling of unsupported sliding window (#11213)
DarkLight1337 Dec 15, 2024
38e599d
[Doc] add documentation for disaggregated prefilling (#11197)
KuntaiDu Dec 15, 2024
d263bd9
[Core] Support disaggregated prefill with Mooncake Transfer Engine (#…
ShangmingCai Dec 15, 2024
25ebed2
[V1][Minor] Cache np arange to reduce input preparation overhead (#11…
WoosukKwon Dec 15, 2024
da6f409
Update deploying_with_k8s.rst (#10922)
AlexHe99 Dec 16, 2024
69ba344
[Bugfix] Fix block size validation (#10938)
chenqianfzh Dec 16, 2024
17138af
[Bugfix] Fix the default value for temperature in ChatCompletionReque…
yansh97 Dec 16, 2024
b3b1526
WIP: [CI/Build] simplify Dockerfile build for ARM64 / GH200 (#11212)
cennn Dec 16, 2024
bddbbcb
[Model] Support Cohere2ForCausalLM (Cohere R7B) (#11203)
janimo Dec 16, 2024
d927dbc
[Model] Refactor Ultravox to use merged input processor (#11198)
Isotr0py Dec 16, 2024
2ca830d
[Doc] Reorder vision language examples in alphabet order (#11228)
Isotr0py Dec 16, 2024
efbce85
[misc] Layerwise profile updates (#10242)
varun-sundar-rabindranath Dec 16, 2024
551603f
[core] overhaul memory profiling and fix backward compatibility (#10511)
youkaichao Dec 16, 2024
35ffa68
[Docs] hint to enable use of GPU performance counters in profiling to…
bk-TurbaAI Dec 16, 2024
c301616
[ci][tests] add gh200 tests (#11244)
youkaichao Dec 16, 2024
88a412e
[torch.compile] fast inductor (#11108)
youkaichao Dec 17, 2024
35bae11
fix gh200 tests on main (#11246)
youkaichao Dec 17, 2024
0064f69
[CI] Add test case with JSON schema using references + use xgrammar b…
mgoin Dec 17, 2024
66d4b16
[Frontend] Add OpenAI API support for input_audio (#11027)
kylehh Dec 17, 2024
59c9b6e
[V1][VLM] Proper memory profiling for image language models (#11210)
ywang96 Dec 17, 2024
e88db68
[Platform] platform agnostic for EngineArgs initialization (#11225)
wangxiyuan Dec 17, 2024
2bfdbf2
[V1][Core] Use weakref.finalize instead of atexit (#11242)
tlrmchlsmth Dec 17, 2024
02222a0
[Misc] Kernel Benchmark for `RMSNorm` (#11241)
ywang96 Dec 17, 2024
f9ecbb1
[Misc] Allow passing logits_soft_cap for xformers backend (#11252)
Isotr0py Dec 17, 2024
2d1b9ba
[Bugfix] Fix request cancellation without polling (#11190)
joerunde Dec 17, 2024
438ea3c
Sync with upstream @ v0.6.5
dtrifiro Dec 18, 2024
65 changes: 48 additions & 17 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -9,16 +9,19 @@ steps:
- image: badouralix/curl-jq
command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh

- wait

- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
- image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
resources:
@@ -41,20 +44,48 @@ steps:
- name: devshm
emptyDir:
medium: Memory
# - label: "H100"
# agents:
# queue: H100
# plugins:
# - docker#v5.11.0:
# image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
# command:
# - bash
# - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
# mount-buildkite-agent: true
# propagate-environment: true
# ipc: host
# gpus: all
# environment:
# - VLLM_USAGE_SOURCE
# - HF_TOKEN

- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN

- block: "Run H100 Benchmark"
key: block-h100
depends_on: ~

- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: block-h100
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
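The H100 step above is gated behind a `block` step through `key` / `depends_on`, and the block step itself carries `depends_on: ~` (YAML null), which, as far as Buildkite's dependency documentation describes, marks a step as having no dependencies. The following is a minimal sketch of that relationship, not part of the PR: the steps are modelled as plain Python dicts, and `STEPS`, `as_list`, and `gated_labels` are hypothetical names introduced only for illustration.

```python
from typing import Any, Dict, List

# Hypothetical, trimmed-down model of the steps in the hunk above.
STEPS: List[Dict[str, Any]] = [
    {"label": "H200", "agents": {"queue": "H200"}},
    # `depends_on: ~` in YAML becomes None here.
    {"block": "Run H100 Benchmark", "key": "block-h100", "depends_on": None},
    {"label": "H100", "agents": {"queue": "H100"}, "depends_on": "block-h100"},
]


def as_list(value: Any) -> List[str]:
    """Normalize a depends_on value (None, a string, or a list) to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def gated_labels(steps: List[Dict[str, Any]]) -> List[str]:
    """Return labels of command steps that wait on a manual block step."""
    block_keys = {s["key"] for s in steps if "block" in s and "key" in s}
    return [
        s["label"]
        for s in steps
        if "label" in s
        and any(dep in block_keys for dep in as_list(s.get("depends_on")))
    ]


if __name__ == "__main__":
    print(gated_labels(STEPS))
```

In this model only the H100 step waits for someone to unblock it, while A100 and H200 run unconditionally, which matches the intent stated in the commit "[benchmark] Make H100 benchmark optional (#10908)".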
@@ -157,6 +157,18 @@ def results_to_json(latency, throughput, serving):
throughput_results,
serving_results)

for df in [latency_results, serving_results, throughput_results]:
if df.empty:
continue

# Sort all dataframes by their respective "Test name" columns
df.sort_values(by="Test name", inplace=True)

# The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
# we want to turn it into "8xGPUTYPE"
df["GPU"] = df["GPU"].apply(
lambda x: f"{len(x.split('\n'))}x{x.split('\n')[0]}")

# get markdown tables
latency_md_table = tabulate(latency_results,
headers='keys',
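One detail of the added lambda is worth spelling out: it places `'\n'` inside the f-string's braces, and a backslash in an f-string replacement field only parses on Python 3.12 and later (earlier versions raise a `SyntaxError`). The sketch below shows the same transformation with the split done outside the f-string so it runs on older interpreters too. The helper name `collapse_gpu_label` is hypothetical and is not part of the PR.

```python
def collapse_gpu_label(raw: str) -> str:
    # Input is a newline-separated list of GPU names, one per device.
    # Splitting before formatting keeps the backslash out of the f-string
    # braces, which Python versions before 3.12 do not accept.
    names = raw.split("\n")
    # Same result as the diff's lambda: "<count>x<first name>".
    return f"{len(names)}x{names[0]}"


if __name__ == "__main__":
    print(collapse_gpu_label("H100\nH100\nH100\nH100"))
```

With a DataFrame, this could be applied as `df["GPU"].apply(collapse_gpu_label)`. As in the original expression, a trailing newline in the input would be counted as one extra entry.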
@@ -6,6 +6,7 @@

# Do not set -e, as the mixtral 8x22B model tends to crash occasionally
# and we still want to see other benchmarking results even when mixtral crashes.
set -x
set -o pipefail

check_gpus() {
@@ -85,11 +86,7 @@ kill_gpu_processes() {

ps -aux
lsof -t -i:8000 | xargs -r kill -9
pkill -f pt_main_thread
# this line doesn't work now
# ps aux | grep python | grep openai | awk '{print $2}' | xargs -r kill -9
pkill -f python3
pkill -f /usr/bin/python3
pgrep python3 | xargs -r kill -9


# wait until GPU memory usage smaller than 1GB
@@ -289,7 +286,7 @@ run_serving_tests() {
# run the server
echo "Running test case $test_name"
echo "Server command: $server_command"
eval "$server_command" &
bash -c "$server_command" &
server_pid=$!

# wait until the server is alive
@@ -322,7 +319,7 @@ run_serving_tests() {
echo "Running test case $test_name with qps $qps"
echo "Client command: $client_command"

eval "$client_command"
bash -c "$client_command"

# record the benchmarking commands
jq_output=$(jq -n \
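The two `run_serving_tests()` hunks above swap `eval` for `bash -c`. The practical difference is that `eval` executes the command string in the current shell, whereas `bash -c` hands it to a child shell, so variable assignments, `cd`, or `set` options inside the command string cannot leak into the benchmark script. Below is a rough Python analogue of that isolation, given only as a sketch: `start_in_background` and `run_and_wait` are hypothetical helpers, and `bash` is assumed to be on `PATH`.

```python
import os
import subprocess


def start_in_background(command: str) -> subprocess.Popen:
    # Rough analogue of `bash -c "$server_command" &`: the command string
    # runs in its own child shell and we keep a handle to the process.
    return subprocess.Popen(["bash", "-c", command])


def run_and_wait(command: str) -> int:
    # Rough analogue of `bash -c "$client_command"`: run to completion in
    # a child shell and return its exit status.
    return subprocess.run(["bash", "-c", command], check=False).returncode


if __name__ == "__main__":
    cwd_before = os.getcwd()
    status = run_and_wait("cd / && BENCH_FLAG=1 && exit 3")
    # The child's `cd` and variable assignment do not affect this process.
    print(status, os.getcwd() == cwd_before, "BENCH_FLAG" in os.environ)
```

The handle returned by `start_in_background` plays the role of `server_pid=$!` in the script: it is what the caller later waits on or kills.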
4 changes: 2 additions & 2 deletions .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
@@ -1,6 +1,6 @@
#!/bin/sh
TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-test-repo:pull" | jq -r .token)
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"
TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-postmerge-repo:pull" | jq -r .token)
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-postmerge-repo/manifests/$BUILDKITE_COMMIT"

TIMEOUT_SECONDS=10
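The only change in this file is the repository name inside the token and manifest URLs. The sketch below rebuilds both URLs from a repository name and commit so the difference is easy to see. The rest of the script is collapsed in this view, so the `wait_until` polling helper is an assumption about how such a wait is typically written, not a transcription of the script. All function names are hypothetical.

```python
import time
from typing import Callable

REGISTRY = "public.ecr.aws"
NAMESPACE = "q9t5s3a7"


def token_url(repo: str) -> str:
    # Mirrors the TOKEN request URL in the hunk above.
    return (f"https://{REGISTRY}/token?service={REGISTRY}"
            f"&scope=repository:{NAMESPACE}/{repo}:pull")


def manifest_url(repo: str, commit: str) -> str:
    # Mirrors the URL variable in the hunk above.
    return f"https://{REGISTRY}/v2/{NAMESPACE}/{repo}/manifests/{commit}"


def wait_until(check: Callable[[], bool], attempts: int,
               delay_seconds: float) -> bool:
    # Assumed polling shape: call `check` until it succeeds or we give up.
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay_seconds)
    return False


if __name__ == "__main__":
    print(token_url("vllm-ci-postmerge-repo"))
    print(manifest_url("vllm-ci-postmerge-repo", "<commit-sha>"))
```

In a real run, `check` would issue the authenticated manifest request and report whether the image for the commit exists yet; it is left as a caller-supplied function here so the sketch makes no network calls.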

33 changes: 31 additions & 2 deletions .buildkite/release-pipeline.yaml
@@ -1,7 +1,7 @@
steps:
- label: "Build wheel - CUDA 12.1"
agents:
queue: cpu_queue
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
@@ -18,11 +18,40 @@ steps:
- label: "Build wheel - CUDA 11.8"
# depends_on: block-build-cu118-wheel
agents:
queue: cpu_queue
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/upload-wheels.sh"
env:
DOCKER_BUILDKIT: "1"

- block: "Build release image"
depends_on: ~
key: block-release-image-build

- label: "Build release image"
depends_on: block-release-image-build
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"

- label: "Build and publish TPU release image"
depends_on: ~
if: build.env("NIGHTLY") == "1"
agents:
queue: tpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --tag vllm/vllm-tpu:nightly --tag vllm/vllm-tpu:$BUILDKITE_COMMIT --progress plain -f Dockerfile.tpu ."
- "docker push vllm/vllm-tpu:nightly"
- "docker push vllm/vllm-tpu:$BUILDKITE_COMMIT"
plugins:
- docker-login#v3.0.0:
username: vllm
password-env: DOCKERHUB_TOKEN
env:
DOCKER_BUILDKIT: "1"
1 change: 0 additions & 1 deletion .buildkite/run-amd-test.sh
@@ -85,7 +85,6 @@ if [[ $commands == *" kernels "* ]]; then
--ignore=kernels/test_encoder_decoder_attn.py \
--ignore=kernels/test_flash_attn.py \
--ignore=kernels/test_flashinfer.py \
--ignore=kernels/test_gguf.py \
--ignore=kernels/test_int8_quant.py \
--ignore=kernels/test_machete_gemm.py \
--ignore=kernels/test_mamba_ssm.py \
44 changes: 3 additions & 41 deletions .buildkite/run-cpu-test-ppc64le.sh
@@ -4,49 +4,11 @@
# It serves a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
docker build -t cpu-test -f Dockerfile.ppc64le .

# Setup cleanup
remove_docker_container() { docker rm -f cpu-test || true; }
remove_docker_container() { docker rm -f cpu-test || true; docker system prune -f; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image, setting --shm-size=4g for tensor parallel.
source /etc/environment
#docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN="$HF_TOKEN" --name cpu-test cpu-test

function cpu_tests() {
set -e

# Run basic model test
docker exec cpu-test bash -c "
set -e
pip install pytest pytest-asyncio \
decord einops librosa peft Pillow sentence-transformers soundfile \
transformers_stream_generator matplotlib datamodel_code_generator
pip install torchvision --index-url https://download.pytorch.org/whl/cpu
pytest -v -s tests/models/decoder_only/language -m cpu_model
pytest -v -s tests/models/embedding/language -m cpu_model
pytest -v -s tests/models/encoder_decoder/language -m cpu_model
pytest -v -s tests/models/decoder_only/audio_language -m cpu_model
pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"

# online inference
docker exec cpu-test bash -c "
set -e
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m &
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \
--backend vllm \
--dataset-name random \
--model facebook/opt-125m \
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"
}
# Try building the docker image
docker build -t cpu-test -f Dockerfile.ppc64le .

# All of CPU tests are expected to be finished less than 25 mins.
export -f cpu_tests
timeout 25m bash -c "cpu_tests"
25 changes: 16 additions & 9 deletions .buildkite/run-cpu-test.sh
@@ -13,26 +13,27 @@ numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test -f Dockerfile.
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .

# Setup cleanup
remove_docker_container() { docker rm -f cpu-test cpu-test-avx2 || true; }
remove_docker_container() { docker rm -f cpu-test-"$NUMA_NODE" cpu-test-avx2-"$NUMA_NODE" || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image, setting --shm-size=4g for tensor parallel.
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2 cpu-test-avx2
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2-"$NUMA_NODE" cpu-test-avx2

function cpu_tests() {
set -e
export NUMA_NODE=$2

# offline inference
docker exec cpu-test-avx2 bash -c "
docker exec cpu-test-avx2-"$NUMA_NODE" bash -c "
set -e
python3 examples/offline_inference.py"

# Run basic model test
docker exec cpu-test bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pip install pytest pytest-asyncio \
decord einops librosa peft Pillow sentence-transformers soundfile \
@@ -45,20 +46,26 @@ function cpu_tests() {
pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"

# Run compressed-tensor test
docker exec cpu-test bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token"

# Run AWQ test
docker exec cpu-test bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_ipex_quant.py"

# Run chunked-prefill and prefix-cache test
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v -k cpu_model \
tests/basic_correctness/test_chunked_prefill.py"

# online inference
docker exec cpu-test bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
export VLLM_CPU_KVCACHE_SPACE=10
export VLLM_CPU_OMP_THREADS_BIND=$1
@@ -75,4 +82,4 @@ function cpu_tests() {

# All of CPU tests are expected to be finished less than 25 mins.
export -f cpu_tests
timeout 25m bash -c "cpu_tests $CORE_RANGE"
timeout 30m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
25 changes: 25 additions & 0 deletions .buildkite/run-gh200-test.sh
@@ -0,0 +1,25 @@
#!/bin/bash

# This script builds the GH200 docker image and runs offline inference inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--platform "linux/arm64" \
-t gh200-test \
--build-arg max_jobs=66 \
--build-arg nvcc_threads=2 \
--build-arg torch_cuda_arch_list="9.0+PTX" \
--build-arg vllm_fa_cmake_gpu_arches="90-real"

# Setup cleanup
remove_docker_container() { docker rm -f gh200-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image and test offline inference
docker run --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
python3 examples/offline_inference.py
'
2 changes: 1 addition & 1 deletion .buildkite/run-hpu-test.sh
@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --runtime=habana --name=hpu-test --network=host -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference.py
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference.py
7 changes: 5 additions & 2 deletions .buildkite/run-xpu-test.sh
@@ -12,5 +12,8 @@ remove_docker_container() { docker rm -f xpu-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test python3 examples/offline_inference.py
# Run the image and test offline inference/tensor parallel
docker run --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test sh -c '
python3 examples/offline_inference.py
python3 examples/offline_inference_cli.py -tp 2
'