Sync with [email protected] #65

Merged
merged 162 commits on Jul 8, 2024
Changes from all commits
Commits (162)
5b15bde
[Doc] Documentation on supported hardware for quantization methods (#…
mgoin Jun 21, 2024
f1e72cc
[BugFix] exclude version 1.15.0 for modelscope (#5668)
zhyncs Jun 21, 2024
7187507
[ci][test] fix ca test in main (#5746)
youkaichao Jun 21, 2024
f5dda63
[LoRA] Add support for pinning lora adapters in the LRU cache (#5603)
rohithkrn Jun 21, 2024
cf90ae0
[CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (#5616)
jikunshang Jun 22, 2024
9c62db0
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs…
DamonFool Jun 22, 2024
ff9ddbc
[Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_ba…
zifeitong Jun 22, 2024
0cbc1d2
[Bugfix] Fix pin_lora error in TPU executor (#5760)
WoosukKwon Jun 22, 2024
8c00f9c
[Docs][TPU] Add installation tip for TPU (#5761)
WoosukKwon Jun 22, 2024
832ea88
[core][distributed] improve shared memory broadcast (#5754)
youkaichao Jun 22, 2024
6c916ac
[BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744)
varun-sundar-rabindranath Jun 23, 2024
5d4d905
[Distributed] Add send and recv helpers (#5719)
andoorve Jun 23, 2024
edd5fe5
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requi…
Isotr0py Jun 24, 2024
c246212
[doc][faq] add warning to download models for every nodes (#5783)
youkaichao Jun 24, 2024
e72dc6c
[Doc] Add "Suggest edit" button to doc pages (#5789)
mgoin Jun 24, 2024
1744cc9
[Doc] Add Phi-3-medium to list of supported models (#5788)
mgoin Jun 24, 2024
ba991d5
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args…
CatherineSue Jun 24, 2024
e9de9dd
[ci] Remove aws template (#5757)
khluu Jun 25, 2024
f23871e
[Doc] Add notice about breaking changes to VLMs (#5818)
DarkLight1337 Jun 25, 2024
2ce5d66
[Speculative Decoding] Support draft model on different tensor-paral…
wooyeonlee0 Jun 25, 2024
7b99314
[Misc] Remove useless code in cpu_worker (#5824)
DamonFool Jun 25, 2024
67882db
[Core] Add fault tolerance for `RayTokenizerGroupPool` (#5748)
Yard1 Jun 25, 2024
c18ebfd
[doc][distributed] add both gloo and nccl tests (#5834)
youkaichao Jun 25, 2024
d9b34ba
[CI/Build] Add unit testing for FlexibleArgumentParser (#5798)
mgoin Jun 25, 2024
dd248f7
[Misc] Update `w4a16` `compressed-tensors` support to include `w8a16`…
dsikka Jun 25, 2024
bc34937
[Hardware][TPU] Refactor TPU backend (#5831)
WoosukKwon Jun 25, 2024
dd793d1
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improv…
mawong-amd Jun 25, 2024
f178e56
[Hardware][TPU] Raise errors for unsupported sampling params (#5850)
WoosukKwon Jun 25, 2024
c2a8ac7
[CI/Build] Add E2E tests for MLPSpeculator (#5791)
tdoublep Jun 26, 2024
8207972
[Bugfix] Fix assertion in NeuronExecutor (#5841)
aws-patlange Jun 26, 2024
dda4811
[Core] Refactor Worker and ModelRunner to consolidate control plane c…
stephanie-wang Jun 26, 2024
3aa7b6c
[Misc][Doc] Add Example of using OpenAI Server with VLM (#5832)
ywang96 Jun 26, 2024
515080a
[bugfix][distributed] fix shm broadcast when the queue size is full (…
youkaichao Jun 26, 2024
6806998
[Bugfix] Fix embedding to support 2D inputs (#5829)
WoosukKwon Jun 26, 2024
3439c5a
[Bugfix][TPU] Fix KV cache size calculation (#5860)
WoosukKwon Jun 26, 2024
6984c02
[CI/Build] Refactor image test assets (#5821)
DarkLight1337 Jun 26, 2024
5bfd1bb
[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560)
ProExpertProg Jun 26, 2024
c54269d
[Frontend] Add tokenize/detokenize endpoints (#5054)
sasha0552 Jun 26, 2024
cbc53b6
[Hardware][TPU] Support parallel sampling & Swapping (#5855)
WoosukKwon Jun 26, 2024
f5c8628
[Bugfix][TPU] Fix CPU cache allocation (#5869)
WoosukKwon Jun 26, 2024
38a1674
Support CPU inference with VSX PowerPC ISA (#5652)
ChipKerchner Jun 26, 2024
294104c
[doc] update usage of env var to avoid conflict (#5873)
youkaichao Jun 26, 2024
b9e8425
[Misc] Add example for LLaVA-NeXT (#5879)
ywang96 Jun 27, 2024
2110557
[BugFix] Fix cuda graph for MLPSpeculator (#5875)
njhill Jun 27, 2024
6eabc6c
[Doc] Add note about context length in Phi-3-Vision example (#5887)
DarkLight1337 Jun 27, 2024
d12af20
[VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted prop…
xwjiang2010 Jun 27, 2024
96354d6
[Model] Add base class for LoRA-supported models (#5018)
DarkLight1337 Jun 27, 2024
2061f0b
[Bugfix] Fix img_sizes Parsing in Phi3-Vision (#5888)
ywang96 Jun 27, 2024
e9d32d0
[CI/Build] [1/3] Reorganize entrypoints tests (#5526)
DarkLight1337 Jun 27, 2024
98cf2ed
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896)
DarkLight1337 Jun 27, 2024
3fd02bd
[doc][misc] add note for Kubernetes users (#5916)
youkaichao Jun 27, 2024
691e29e
[BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` (#5…
njhill Jun 27, 2024
365791f
[BugFix] Fix `min_tokens` behaviour for multiple eos tokens (#5849)
njhill Jun 27, 2024
736ed38
[CI/Build] Fix Args for `_get_logits_warper` in Sampler Test (#5922)
ywang96 Jun 27, 2024
79c92c7
[Model] Add Gemma 2 (#5908)
WoosukKwon Jun 27, 2024
64e8d2a
[core][misc] remove logical block (#5882)
youkaichao Jun 27, 2024
c3dde36
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (#5932)
divakar-amd Jun 27, 2024
f136da1
[Hardware][TPU] Optimize KV cache swapping (#5878)
WoosukKwon Jun 28, 2024
74d55c0
[VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast prope…
xwjiang2010 Jun 28, 2024
0d0e3a4
[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU…
Isotr0py Jun 28, 2024
5cbe8d1
[Core] Registry for processing model inputs (#5214)
DarkLight1337 Jun 28, 2024
5932634
Unmark fused_moe config json file as executable (#5960)
tlrmchlsmth Jun 28, 2024
57f09a4
[Hardware][Intel] OpenVINO vLLM backend (#5379)
ilya-lavrenov Jun 28, 2024
ec1ad00
[Bugfix] Better error message for MLPSpeculator when `num_speculative…
tdoublep Jun 28, 2024
3b752a6
[CI/Build] [2/3] Reorganize entrypoints tests (#5904)
DarkLight1337 Jun 28, 2024
b90d8cd
[Distributed] Make it clear that % should not be in tensor dict keys.…
xwjiang2010 Jun 28, 2024
b2c6202
[Spec Decode] Introduce DraftModelRunner (#5799)
comaniac Jun 28, 2024
6a2d659
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931)
tlrmchlsmth Jun 28, 2024
b185230
[ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Sim…
robertgshaw2-neuralmagic Jun 28, 2024
2cd402e
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP…
robertgshaw2-neuralmagic Jun 28, 2024
be0b3af
Support Deepseek-V2 (#4650)
zwd003 Jun 28, 2024
4bf35ed
[Bugfix] Only add `Attention.kv_scale` if kv cache quantization is en…
mgoin Jun 28, 2024
5d2a1a9
Unmark more files as executable (#5962)
tlrmchlsmth Jun 28, 2024
6a62cb8
[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadEr…
robertgshaw2-neuralmagic Jun 28, 2024
7041de4
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for …
LiuXiaoxuanPKU Jun 28, 2024
54814fd
[Bugfix][TPU] Fix TPU sampler output (#5978)
WoosukKwon Jun 29, 2024
7f83f40
[Bugfix][TPU] Fix pad slot id (#5977)
WoosukKwon Jun 29, 2024
c4bca74
[Bugfix] fix missing last itl in openai completions benchmark (#5926)
mcalman Jun 29, 2024
906a19c
[Misc] Extend vLLM Metrics logging API (#5925)
SolitaryThinker Jun 29, 2024
ba49944
[Kernel] Add punica dimensions for Granite 3b and 8b (#5930)
joerunde Jun 29, 2024
580353d
[Bugfix] Fix precisions in Gemma 1 (#5913)
WoosukKwon Jun 29, 2024
329df38
[Misc] Update Phi-3-Vision Example (#5981)
ywang96 Jun 29, 2024
51e971d
[Bugfix] Support `eos_token_id` from `config.json` (#5954)
DarkLight1337 Jun 29, 2024
7c01f70
[Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum …
Yard1 Jun 29, 2024
f7dac83
[Kernel] Raise an exception in MoE kernel if the batch size is larger…
comaniac Jun 29, 2024
8dbfcd3
[ CI/Build ] Added E2E Test For Compressed Tensors (#5839)
robertgshaw2-neuralmagic Jun 29, 2024
99397da
[CI/Build] Add TP test for vision models (#5892)
DarkLight1337 Jun 29, 2024
75aa144
[ CI/Build ] LM Eval Harness Based CI Testing (#5838)
robertgshaw2-neuralmagic Jun 29, 2024
9def106
[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix…
mawong-amd Jun 29, 2024
bcc6a09
[CI/Build] Temporarily Remove Phi3-Vision from TP Test (#5989)
ywang96 Jun 30, 2024
cff6a1f
[CI/Build] Reuse code for checking output consistency (#5988)
DarkLight1337 Jun 30, 2024
9d47f64
[CI/Build] [3/3] Reorganize entrypoints tests (#5966)
DarkLight1337 Jun 30, 2024
2be6955
[ci][distributed] fix device count call
youkaichao Jun 30, 2024
c6c240a
[Frontend]: Support base64 embedding (#5935)
llmpros Jun 30, 2024
f5e73c9
[Lora] Use safetensor keys instead of adapter_config.json to find une…
rkooo567 Jun 30, 2024
deacb7e
[ CI ] Temporarily Disable Large LM-Eval Tests (#6005)
robertgshaw2-neuralmagic Jun 30, 2024
7836fdc
[Misc] Fix `get_min_capability` (#5971)
dsikka Jun 30, 2024
af9ad46
[ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify …
robertgshaw2-neuralmagic Jun 30, 2024
614aa51
[misc][cuda] use nvml to avoid accidentally cuda initialization (#6007)
youkaichao Jul 1, 2024
80ca1e6
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into…
sroy745 Jul 1, 2024
d76084c
[ CI ] Re-enable Large Model LM Eval (#6031)
robertgshaw2-neuralmagic Jul 1, 2024
4050d64
[doc][misc] remove deprecated api server in doc (#6037)
youkaichao Jul 1, 2024
bb60326
[Misc] update benchmark backend for scalellm (#6018)
zhyncs Jul 1, 2024
8893130
[doc][misc] further lower visibility of simple api server (#6041)
youkaichao Jul 1, 2024
dec6fc6
[Bugfix] Use RayActorError for older versions of Ray in RayTokenizer…
Yard1 Jul 1, 2024
12a5995
[Bugfix] adding chunking mechanism to fused_moe to handle large input…
avshalomman Jul 1, 2024
83bdcb6
add FAQ doc under 'serving' (#5946)
llmpros Jul 1, 2024
8e0817c
[Bugfix][Doc] Fix Doc Formatting (#6048)
ywang96 Jul 1, 2024
c4059ea
[Bugfix] Add explicit `end_forward` calls to flashinfer (#6044)
Yard1 Jul 1, 2024
c87ebc3
[BugFix] Ensure worker model loop is always stopped at the right time…
njhill Jul 1, 2024
e373853
[Frontend] Relax api url assertion for openai benchmarking (#6046)
jamestwhedbee Jul 1, 2024
5460070
[Model] Changes to MLPSpeculator to support tie_weights and input_sca…
tdoublep Jul 1, 2024
3476ed0
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 defa…
alexm-neuralmagic Jul 2, 2024
2c37540
[Frontend] Add template related params to request (#5709)
danieljannai21 Jul 2, 2024
98d6682
[VLM] Remove `image_input_type` from VLM config (#5852)
xwjiang2010 Jul 2, 2024
31354e5
[Doc] Reinstate doc dependencies (#6061)
DarkLight1337 Jul 2, 2024
15aba08
[Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) (#…
sirejdua Jul 2, 2024
c5832d2
[Core] Pipeline Parallel Support (#4412)
andoorve Jul 2, 2024
4d26d80
Update conftest.py (#6076)
robertgshaw2-neuralmagic Jul 2, 2024
7c008c5
[ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970)
robertgshaw2-neuralmagic Jul 2, 2024
ee93f4f
[CORE] Quantized lm-head Framework (#4442)
Qubitium Jul 2, 2024
9d6a8da
[Model] Jamba support (#4115)
mzusman Jul 2, 2024
482045e
[hardware][misc] introduce platform abstraction (#6080)
youkaichao Jul 3, 2024
9831aec
[Core] Dynamic image size support for VLMs (#5276)
DarkLight1337 Jul 3, 2024
d18bab3
[CI] Fix base url doesn't strip "/" (#6087)
rkooo567 Jul 3, 2024
d830656
[BugFix] Avoid unnecessary Ray import warnings (#6079)
njhill Jul 3, 2024
f666207
[misc][distributed] error on invalid state (#6092)
youkaichao Jul 3, 2024
3a86b54
[VLM][Frontend] Proper Image Prompt Formatting from OpenAI API (#6091)
ywang96 Jul 3, 2024
f1c7813
[Doc] Fix Mock Import (#6094)
ywang96 Jul 3, 2024
7cd2ebb
[Bugfix] Fix `compute_logits` in Jamba (#6093)
ywang96 Jul 3, 2024
47f0954
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975)
mgoin Jul 3, 2024
3c6325f
[core][distributed] custom allreduce when pp size > 1 (#6117)
youkaichao Jul 3, 2024
d9e98f4
[vlm] Remove vision language config. (#6089)
xwjiang2010 Jul 3, 2024
62963d1
[ Misc ] Clean Up `CompressedTensorsW8A8` (#6113)
robertgshaw2-neuralmagic Jul 3, 2024
966fe72
[doc][misc] bump up py version in installation doc (#6119)
youkaichao Jul 3, 2024
3de6e6a
[core][distributed] support n layers % pp size != 0 (#6115)
youkaichao Jul 3, 2024
1dab9bc
[Bugfix] set OMP_NUM_THREADS to 1 by default for multiprocessing (#6109)
tjohnson31415 Jul 3, 2024
0ed646b
[Distributed][Core] Support Py39 and Py38 for PP (#6120)
andoorve Jul 4, 2024
3dd5070
[CI/Build] Cleanup VLM tests (#6107)
DarkLight1337 Jul 4, 2024
56b325e
[ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash at…
gshtras Jul 4, 2024
27902d4
[misc][doc] try to add warning for latest html (#5979)
youkaichao Jul 4, 2024
81d7a50
[Hardware][Intel CPU] Adding intel openmp tunings in Docker file (#6008)
zhouyuan Jul 4, 2024
69ec3ca
[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051)
LiuXiaoxuanPKU Jul 4, 2024
ae96ef8
[VLM] Calculate maximum number of multi-modal tokens by model (#6121)
DarkLight1337 Jul 4, 2024
a41357e
[VLM] Improve consistency between feature size calculation and dummy …
ywang96 Jul 5, 2024
ea4b570
[VLM] Cleanup validation and update docs (#6149)
DarkLight1337 Jul 5, 2024
0097bb1
[Bugfix] Use templated datasource in grafana.json to allow automatic …
frittentheke Jul 5, 2024
f1e15da
[Frontend] Continuous usage stats in OpenAI completion API (#5742)
jvlunteren Jul 5, 2024
e58294d
[Bugfix] Add verbose error if scipy is missing for blocksparse attent…
JGSweets Jul 5, 2024
abad574
bump version to v0.5.1 (#6157)
simon-mo Jul 5, 2024
79d406e
[Docs] Fix readthedocs for tag build (#6158)
simon-mo Jul 5, 2024
2de490d
Update wheel builds to strip debug (#6161)
simon-mo Jul 5, 2024
f025062
Fix release wheel build env var (#6162)
simon-mo Jul 5, 2024
bc96d5c
Move release wheel env var to Dockerfile instead (#6163)
simon-mo Jul 6, 2024
175c43e
[Doc] Reorganize Supported Models by Type (#6167)
ywang96 Jul 6, 2024
9389380
[Doc] Move guide for multimodal model and other improvements (#6168)
DarkLight1337 Jul 6, 2024
6206dcb
[Model] Add PaliGemma (#5189)
ywang96 Jul 7, 2024
333306a
add benchmark for fix length input and output (#5857)
haichuan1221 Jul 7, 2024
abfe705
[ Misc ] Support Fp8 via `llm-compressor` (#6110)
robertgshaw2-neuralmagic Jul 7, 2024
3b08fe2
[misc][frontend] log all available endpoints (#6195)
youkaichao Jul 7, 2024
16620f4
do not exclude `object` field in CompletionStreamResponse (#6196)
kczimm Jul 8, 2024
717f4bc
Feature/add benchmark testing (#5947)
haichuan1221 Jul 8, 2024
4 changes: 0 additions & 4 deletions .buildkite/download-images.sh
@@ -8,10 +8,6 @@ set -o pipefail
# aws s3 sync s3://air-example-data-2/vllm_opensource_llava/ images/
mkdir -p images
cd images
-wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign_pixel_values.pt
-wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign_image_features.pt
-wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/cherry_blossom_pixel_values.pt
-wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/cherry_blossom_image_features.pt
wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign.jpg
wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/cherry_blossom.jpg

11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-70B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.892
  - name: "exact_match,flexible-extract"
    value: 0.892
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test -b 32 -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.752
  - name: "exact_match,flexible-extract"
    value: 0.752
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.756
  - name: "exact_match,flexible-extract"
    value: 0.752
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.728
  - name: "exact_match,flexible-extract"
    value: 0.728
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.756
  - name: "exact_match,flexible-extract"
    value: 0.752
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic -b "auto" -l 250 -f 5 -t 8
model_name: "neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.86
  - name: "exact_match,flexible-extract"
    value: 0.86
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 -b "auto" -l 250 -f 5 -t 4
model_name: "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.624
  - name: "exact_match,flexible-extract"
    value: 0.624
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.616
  - name: "exact_match,flexible-extract"
    value: 0.632
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2-57B-A14-Instruct.yaml
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2-57B-A14B-Instruct -b "auto" -l 250 -f 5 -t 4
model_name: "Qwen/Qwen2-57B-A14B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.792
  - name: "exact_match,flexible-extract"
    value: 0.824
limit: 250
num_fewshot: 5
3 changes: 3 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-large.txt
@@ -0,0 +1,3 @@
Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
Qwen2-57B-A14-Instruct.yaml
4 changes: 4 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -0,0 +1,4 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
46 changes: 46 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -0,0 +1,46 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using huggingface transformers."
    echo "This pathway is intended to be used to create baselines for "
    echo "our automated nm-test-accuracy workflow"
    echo
    echo "usage: ${0} <options>"
    echo
    echo " -m - huggingface stub or local directory of the model"
    echo " -b - batch size to run the evaluation at"
    echo " -l - limit number of samples to run"
    echo " -f - number of fewshot samples to use"
    echo
}

while getopts "m:b:l:f:" OPT; do
  case ${OPT} in
    m )
      MODEL="$OPTARG"
      ;;
    b )
      BATCH_SIZE="$OPTARG"
      ;;
    l )
      LIMIT="$OPTARG"
      ;;
    f )
      FEWSHOT="$OPTARG"
      ;;
    \? )
      usage
      exit 1
      ;;
  esac
done

lm_eval --model hf \
  --model_args pretrained=$MODEL,parallelize=True \
  --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
  --batch_size $BATCH_SIZE
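
For reference, the comments in the config files above record how each baseline was produced; a typical invocation of this script (taken from the Meta-Llama-3-8B-Instruct config) is:

bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh \
  -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5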
51 changes: 51 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -0,0 +1,51 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for vllm.
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.2

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using vllm."
    echo "This pathway is intended to be used to create baselines for "
    echo "our automated nm-test-accuracy workflow"
    echo
    echo "usage: ${0} <options>"
    echo
    echo " -m - huggingface stub or local directory of the model"
    echo " -b - batch size to run the evaluation at"
    echo " -l - limit number of samples to run"
    echo " -f - number of fewshot samples to use"
    echo " -t - tensor parallel size to run at"
    echo
}

while getopts "m:b:l:f:t:" OPT; do
  case ${OPT} in
    m )
      MODEL="$OPTARG"
      ;;
    b )
      BATCH_SIZE="$OPTARG"
      ;;
    l )
      LIMIT="$OPTARG"
      ;;
    f )
      FEWSHOT="$OPTARG"
      ;;
    t )
      TP_SIZE="$OPTARG"
      ;;
    \? )
      usage
      exit 1
      ;;
  esac
done

lm_eval --model vllm \
  --model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,add_bos_token=true \
  --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
  --batch_size $BATCH_SIZE
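
Similarly, the vLLM baseline for an FP8 checkpoint (per the Meta-Llama-3-8B-Instruct-FP8 config above) can be reproduced with:

bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh \
  -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1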
59 changes: 59 additions & 0 deletions .buildkite/lm-eval-harness/run-tests.sh
@@ -0,0 +1,59 @@
#!/bin/bash

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using vllm and compares to "
    echo "precomputed baseline (measured by HF transformers.)"
    echo
    echo "usage: ${0} <options>"
    echo
    echo " -c - path to the test data config (e.g. configs/small-models.txt)"
    echo " -t - tensor parallel size"
    echo
}

SUCCESS=0

while getopts "c:t:" OPT; do
  case ${OPT} in
    c )
      CONFIG="$OPTARG"
      ;;
    t )
      TP_SIZE="$OPTARG"
      ;;
    \? )
      usage
      exit 1
      ;;
  esac
done

# Parse list of configs.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG

for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do
    LOCAL_SUCCESS=0

    echo "=== RUNNING MODEL: $MODEL_CONFIG WITH TP SIZE: $TP_SIZE ==="

    export LM_EVAL_TEST_DATA_FILE=$PWD/configs/${MODEL_CONFIG}
    export LM_EVAL_TP_SIZE=$TP_SIZE
    pytest -s test_lm_eval_correctness.py || LOCAL_SUCCESS=$?

    if [[ $LOCAL_SUCCESS == 0 ]]; then
        echo "=== PASSED MODEL: ${MODEL_CONFIG} ==="
    else
        echo "=== FAILED MODEL: ${MODEL_CONFIG} ==="
    fi

    SUCCESS=$((SUCCESS + LOCAL_SUCCESS))

done

if [ "${SUCCESS}" -eq "0" ]; then
exit 0
else
exit 1
fi
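
A minimal sketch of running the whole small-model suite locally (assuming the working directory is .buildkite/lm-eval-harness/, since the script resolves configs relative to $PWD):

bash run-tests.sh -c configs/models-small.txt -t 1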
55 changes: 55 additions & 0 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -0,0 +1,55 @@
"""
Run the LM eval harness on a model and compare results against an HF baseline computed offline.
Configs are found in configs/$MODEL.yaml

* export LM_EVAL_TEST_DATA_FILE=configs/Meta-Llama-3-70B-Instruct.yaml
* export LM_EVAL_TP_SIZE=4
* pytest -s test_lm_eval_correctness.py
"""

import os
from pathlib import Path

import lm_eval
import numpy
import yaml

RTOL = 0.02
TEST_DATA_FILE = os.environ.get(
    "LM_EVAL_TEST_DATA_FILE",
    ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")

TP_SIZE = os.environ.get("LM_EVAL_TP_SIZE", 1)


def launch_lm_eval(eval_config):
    model_args = f"pretrained={eval_config['model_name']}," \
                 f"tensor_parallel_size={TP_SIZE}," \
                 f"add_bos_token=true"

    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args=model_args,
        tasks=[task["name"] for task in eval_config["tasks"]],
        num_fewshot=eval_config["num_fewshot"],
        limit=eval_config["limit"],
        batch_size="auto")

    return results


def test_lm_eval_correctness():
    eval_config = yaml.safe_load(
        Path(TEST_DATA_FILE).read_text(encoding="utf-8"))

    # Launch eval requests.
    results = launch_lm_eval(eval_config)

    # Confirm scores match ground truth.
    for task in eval_config["tasks"]:
        for metric in task["metrics"]:
            ground_truth = metric["value"]
            measured_value = results["results"][task["name"]][metric["name"]]
            print(f'{task["name"]} | {metric["name"]}: '
                  f'ground_truth={ground_truth} | measured={measured_value}')
            assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
2 changes: 1 addition & 1 deletion .buildkite/release-pipeline.yaml
@@ -1,7 +1,7 @@
steps:
- block: "Build wheels"

-  - label: "Build wheel - Python {{matrix.python_version}}, CUDA {{matrix.cuda_version}}"
+  - label: "Build wheel - Python {{matrix.python_version}}, CUDA {{matrix.cuda_version}}"
agents:
queue: cpu_queue
commands:
8 changes: 5 additions & 3 deletions .buildkite/run-cpu-test.sh
@@ -12,8 +12,10 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image
-docker run -itd -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=48-95 --cpuset-mems=1 --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --name cpu-test cpu-test
-docker run -itd -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=48-95 --cpuset-mems=1 --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --name cpu-test-avx2 cpu-test-avx2
+docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=48-95 \
+  --cpuset-mems=1 --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --name cpu-test cpu-test
+docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=48-95 \
+  --cpuset-mems=1 --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --name cpu-test-avx2 cpu-test-avx2

# offline inference
docker exec cpu-test bash -c "python3 examples/offline_inference.py"
@@ -23,4 +25,4 @@ docker exec cpu-test-avx2 bash -c "python3 examples/offline_inference.py"
docker exec cpu-test bash -c "cd tests;
pip install pytest Pillow protobuf
cd ../
-pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py --ignore=tests/models/test_registry.py"
+pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py --ignore=tests/models/test_registry.py --ignore=tests/models/test_jamba.py" # Mamba on CPU is not supported
14 changes: 14 additions & 0 deletions .buildkite/run-openvino-test.sh
@@ -0,0 +1,14 @@
# This script builds the OpenVINO docker image and runs offline inference inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
docker build -t openvino-test -f Dockerfile.openvino .

# Setup cleanup
remove_docker_container() { docker rm -f openvino-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/vllm/examples/offline_inference.py