Sync with upstream @v0.6.6.post1 #279

Merged · 95 commits · Jan 8, 2025

Commits
c77eb8a
[Bugfix] Set temperature=0.7 in test_guided_choice_chat (#11264)
mgoin Dec 18, 2024
bf8717e
[V1] Prefix caching for vision language models (#11187)
comaniac Dec 18, 2024
866fa45
[Bugfix] Restore support for larger block sizes (#11259)
kzawora-intel Dec 18, 2024
8b79f9e
[Bugfix] Fix guided decoding with tokenizer mode mistral (#11046)
wallashss Dec 18, 2024
f04e407
[MISC][XPU]update ipex link for CI fix (#11278)
yma11 Dec 18, 2024
60508ff
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
dsikka Dec 18, 2024
996aa70
[Bugfix] Fix broken phi3-v mm_processor_kwargs tests (#11263)
Isotr0py Dec 18, 2024
362cff1
[CI][Misc] Remove Github Action Release Workflow (#11274)
simon-mo Dec 18, 2024
f954fe0
[FIX] update openai version (#11287)
jikunshang Dec 18, 2024
ca5f54a
[Bugfix] fix minicpmv test (#11304)
joerunde Dec 18, 2024
fdea8ec
[V1] VLM - enable processor cache by default (#11305)
alexm-neuralmagic Dec 18, 2024
5a9da2e
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2…
tlrmchlsmth Dec 19, 2024
17ca964
[Model] IBM Granite 3.1 (#11307)
tjohnson31415 Dec 19, 2024
a30482f
[CI] Expand test_guided_generate to test all backends (#11313)
mgoin Dec 19, 2024
c6b0a7d
[V1] Simplify prefix caching logic by removing `num_evictable_compute…
heheda12345 Dec 19, 2024
6142ef0
[VLM] Merged multimodal processor for Qwen2-Audio (#11303)
DarkLight1337 Dec 19, 2024
8936316
[Kernel] Refactor Cutlass c3x (#10049)
varun-sundar-rabindranath Dec 19, 2024
f26c4ae
[Misc] Optimize ray worker initialization time (#11275)
ruisearch42 Dec 19, 2024
9835673
[misc] benchmark_throughput : Add LoRA (#11267)
varun-sundar-rabindranath Dec 19, 2024
5aef498
[Feature] Add load generation config from model (#11164)
liuyanyi Dec 19, 2024
a0f7d53
[Bugfix] Cleanup Pixtral HF code (#11333)
DarkLight1337 Dec 19, 2024
6c7f881
[Model] Add JambaForSequenceClassification model (#10860)
yecohn Dec 19, 2024
7379b3d
[V1] Fix multimodal profiling for `Molmo` (#11325)
ywang96 Dec 19, 2024
e24113a
[Model] Refactor Qwen2-VL to use merged multimodal processor (#11258)
Isotr0py Dec 19, 2024
cdf22af
[Misc] Clean up and consolidate LRUCache (#11339)
DarkLight1337 Dec 19, 2024
276738c
[Bugfix] Fix broken CPU compressed-tensors test (#11338)
Isotr0py Dec 19, 2024
e461c26
[Misc] Remove unused vllm/block.py (#11336)
Ghjk94522 Dec 19, 2024
a985f7a
[CI] Adding CPU docker pipeline (#11261)
zhouyuan Dec 19, 2024
48edab8
[Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10…
Akashcodes732 Dec 20, 2024
7801f56
[ci][gh200] dockerfile clean up (#11351)
youkaichao Dec 20, 2024
b880ffb
[Misc] Add tqdm progress bar during graph capture (#11349)
mgoin Dec 20, 2024
86c2d8f
[Bugfix] Fix spec decoding when seed is none in a batch (#10863)
wallashss Dec 20, 2024
c954f21
[misc] add early error message for custom ops (#11355)
youkaichao Dec 20, 2024
1ecc645
[doc] backward compatibility for 0.6.4 (#11359)
youkaichao Dec 20, 2024
04139ad
[V1] Fix profiling for models with merged input processor (#11370)
ywang96 Dec 20, 2024
7c7aa37
[CI/Build] fix pre-compiled wheel install for exact tag (#11373)
dtrifiro Dec 20, 2024
995f562
[Core] Loading model from S3 using RunAI Model Streamer as optional l…
omer-dayan Dec 20, 2024
d573aea
[Bugfix] Don't log OpenAI field aliases as ignored (#11378)
mgoin Dec 20, 2024
5d2248d
[doc] explain nccl requirements for rlhf (#11381)
youkaichao Dec 20, 2024
47a0b61
Add ray[default] to wget to run distributed inference out of box (#11…
Jeffwan Dec 20, 2024
dd2b563
[V1][Bugfix] Skip hashing empty or None mm_data (#11386)
WoosukKwon Dec 21, 2024
51ff216
[Bugfix] update should_ignore_layer (#11354)
horheynm Dec 21, 2024
584f0ae
[V1] Make AsyncLLMEngine v1-v0 opaque (#11383)
rickyyx Dec 21, 2024
c2d1b07
[Bugfix] Fix issues for `Pixtral-Large-Instruct-2411` (#11393)
ywang96 Dec 21, 2024
29c7489
[CI] Fix flaky entrypoint tests (#11403)
ywang96 Dec 22, 2024
4a91397
[cd][release] add pypi index for every commit and nightly build (#11404)
youkaichao Dec 22, 2024
72d9c31
[cd][release] fix race conditions (#11407)
youkaichao Dec 22, 2024
f1d1bf6
[Bugfix] Fix fully sharded LoRAs with Mixtral (#11390)
n1hility Dec 22, 2024
048fc57
[CI] Unboock H100 Benchmark (#11419)
simon-mo Dec 22, 2024
f30581c
[misc][perf] remove old code (#11425)
youkaichao Dec 23, 2024
e51719a
mypy type checking for vllm/worker (#11418)
lucas-tucker Dec 23, 2024
5bfb30a
[Bugfix] Fix CFGGuide and use outlines for grammars that can't conver…
mgoin Dec 23, 2024
2e72668
[Bugfix] torch nightly version in ROCm installation guide (#11423)
terrytangyuan Dec 23, 2024
b866cdb
[Misc] Add assertion and helpful message for marlin24 compressed mode…
dsikka Dec 23, 2024
8cef6e0
[Misc] add w8a8 asym models (#11075)
dsikka Dec 23, 2024
63afbe9
[CI] Expand OpenAI test_chat.py guided decoding tests (#11048)
mgoin Dec 23, 2024
60fb4f3
[Bugfix] Add kv cache scales to gemma2.py (#11269)
mgoin Dec 23, 2024
94d545a
[Doc] Fix typo in the help message of '--guided-decoding-backend' (#1…
yansh97 Dec 23, 2024
32aa205
[Docs] Convert rST to MyST (Markdown) (#11145)
rafvasq Dec 23, 2024
a491d6f
[V1] TP Ray executor (#11107)
ruisearch42 Dec 23, 2024
4f074fb
[Misc]Suppress irrelevant exception stack trace information when CUDA…
shiquan1988 Dec 24, 2024
9edca6b
[Frontend] Online Pooling API (#11457)
DarkLight1337 Dec 24, 2024
b1b1038
[Bugfix] Fix Qwen2-VL LoRA weight loading (#11430)
jeejeelee Dec 24, 2024
7a5286c
[Bugfix][Hardware][CPU] Fix CPU `input_positions` creation for text-o…
Isotr0py Dec 24, 2024
461cde2
[OpenVINO] Fixed installation conflicts (#11458)
ilya-lavrenov Dec 24, 2024
5c79632
[attn][tiny fix] fix attn backend in MultiHeadAttention (#11463)
MengqingCao Dec 24, 2024
196c34b
[Misc] Move weights mapper (#11443)
jeejeelee Dec 24, 2024
409475a
[Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 (#11435)
terrytangyuan Dec 24, 2024
3f3e92e
[Model] Automatic conversion of classification and reward models (#11…
DarkLight1337 Dec 24, 2024
9832e55
[V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor (#1…
ruisearch42 Dec 25, 2024
fc60166
[Misc] Update disaggregation benchmark scripts and test logs (#11456)
Jeffwan Dec 25, 2024
b689ada
[Frontend] Enable decord to load video from base64 (#11492)
DarkLight1337 Dec 25, 2024
6ad909f
[Doc] Improve GitHub links (#11491)
DarkLight1337 Dec 25, 2024
51a624b
[Misc] Move some multimodal utils to modality-specific modules (#11494)
DarkLight1337 Dec 26, 2024
dbeac95
Mypy checking for vllm/compilation (#11496)
lucas-tucker Dec 26, 2024
aa25985
[Misc][LoRA] Fix LoRA weight mapper (#11495)
jeejeelee Dec 26, 2024
7492a36
[Doc] Add `QVQ` and `QwQ` to the list of supported models (#11509)
ywang96 Dec 26, 2024
dcb1a94
[V1] Adding min tokens/repetition/presence/frequence penalties to V1 …
sroy745 Dec 26, 2024
f57ee56
[Model] Modify MolmoForCausalLM MLP (#11510)
jeejeelee Dec 26, 2024
eec906d
[Misc] Add placeholder module (#11501)
DarkLight1337 Dec 26, 2024
b85a977
[Doc] Add video example to openai client for multimodal (#11521)
Isotr0py Dec 26, 2024
720b10f
[1/N] API Server (Remove Proxy) (#11529)
robertgshaw2-neuralmagic Dec 26, 2024
2072924
[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quanti…
mgoin Dec 26, 2024
55fb97f
[2/N] API Server: Avoid ulimit footgun (#11530)
robertgshaw2-neuralmagic Dec 26, 2024
f49777b
Deepseek v3 (#11502)
simon-mo Dec 27, 2024
82d24f7
[Docs] Document Deepseek V3 support (#11535)
simon-mo Dec 27, 2024
0c0c201
Update openai_compatible_server.md (#11536)
robertgshaw2-neuralmagic Dec 27, 2024
371d04d
[V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling (#11394)
WoosukKwon Dec 27, 2024
81b979f
[V1] Fix yapf (#11538)
WoosukKwon Dec 27, 2024
46d4359
[CI] Fix broken CI (#11543)
robertgshaw2-neuralmagic Dec 27, 2024
eb881ed
[misc] fix typing (#11540)
youkaichao Dec 27, 2024
1b875a0
[V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly (…
robertgshaw2-neuralmagic Dec 27, 2024
2339d59
[BugFix] Fix quantization for all other methods (#11547)
robertgshaw2-neuralmagic Dec 27, 2024
39bd136
Sync with upstream @v0.6.6.post1
dtrifiro Jan 7, 2025
8171867
Dockerfile.rocm.ubi: bump torch/torchvision to 20241122 dev build
dtrifiro Jan 7, 2025
24 changes: 24 additions & 0 deletions .buildkite/generate_index.py
@@ -0,0 +1,24 @@
import argparse
import os

template = """<!DOCTYPE html>
<html>
<body>
<h1>Links for vLLM</h1>
<a href="../{wheel_html_escaped}">{wheel}</a><br/>
</body>
</html>
"""

parser = argparse.ArgumentParser()
parser.add_argument("--wheel", help="The wheel path.", required=True)
args = parser.parse_args()

filename = os.path.basename(args.wheel)

with open("index.html", "w") as f:
    print(f"Generated index.html for {args.wheel}")
    # cloudfront requires escaping the '+' character
    f.write(
        template.format(wheel=filename,
                        wheel_html_escaped=filename.replace("+", "%2B")))
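A quick sanity check of what the new script emits — a minimal sketch; the wheel filename below is made up for illustration and is not from this PR:

# Hypothetical invocation:
#   python3 .buildkite/generate_index.py --wheel dist/vllm-0.6.6.post1+cu124-cp38-abi3-linux_x86_64.whl
# The href in the generated index.html escapes '+', which the script's
# comment notes CloudFront requires.
filename = "vllm-0.6.6.post1+cu124-cp38-abi3-linux_x86_64.whl"
escaped = filename.replace("+", "%2B")
assert escaped == "vllm-0.6.6.post1%2Bcu124-cp38-abi3-linux_x86_64.whl"
print(f'<a href="../{escaped}">{filename}</a><br/>')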
6 changes: 3 additions & 3 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -65,9 +65,9 @@ steps:
- VLLM_USAGE_SOURCE
- HF_TOKEN

- block: "Run H100 Benchmark"
key: block-h100
depends_on: ~
#- block: "Run H100 Benchmark"
#key: block-h100
#depends_on: ~

- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
15 changes: 15 additions & 0 deletions .buildkite/release-pipeline.yaml
@@ -55,3 +55,18 @@ steps:
password-env: DOCKERHUB_TOKEN
env:
DOCKER_BUILDKIT: "1"

- block: "Build CPU release image"
key: block-cpu-release-image-build
depends_on: ~

- label: "Build and publish CPU release image"
depends_on: block-cpu-release-image-build
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION --progress plain -f Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION"
env:
DOCKER_BUILDKIT: "1"
3 changes: 3 additions & 0 deletions .buildkite/run-gh200-test.sh
@@ -4,6 +4,9 @@
# It serves a sanity check for compilation and basic model usage.
set -ex

# Skip the new torch installation during build since we are using the specified version for arm64 in the Dockerfile
python3 use_existing_torch.py

# Try building the docker image
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
6 changes: 5 additions & 1 deletion .buildkite/test-pipeline.yaml
@@ -224,8 +224,12 @@ steps:
  mirror_hardwares: [amd]
  source_file_dependencies:
  - vllm/model_executor/layers
+ - vllm/model_executor/guided_decoding
  - tests/test_logits_processor
- command: pytest -v -s test_logits_processor.py
+ - tests/model_executor/test_guided_processors
+ commands:
+   - pytest -v -s test_logits_processor.py
+   - pytest -v -s model_executor/test_guided_processors.py

- label: Speculative decoding tests # 30min
source_file_dependencies:
30 changes: 29 additions & 1 deletion .buildkite/upload-wheels.sh
@@ -23,6 +23,8 @@ wheel="$new_wheel"
version=$(unzip -p "$wheel" '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
echo "Version: $version"

normal_wheel="$wheel" # Save the original wheel filename

# If the version contains "dev", rename it to v1.0.0.dev for consistency
if [[ $version == *dev* ]]; then
suffix="${version##*.}"
@@ -32,12 +34,38 @@ if [[ $version == *dev* ]]; then
new_version="1.0.0.dev"
fi
  new_wheel="${wheel/$version/$new_version}"
- mv -- "$wheel" "$new_wheel"
+ # use cp to keep both files in the artifacts directory
+ cp -- "$wheel" "$new_wheel"
  wheel="$new_wheel"
version="$new_version"
fi

# generate index for this commit
python3 .buildkite/generate_index.py --wheel "$normal_wheel"

# Upload the wheels to S3
aws s3 cp "$wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
aws s3 cp "$normal_wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"

if [[ $normal_wheel == *"cu118"* ]]; then
    # if $normal_wheel matches cu118, do not upload the index.html
    echo "Skipping index files for cu118 wheels"
else
    # only upload index.html for cu12 wheels (default wheels)
    aws s3 cp index.html "s3://vllm-wheels/$BUILDKITE_COMMIT/vllm/index.html"
    aws s3 cp "s3://vllm-wheels/nightly/index.html" "s3://vllm-wheels/$BUILDKITE_COMMIT/index.html"
fi

# generate index for nightly
aws s3 cp "$wheel" "s3://vllm-wheels/nightly/"
aws s3 cp "$normal_wheel" "s3://vllm-wheels/nightly/"

if [[ $normal_wheel == *"cu118"* ]]; then
    # if $normal_wheel matches cu118, do not upload the index.html
    echo "Skipping index files for cu118 wheels"
else
    # only upload index.html for cu12 wheels (default wheels)
    aws s3 cp index.html "s3://vllm-wheels/nightly/vllm/index.html"
fi

aws s3 cp "$wheel" "s3://vllm-wheels/$version/"
123 changes: 62 additions & 61 deletions .github/workflows/publish.yml
@@ -39,67 +39,68 @@ jobs:
const script = require('.github/workflows/scripts/create_release.js')
await script(github, context, core)

wheel:
name: Build Wheel
runs-on: ${{ matrix.os }}
needs: release

strategy:
fail-fast: false
matrix:
os: ['ubuntu-20.04']
python-version: ['3.9', '3.10', '3.11', '3.12']
pytorch-version: ['2.4.0'] # Must be the most recent version that meets requirements-cuda.txt.
cuda-version: ['11.8', '12.1']

steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

- name: Setup ccache
uses: hendrikmuhs/ccache-action@ed74d11c0b343532753ecead8a951bb09bb34bc9 # v1.2.14
with:
create-symlink: true
key: ${{ github.job }}-${{ matrix.python-version }}-${{ matrix.cuda-version }}

- name: Set up Linux Env
if: ${{ runner.os == 'Linux' }}
run: |
bash -x .github/workflows/scripts/env.sh

- name: Set up Python
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: ${{ matrix.python-version }}

- name: Install CUDA ${{ matrix.cuda-version }}
run: |
bash -x .github/workflows/scripts/cuda-install.sh ${{ matrix.cuda-version }} ${{ matrix.os }}

- name: Install PyTorch ${{ matrix.pytorch-version }} with CUDA ${{ matrix.cuda-version }}
run: |
bash -x .github/workflows/scripts/pytorch-install.sh ${{ matrix.python-version }} ${{ matrix.pytorch-version }} ${{ matrix.cuda-version }}

- name: Build wheel
shell: bash
env:
CMAKE_BUILD_TYPE: Release # do not compile with debug symbol to reduce wheel size
run: |
bash -x .github/workflows/scripts/build.sh ${{ matrix.python-version }} ${{ matrix.cuda-version }}
wheel_name=$(find dist -name "*whl" -print0 | xargs -0 -n 1 basename)
asset_name=${wheel_name//"linux"/"manylinux1"}
echo "wheel_name=${wheel_name}" >> "$GITHUB_ENV"
echo "asset_name=${asset_name}" >> "$GITHUB_ENV"

- name: Upload Release Asset
uses: actions/upload-release-asset@e8f9f06c4b078e705bd2ea027f0926603fc9b4d5 # v1.0.2
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
upload_url: ${{ needs.release.outputs.upload_url }}
asset_path: ./dist/${{ env.wheel_name }}
asset_name: ${{ env.asset_name }}
asset_content_type: application/*
# NOTE(simon): No longer build wheel using Github Actions. See buildkite's release workflow.
# wheel:
# name: Build Wheel
# runs-on: ${{ matrix.os }}
# needs: release

# strategy:
# fail-fast: false
# matrix:
# os: ['ubuntu-20.04']
# python-version: ['3.9', '3.10', '3.11', '3.12']
# pytorch-version: ['2.4.0'] # Must be the most recent version that meets requirements-cuda.txt.
# cuda-version: ['11.8', '12.1']

# steps:
# - name: Checkout
# uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

# - name: Setup ccache
# uses: hendrikmuhs/ccache-action@ed74d11c0b343532753ecead8a951bb09bb34bc9 # v1.2.14
# with:
# create-symlink: true
# key: ${{ github.job }}-${{ matrix.python-version }}-${{ matrix.cuda-version }}

# - name: Set up Linux Env
# if: ${{ runner.os == 'Linux' }}
# run: |
# bash -x .github/workflows/scripts/env.sh

# - name: Set up Python
# uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
# with:
# python-version: ${{ matrix.python-version }}

# - name: Install CUDA ${{ matrix.cuda-version }}
# run: |
# bash -x .github/workflows/scripts/cuda-install.sh ${{ matrix.cuda-version }} ${{ matrix.os }}

# - name: Install PyTorch ${{ matrix.pytorch-version }} with CUDA ${{ matrix.cuda-version }}
# run: |
# bash -x .github/workflows/scripts/pytorch-install.sh ${{ matrix.python-version }} ${{ matrix.pytorch-version }} ${{ matrix.cuda-version }}

# - name: Build wheel
# shell: bash
# env:
# CMAKE_BUILD_TYPE: Release # do not compile with debug symbol to reduce wheel size
# run: |
# bash -x .github/workflows/scripts/build.sh ${{ matrix.python-version }} ${{ matrix.cuda-version }}
# wheel_name=$(find dist -name "*whl" -print0 | xargs -0 -n 1 basename)
# asset_name=${wheel_name//"linux"/"manylinux1"}
# echo "wheel_name=${wheel_name}" >> "$GITHUB_ENV"
# echo "asset_name=${asset_name}" >> "$GITHUB_ENV"

# - name: Upload Release Asset
# uses: actions/upload-release-asset@e8f9f06c4b078e705bd2ea027f0926603fc9b4d5 # v1.0.2
# env:
# GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
# with:
# upload_url: ${{ needs.release.outputs.upload_url }}
# asset_path: ./dist/${{ env.wheel_name }}
# asset_name: ${{ env.asset_name }}
# asset_content_type: application/*

# (Danielkinz): This last step will publish the .whl to pypi. Warning: untested
# - name: Publish package
2 changes: 2 additions & 0 deletions .gitignore
@@ -81,6 +81,8 @@ instance/
docs/_build/
docs/source/getting_started/examples/*.rst
!**/*.template.rst
docs/source/getting_started/examples/*.md
!**/*.template.md

# PyBuilder
.pybuilder/
39 changes: 33 additions & 6 deletions CMakeLists.txt
@@ -206,7 +206,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
SET(CUTLASS_ENABLE_HEADERS_ONLY ON CACHE BOOL "Enable only the header library")

# Set CUTLASS_REVISION manually -- its revision detection doesn't work in this case.
set(CUTLASS_REVISION "v3.5.1" CACHE STRING "CUTLASS revision to use")
set(CUTLASS_REVISION "v3.6.0" CACHE STRING "CUTLASS revision to use")

# Use the specified CUTLASS source directory for compilation if VLLM_CUTLASS_SRC_DIR is provided
if (DEFINED ENV{VLLM_CUTLASS_SRC_DIR})
@@ -223,13 +223,13 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
FetchContent_Declare(
cutlass
GIT_REPOSITORY https://github.com/nvidia/cutlass.git
- GIT_TAG v3.5.1
+ GIT_TAG 8aa95dbb888be6d81c6fbf7169718c5244b53227
GIT_PROGRESS TRUE

# Speed up CUTLASS download by retrieving only the specified GIT_TAG instead of the history.
# Important: If GIT_SHALLOW is enabled then GIT_TAG works only with branch names and tags.
# So if the GIT_TAG above is updated to a commit hash, GIT_SHALLOW must be set to FALSE
- GIT_SHALLOW TRUE
+ GIT_SHALLOW FALSE
)
endif()
FetchContent_MakeAvailable(cutlass)
@@ -241,7 +241,10 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
"csrc/quantization/awq/gemm_kernels.cu"
"csrc/custom_all_reduce.cu"
"csrc/permute_cols.cu"
"csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu")
"csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu"
"csrc/sparse/cutlass/sparse_scaled_mm_entry.cu"
"csrc/sparse/cutlass/sparse_compressor_entry.cu"
"csrc/cutlass_extensions/common.cpp")

set_gencode_flags_for_srcs(
SRCS "${VLLM_EXT_SRC}"
@@ -270,7 +273,6 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
" in CUDA target architectures")
endif()

- #
# The cutlass_scaled_mm kernels for Hopper (c3x, i.e. CUTLASS 3.x) require
# CUDA 12.0 or later (and only work on Hopper, 9.0/9.0a for now).
cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0;9.0a" "${CUDA_ARCHS}")
@@ -323,6 +325,31 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif()
endif()

#
# 2:4 Sparse Kernels

# The 2:4 sparse kernels cutlass_scaled_sparse_mm and cutlass_compressor
# require CUDA 12.2 or later (and only work on Hopper, 9.0/9.0a for now).
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
  set(SRCS "csrc/sparse/cutlass/sparse_compressor_c3x.cu"
           "csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")
  set_gencode_flags_for_srcs(
    SRCS "${SRCS}"
    CUDA_ARCHS "${SCALED_MM_3X_ARCHS}")
  list(APPEND VLLM_EXT_SRC "${SRCS}")
  list(APPEND VLLM_GPU_FLAGS "-DENABLE_SPARSE_SCALED_MM_C3X=1")
  message(STATUS "Building sparse_scaled_mm_c3x for archs: ${SCALED_MM_3X_ARCHS}")
else()
  if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
    message(STATUS "Not building sparse_scaled_mm_c3x kernels as CUDA Compiler version is "
                   "not >= 12.2, we recommend upgrading to CUDA 12.2 or later "
                   "if you intend on running FP8 sparse quantized models on Hopper.")
  else()
    message(STATUS "Not building sparse_scaled_mm_c3x as no compatible archs found "
                   "in CUDA target architectures")
  endif()
endif()


#
# Machete kernels
@@ -404,7 +431,7 @@ define_gpu_extension_target(
SOURCES ${VLLM_EXT_SRC}
COMPILE_FLAGS ${VLLM_GPU_FLAGS}
ARCHITECTURES ${VLLM_GPU_ARCHES}
- INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR}
+ INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR};${CUTLASS_TOOLS_UTIL_INCLUDE_DIR}
USE_SABI 3
WITH_SOABI)

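For readers skimming the CMakeLists.txt hunk above: the new 2:4 sparse CUTLASS kernels are compiled only when the CUDA compiler is strictly newer than 12.2 and a Hopper architecture (9.0/9.0a) survives the SCALED_MM_3X_ARCHS intersection. A rough Python rendering of that gate, with function and argument names of our own choosing:

def should_build_sparse_c3x(cuda_version: tuple, scaled_mm_3x_archs: list) -> bool:
    # Mirrors: if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
    return cuda_version > (12, 2) and bool(scaled_mm_3x_archs)

assert should_build_sparse_c3x((12, 4), ["9.0a"]) is True
assert should_build_sparse_c3x((12, 2), ["9.0a"]) is False  # VERSION_GREATER is strict
assert should_build_sparse_c3x((12, 4), []) is False        # no compatible archs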