Add BF16 support to custom PA (opendatahub-io#133)
* tightened atol for custom PA; enabled the supported head sizes and block sizes in testing

* update num_blocks and num_iters in benchmark PA to realistic settings

* move to generic b16 type

* bf16 first port

* enabled all bf16 tests, set atol for bf16

* enable custom PA for bf16 as well as block size 32 and head size 64

* fix cast to zero in custom PA reduce

* py linter fixes

* clang format fixes

* div round up clang-format

---------

Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
3 people authored Aug 14, 2024
1 parent 636ff01 commit d5bf9bc
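
The commit message above mentions tightening atol for the custom paged-attention (PA) kernel and setting a separate atol for bf16. The changed test file is not shown on this page, but a minimal sketch of a dtype-dependent tolerance check looks like the following; the tolerance values and function name are illustrative assumptions, not taken from the commit.

    # Sketch only: compare the custom PA output against a reference output
    # with a looser absolute tolerance for bf16 (fewer mantissa bits than fp16).
    # The atol/rtol values here are assumed, not the ones used in the commit.
    import torch

    def check_output(output: torch.Tensor, ref_output: torch.Tensor) -> None:
        atol = 1e-3 if output.dtype == torch.bfloat16 else 1e-4
        torch.testing.assert_close(output, ref_output, atol=atol, rtol=1e-5)

    # Example usage with dummy tensors:
    out = torch.randn(4, 8, dtype=torch.bfloat16)
    check_output(out, out.clone())
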
Showing 4 changed files with 271 additions and 157 deletions.
4 changes: 2 additions & 2 deletions benchmarks/kernels/benchmark_paged_attention.py
@@ -9,7 +9,7 @@
 from vllm._custom_C import paged_attention_custom
 from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, create_kv_caches_with_random
 
-NUM_BLOCKS = 1024
+NUM_BLOCKS = 1024 * 1024
 PARTITION_SIZE = 256
 
 
@@ -176,7 +176,7 @@ def run_cuda_benchmark(num_iters: int, profile: bool = False) -> float:
     if do_profile:
         latency = run_benchmark(num_iters=1, profile=True)
     else:
-        latency = run_benchmark(num_iters=100, profile=False)
+        latency = run_benchmark(num_iters=1000, profile=False)
     print(f"Kernel running time: {latency * 1000000:.3f} us")
 
 
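
The first hunk raises NUM_BLOCKS from 1024 to 1024 * 1024, which the commit message calls a "realistic setting": block indices then span a KV cache far larger than what stays cache-resident. A rough back-of-the-envelope sketch of the implied footprint is below; the per-head and block parameters are assumed benchmark defaults, not values read from this diff.

    # Rough sketch of the KV-cache footprint implied by NUM_BLOCKS = 1024 * 1024.
    # block_size, num_kv_heads, and head_size are assumed defaults.
    NUM_BLOCKS = 1024 * 1024
    block_size = 16        # tokens per block (assumed)
    num_kv_heads = 8       # assumed
    head_size = 128        # assumed
    dtype_bytes = 2        # bf16 / fp16

    # Each block stores both a key tile and a value tile.
    bytes_per_block = 2 * block_size * num_kv_heads * head_size * dtype_bytes
    total_gib = NUM_BLOCKS * bytes_per_block / 2**30
    print(f"{bytes_per_block} B per block, {total_gib:.0f} GiB total")  # 65536 B, 64 GiB

Under these assumptions the benchmark's cache buffers total on the order of tens of GiB, so the kernel is measured against memory traffic rather than a small, cache-warm working set.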
