Changelog

0.2.1

What's Changed

  • misc: addressing the package renaming issues by @yzh119 in #770
  • feat: support deepseek prefill attention shape by @yzh119 in #765
  • refactor: change the structure of attention updater by @yzh119 in #772
  • hotfix: follow up of #772 by @yzh119 in #773
  • bugfix: Ensure Loop Termination by Enforcing IEEE-754 Compliance in Sampling Kernels by @yzh119 in #774
  • bugfix: fix the JIT warmup arguments in unittests by @yzh119 in #775
  • ci: change whl folder to flashinfer-python by @abcdabcd987 in #779
  • perf: refactor fa2 prefill template by @yzh119 in #776
  • feat: Separate QK/VO head dim dispatch for sm90 AOT by @abcdabcd987 in #778
  • bugfix: fix batch prefill attention kernel unittests by @yzh119 in #781
  • misc: remove head dimension 64 from AOT by @yzh119 in #782
  • misc: allow head_dim=64 for sm90 AOT by @abcdabcd987 in #783
  • bugfix: drop CTA_TILE_Q=32 by @abcdabcd987 in #785
  • refactor: make group_size a part of params by @yzh119 in #786
  • bugfix: MLA decode should multiply sm_scale by math::log2e by @tsu-bin in #787
  • fix rope logic in mla decoding by @zhyncs in #793
  • Fix arguments of plan for split QK/VO head dims by @abmfy in #795
  • test: add unittest comparing deepseek prefill fa2 & 3 implementation by @yzh119 in #797
  • bugfix: fix aot build not compatible with cmake command by @tsu-bin in #796
  • Fix the type annotation of q_dtype and kv_dtype on ragged prefill by @nandor in #798
  • feat: support f32 attention output in FA2 template by @yzh119 in #799
  • feat: apply sm_scale at logits instead of q in FA2 template by @yzh119 in #801
  • bugfix: mla decode failed under cuda graph mode, and update test case by @tsu-bin in #803
  • perf: memory efficient deepseek mla fused page-attention kernel by @yzh119 in #804
  • bugfix: mla page-attention kernel for different page sizes by @yzh119 in #810
  • doc: add documentation to new MLA interface by @yzh119 in #811
  • feat: unlocking MLA for A100 by @yzh119 in #812
  • feat: cudagraph-compatible MLA API by @yzh119 in #813
  • feat: unlock MLA attention for sm89 (L40/L40s/4090) by @yzh119 in #814
  • misc: fix sphinx by @abcdabcd987 in #815
  • bugfix: fix the behavior of mla plan function when provided with host tensors by @yzh119 in #816
  • doc: improve mla related documentation by @yzh119 in #818

New Contributors

  • @abmfy made their first contribution in #795

0.2.0.post2

What's Changed

  • ci: fix the update_whl_index script to recognize version numbers with "post" and add torch 2.5 by @yzh119 in #694
  • bugfix: casting int array to int32 for rope input arguments by @yzh119 in #697
  • bugfix: only use sm90 group gemm when torch cuda >= 12.3 by @yzh119 in #699
  • misc: remove release-please workflow by @yzh119 in #705
  • Customizable SM90 prefill kernels. by @hyhieu in #704
  • hotfix: revert torch.library register by @yzh119 in #709
  • Improve compatibility with pytorch 2.5 by @zifeitong in #711
  • misc: add bibtex reference by @yzh119 in #712
  • sampling: simplify min-p sampling by @yzh119 in #713
  • perf: fix the iteration bound of SWA in FA2 prefill template by @yzh119 in #714
  • bugfix: fix min-p AOT compilation in #713 by @yzh119 in #717
  • Triton implementation of silu_and_mul by @nandor in #716
  • bugfix: FusedAddRMSNorm kernels might require more than 48KB shared memory when d is large. by @bobboli in #718
  • bugfix: Choose sm90 kernels only for Hopper GPUs. by @bobboli in #719
  • Finer-grained control over fp16/fp8 builds by @nandor in #722
  • Align KV chunk size binary search with actual KV chunk splitting. by @timzsu in #728
  • ci: rename python package name to flashinfer-python by @yzh119 in #729
  • Add a note about int32/int64 datatypes to the kv_layout tutorial by @fergusfinn in #737
  • fix return type of cuBLAS by @zhyncs in #749
  • [Refactor] Unify JIT/Customization/AOT mode by @yzh119 in #748
  • Move allocations out of torch ops by @nandor in #740
  • [Lint] Fix some linting issues and provide automatic format check script by @LeiWang1999 in #743
  • Filter out unsupported head dim for sm90 by @abcdabcd987 in #751
  • bugfix: various AOT issues by @abcdabcd987 in #752
  • [bugfix] Fix cpp tests/benchmarks by @yzh119 in #753
  • fix pin memory device by @youkaichao in #755
  • Add dev container for easier development by @ByronHsu in #680
  • hotfix: bugfix to #756 by @yzh119 in #757
  • Change apply_rope_with_cos_sin_cache to accept cos_sin_cache by @ByronHsu in #754
  • fix: match statement not supported in Python 3.8 by @xslingcn in #759
  • bugfix: use actual sm count for num_sm90_ctas by @LLLLKKKK in #762
  • bugfix: Fix block-sparse attention API by @yzh119 in #767
  • Version bump: v0.2.0.post2 by @yzh119 in #768

New Contributors

  • @hyhieu made their first contribution in #704
  • @zifeitong made their first contribution in #711
  • @bobboli made their first contribution in #718
  • @timzsu made their first contribution in #728
  • @fergusfinn made their first contribution in #737
  • @LeiWang1999 made their first contribution in #743
  • @youkaichao made their first contribution in #755
  • @LLLLKKKK made their first contribution in #762

0.2.0.post1 (2024-12-22)

Bug Fixes

  • bug fix on determine_attention_backend condition (#688) (bcf7a3e)
  • accelerate plan speed of fa3 template (#690) (db8f04d)

0.2.0 (2024-12-17)

Release Blog

FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving

Features

  • add rotary_dim argument to rope APIs for partial apply rope (#599) (eb9bc71)
  • add a use_softmax field in variant class (#533) (d81af97)
  • add an option non_blocking to plan function (#622) (560af6f)
  • add gemma_rmsnorm and gemma_fused_add_rmsnorm (#477) (1a6b17e)
  • add group size 3 to GQA decode dispatch (#558) (6227562)
  • add JIT compilation support for FA3 templates (#672) (d4e8d79)
  • allow the cascade kernels to be executed using varying sequence lengths (#627) (92ac440)
  • CUDAGraph compatibility of multi-level cascade inference APIs (#586) (2332e8a)
  • fix the maximal grid dimension in prefill planning with CUDA graphs (#639) (86ca89a)
  • improve the precision of the FusedAddRMSNormKernel function (#587) (c7dc921)
  • JIT compilation (#507) (3613a5b)
  • modify group-gemm stage number (#497) (52dab1d)
  • non-contiguous query with paged kv cache (#553) (89f2c4a)
  • pass a dynamic token count to the cascade kernels (#635) (5fe9f7d)
  • simplify prefill JIT compilation (#605) (fe4f898)
  • specify gemm backend (#648) (0cc1a51)
  • support cached cos/sin in rope APIs (#585) (83e541d)
  • support huggingface transformer style rope interface (#568) (4f40420)
  • support sm90 cutlass group gemm (#509) (794bdda)
  • torch custom_op fix for rope (#569) (3e104bc)
  • torch custom_op support: norm (#552) (f6e0010)
  • torch.compile and custom_op support (#554) (9bf916f)
  • warmup for jit kernel tests (#629) (8f5f349)

Bug Fixes

Performance Improvements

  • accelerate JIT compilation speed (#618) (eaf73fd)
  • Dense and sparse customizable flashattention-3 template (#667) (51236c9)
  • fix prefill kernel performance degradation (step 1) (#602) (595cf60)
  • fix the performance issue of append_paged_kv_cache (#588) (e15f7c9)
  • improve parallelism in RoPE with pos_ids (#609) (ff05155)
  • improve plan performance by using non-blocking memcpy (#547) (41ebe6d)
  • reduce the read and write of shared memory in the FusedAddRMSNormKernel (#592) (2043ca2)
  • reduce total_num_tiles_q by one (#644) (553ace5)
  • remove unnecessary contiguous operation in block sparse attention (#561) (7a7ad46)
  • speedup jit compilation of prefill attention kernels (#632) (a059586)
  • use cuda-core implementation for io-bound block-sparse attention (#560) (3fbf028)

0.1.6 (2024-08-27)

SM75 Support

Starting from 0.1.6, our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000, and RTX 2080).

API Changes

plan/run

Starting from 0.1.6, the begin_forward/forward/end_forward APIs are replaced with the new plan/run APIs.

  • forward is renamed to run, which is more precise and consistent with the naming convention of cutlass's python API.
  • begin_forward is renamed to plan, which is consistent with the naming convention of nvmath API.
  • end_forward is deprecated and has no effect after this PR.

There are some slight differences between the old forward and the new run API:

  • All extra arguments such as causal and logits_soft_cap are now provided to the plan (previously begin_forward) API and cached until the next plan call; only the query and KV-Cache tensors need to be passed to run.

The old begin_forward/forward/end_forward APIs are still functional, but we will gradually deprecate them in future releases.

Check #466 for more details.
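For illustration, here is a minimal sketch of the new plan/run flow using BatchPrefillWithPagedKVCacheWrapper on a toy two-request batch. The tensor shapes, index layouts, and argument names below are assumptions for illustration only; consult the official documentation and #466 for the exact signatures.

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
qo_lens, kv_lens = [7, 5], [13, 9]   # toy per-request query/KV lengths (assumed)
num_pages = len(kv_lens)             # one KV page per request in this toy setup

# Workspace buffer shared by the planning and execution phases.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")

# Index tensors describing the ragged queries and the paged KV-Cache.
qo_indptr = torch.tensor([0, 7, 12], dtype=torch.int32, device="cuda")
kv_indptr = torch.tensor([0, 1, 2], dtype=torch.int32, device="cuda")
kv_indices = torch.tensor([0, 1], dtype=torch.int32, device="cuda")
kv_last_page_len = torch.tensor(kv_lens, dtype=torch.int32, device="cuda")

# plan() now receives the extra arguments (e.g. causal, logits_soft_cap);
# they are cached until the next plan() call.
wrapper.plan(
    qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    causal=True,
)

q = torch.randn(sum(qo_lens), num_qo_heads, head_dim,
                dtype=torch.float16, device="cuda")
kv_cache = torch.randn(num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")

# run() only needs the query and the paged KV-Cache; no end_forward is required.
out = wrapper.run(q, kv_cache)
```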

MultiLevelCascadeAttentionWrapper

Starting from 0.1.6, we introduce a new MultiLevelCascadeAttentionWrapper API for cascade inference, which supports multi-level cascade inference where the KV-Cache of all levels can be managed in a unified paged KV-Cache.

See the documentation and tutorial for API usage and a layout explanation.

The old BatchDecodeWithSharedPrefixPagedKVCacheWrapper and BatchPrefillWithSharedPrefixPagedKVCacheWrapper will be deprecated in future releases.
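As a rough sketch of how the new wrapper might be used, the snippet below sets up a two-level cascade (one shared-prefix page plus one unique page per request). The index layouts, shapes, and argument names are assumptions; refer to the documentation and tutorial for the authoritative interface.

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
batch_size = 2   # two decode-style requests (assumed toy setup)
num_pages = 3    # page 0: shared prefix, pages 1-2: per-request unique KV

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
# Two cascade levels: level 0 is the shared prefix, level 1 is per-request KV.
wrapper = flashinfer.MultiLevelCascadeAttentionWrapper(2, workspace, "NHD")

# Level 0: both queries attend to the single shared-prefix page.
# Level 1: each query attends to its own unique page.
qo_indptr_arr = [
    torch.tensor([0, 2], dtype=torch.int32, device="cuda"),
    torch.tensor([0, 1, 2], dtype=torch.int32, device="cuda"),
]
kv_indptr_arr = [
    torch.tensor([0, 1], dtype=torch.int32, device="cuda"),
    torch.tensor([0, 1, 2], dtype=torch.int32, device="cuda"),
]
kv_indices_arr = [
    torch.tensor([0], dtype=torch.int32, device="cuda"),
    torch.tensor([1, 2], dtype=torch.int32, device="cuda"),
]
kv_last_page_len_arr = [
    torch.tensor([page_size], dtype=torch.int32, device="cuda"),
    torch.tensor([page_size, page_size], dtype=torch.int32, device="cuda"),
]

wrapper.plan(
    qo_indptr_arr, kv_indptr_arr, kv_indices_arr, kv_last_page_len_arr,
    num_qo_heads, num_kv_heads, head_dim, page_size,
)

q = torch.randn(batch_size, num_qo_heads, head_dim,
                dtype=torch.float16, device="cuda")
# All levels share one unified paged KV-Cache.
kv_cache = torch.randn(num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")
out = wrapper.run(q, kv_cache)
```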

Features

Refactor

  • refactor: replace begin_forward/forward/end_forward with plan/run #466

Misc

  • misc: improve error handling of sampling kernels (#456) (0dce178)

Performance Improvements

  • slight optimization on f16->f8 fragment layout swizzling (#453) (0d61871)
  • slight optimization on fragment layout swizzle (#458) (7c397cb)
  • use persistent kernel for merging attention states (#459) (be6bf5b)

Acknowledgement

We thank @LiuXiaoxuanPKU for enhancing the speculative sampling operator, @merrymercy for API change suggestions, and @zhyncs for integrating the fp8 BMM cuBLAS implementation.

0.1.5 (2024-08-13)

Bugfix

  • resolve the weird cu121 compile issue (#446) (5f0159e)
  • Fix PagedPrefill python api and some typos (#441) (3fff008)
  • fix prefill kernels' lse result for empty kv-cache (#440) (6ac28f4)

Features

  • decouple float and int workspace buffer (#442) (a7ee566)

Performance Improvements

  • faster fp8->fp16 dequantization for pre sm_90 arch (#439) (c93f647)

Acknowledgement

We thank the community for their contributions and feedback: @comaniac, @hnyls2002, @jianfei-wangg, @Yard1.

0.1.4 (2024-08-09)

Features

Bug Fixes

  • fix dispatch fp16 type when enable fp8 (#430) (daa5566)
  • improve numerical stability of sampling kernels (#429) (898d8ea)

Other improvements

  • break up _kernels into multiple modules (#428) (8e482d9)

Acknowledgement

We thank the community for their contributions and feedback: @comaniac, @esmeetu, @LiuXiaoxuanPKU, @peng1999, @xslingcn, @Yard1, @zhyncs.

0.1.3 (2024-07-31)

Bugfix

  • bugfix: Fix cudagraph mode of BatchPrefillWithRaggedKVCacheWrapper (#412) (9907bc)
  • fix cu118 cub usage for sampling kernels (#410) (58d359)

Misc

  • enhance allocator error info and add shape check for prefill begin forward functions (#413) (5e36c5)

0.1.2 (2024-07-29)

Bugfix

Features

Performance Improvements

0.1.1 (2024-07-20)

Bugfix

  • fix the invalid kernel configuration for architectures with small shared memory size (#385) (cdac57)

Features

  • expose decoupled kv-cache to pytorch api (#383) (457a0ae)

Performance Improvements

0.1.0 (2024-07-17)

Features

  • Add mask to merge_state_in_place (#372) (e14fa81)
  • expose pytorch api for block sparse attention (#375) (4bba6fa)
  • Fused GPU sampling kernel for joint top-k & top-p sampling (#374) (6e028eb)

0.0.9 (2024-07-12)

Bugfix

  • fix the decode kernel segfault in cudagraph mode (#368) (c69cfa)
  • fix decode kernels output for empty kv cache (#363) (ac72b1)
  • check gpu id in PyTorch APIs and use input tensor's gpu default stream (#361) (1b84fa)

Performance Improvements

Acknowledgement

We thank @Yard1, @Ying1123 and @zhyncs for their contributions.

0.0.8 (2024-07-03)

Bugfix

  • fix prefill/append kernel behavior for empty kv-cache (#353) (7adc8c)
  • fix decode attention kernel with logits cap (#350) (f5f7a2)

0.0.7 (2024-06-28)

Breaking Changes

  • batch_decode_with_padded_kv_cache was removed; we encourage users to use BatchDecodeWithPagedKVCacheWrapper instead. (#343)

Bugfix

  • fix the forward_return_lse function in BatchPrefillWithRaggedKVCache class (#337)
  • fix the scheduler behavior of large page size (#333)

Features

Performance Improvements

0.0.6 (2024-06-21)

Bugfix

Fixed some bugs in v0.0.5 that might lead to crashes and unstable performance.

Performance Improvements

  • use 1x4 warp layout for small query length (#322) (4e89b4d)

0.0.5 (2024-06-20)

Highlights

Acknowledgement

We thank @ibsidorenko, @LiuXiaoxuanPKU, @Yard1, @AgrawalAmey, @xuzhenqi, @mgerstgrasser, @esmeetu, @yz-tang, @HSQ79815, @Qubitium, @shreygupta2809, @sighingnow, @vinx13, @tqchen, @merrymercy, @comaniac and many others for their contributions and helpful discussions for the 0.0.5 release.

Refactor

  • support any GQA group size for tensor-cores kernels (#301) (c111ca)
  • support any page size for tensor-cores kernels (#306) (82fd8c)

Features

  • add use_tensor_cores option to decode kernels to accelerate GQA (#317) (3b50dd5)
  • add group gemm operators (#282) (e08ba42)
  • initial support of distributed operators (#289) (03553da)
  • initial support of logits hook (#298) (ab1e2ad)
  • Separate Q and KV dtypes for decode (#286) (5602659)
  • support cuda graph for batched multi-query(prefill/append) attention (#275) (83ceb67)
  • support cuda graph for batched multi-query(prefill/append) attention (#277) (24cc583)
  • support custom attention mask in prefill/append attention kernels (#266) (7304282)
  • fused speculative sampling kernels (#259) (cea2bb)
  • expose sampling APIs in pytorch (#238) (092902)

Performance Improvements

0.0.4 (2024-05-01)

Features

  • pytorch 2.3 support
  • gpu sampling kernels (top-p, top-k)
  • more gqa group sizes
  • add mma instructions for fp8 (#179) (d305798)
  • mma rowsum for fp8 (#180) (5af935c)
  • support any num_heads for get_alibi_slope (#200) (b217a6f)

Bug Fixes

  • fix python package dispatch error message (#182) (8eed01c)

0.0.3 (2024-03-08)

Features

Bug Fixes

Misc

  • add stream argument in BeginForwardFunction of TVMWrapper (#164) (fabfcb5)

Performance Improvements

  • multiply q by sm_scale in decode kernels (#144) (660c559)

0.0.2 (2024-02-17)

Bug Fixes