All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Added wheels for cuda 12.4
- Profiler: Fix computation of FLOPS for the attention when using xFormers
- Profiler: Fix MFU/HFU calculation when multiple dtypes are used
- Profiler: Trace analysis to compute MFU & HFU is now much faster
- fMHA/splitK: Fixed `nan` in the output when using a `torch.Tensor` bias where a lot of consecutive keys are masked with `-inf`
- Update Flash-Attention version to `v2.6.3` when building from scratch
- When using the most recent version of Flash-Attention, it is no longer possible to mix it with the cutlass backend. In other words, it is no longer possible to use the cutlass Fw with the flash Bw.
- fMHA: Removed `decoder` and `small_k` backends
- profiler: Removed `DetectSlowOpsProfiler` profiler
- Removed compatibility with PyTorch < 2.4
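The `nan` fix for keys masked with `-inf` noted in the list above is easier to picture with a small example. The sketch below is plain Python, not the xFormers kernel. It only illustrates one common way a fully masked block of keys can produce `nan` in a max-subtracted softmax, and one possible guard. The function names are hypothetical, and the actual cause inside the split-K kernel is not described in this changelog.

```python
import math

NEG_INF = float("-inf")


def naive_softmax(scores):
    """Max-subtracted softmax; breaks when every score is -inf."""
    m = max(scores)
    # If m is -inf, (s - m) is (-inf) - (-inf) = nan, which poisons the row.
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guarded_softmax(scores):
    """Same computation, but a fully masked row yields zeros instead of nan."""
    m = max(scores)
    if m == NEG_INF:
        return [0.0] * len(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


fully_masked = [NEG_INF] * 4
partly_masked = [NEG_INF, 0.0, NEG_INF, 0.0]

print(naive_softmax(fully_masked))
print(guarded_softmax(fully_masked))
print(guarded_softmax(partly_masked))
```

Returning zeros for a fully masked row is a design choice made for this illustration; other conventions are possible.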
Pre-built binary wheels require PyTorch 2.4.0
Pre-built binary wheels require PyTorch 2.4.0
Pre-built binary wheels require PyTorch 2.3.1
- fMHA: `PagedBlockDiagonalGappyKeysMask`
- fMHA: heterogeneous queries in `triton_splitk`
- fMHA: support for paged attention in flash
- fMHA: Added backwards pass for `merge_attentions`
- fMHA: Added `torch.compile` support for 3 biases (`LowerTriangularMask`, `LowerTriangularMaskWithTensorBias` and `BlockDiagonalMask`) - some might require PyTorch 2.4
- fMHA: Added `torch.compile` support in `memory_efficient_attention` when passing the flash operator explicitly (e.g. `memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp))`)
- fMHA: `memory_efficient_attention` now expects its `attn_bias` argument to be on the same device as the other input tensors. Previously, it would convert the bias to the right device.
- fMHA: `AttentionBias` subclasses are now constructed by default on the `cuda` device if available - they used to be created on the CPU device
- 2:4 sparsity: Added `xformers.ops.sp24.sparsify24_ste` for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
- fMHA: Fixed out-of-bounds reading for Split-K triton implementation
- Profiler: fix bug with modules that take a single tuple as argument
- Profiler: Added manual trigger for a profiling step, by creating a `trigger` file in the profiling directory
- Removed support for PyTorch version older than 2.2
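The Straight Through Estimator entry in the list above can be pictured with a toy example. This is a minimal plain-Python sketch of the general STE idea, not the `sparsify24_ste` implementation. The function names and the `kept_scale` / `masked_scale` parameters are hypothetical and chosen only for illustration.

```python
def masked_forward(values, keep_mask):
    """Forward pass: zero out the entries that the mask drops."""
    return [v if keep else 0.0 for v, keep in zip(values, keep_mask)]


def ste_backward(grad_output, keep_mask, kept_scale=1.0, masked_scale=1.0):
    """Backward pass with a straight-through estimator.

    The exact gradient of masked_forward is zero at dropped entries.
    STE lets the incoming gradient through anyway, optionally rescaled
    differently for kept and masked-out entries.
    """
    return [
        g * (kept_scale if keep else masked_scale)
        for g, keep in zip(grad_output, keep_mask)
    ]


mask = [True, False, False, True]
print(masked_forward([0.9, 0.1, -0.2, -1.5], mask))
print(ste_backward([1.0, 2.0, 3.0, 4.0], mask, masked_scale=0.5))
```

With both scales left at 1.0 the backward pass is the identity, which is the plain STE. Lowering `masked_scale` damps updates to values that were dropped in the forward pass.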
Pre-built binary wheels require PyTorch 2.3.0
- [2:4 sparsity] Added support for Straight-Through Estimator for `sparsify24` gradient (`GRADIENT_STE`)
- [2:4 sparsity] `sparsify24_like` now supports the cuSparseLt backend, and the STE gradient
- Basic support for `torch.compile` for the `memory_efficient_attention` operator. Currently only supports Flash-Attention, and without any bias provided. We want to expand this coverage progressively.
- merge_attentions no longer needs inputs to be stacked.
- fMHA: triton_splitk now supports additive bias
- fMHA: benchmark cleanup
Pre-built binary wheels require PyTorch 2.2.2
Pre-built binary wheels require PyTorch 2.2.1
- New `merge_attentions` function
- fMHA: New gappy attention biases.
- fMHA: Updated Flash-Attention to v2.5.6: this has a performance improvement for multiquery.
- fMHA: triton_splitk changed and expanded. Now amalgamates using LSE. Can autotune, supports causal with a small number of queries - not just 1. Experimental support for paged attention.
- `rope_padded`: Fixed CUDA error with many queries (more than 65k)
- `rmsnorm`: Fixed CUDA error with large inputs (enables 512k+ sequence length on Llama2 70B)
- fMHA: Removed triton operator (`fmha.triton.*`, `xformers.ops.MemoryEfficientAttentionTritonFwdFlashBwOp`, `xformers.ops.TritonFlashAttentionOp`), as it has correctness issues under some conditions, and is slower than other implementations.
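The list above mentions merging attention results and amalgamating split-K chunks using the LSE (log-sum-exp). The following is a minimal plain-Python sketch of that idea for a single query with scalar values. It is not the xFormers implementation, and `partial_attention` and `merge_partials` are hypothetical helper names.

```python
import math


def partial_attention(scores, values):
    """Attention over one chunk of keys: returns (output, logsumexp of scores)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    lse = m + math.log(total)
    return out, lse


def merge_partials(partials):
    """Combine per-chunk (output, lse) pairs into the full-attention result."""
    max_lse = max(lse for _, lse in partials)
    # Each chunk is weighted by exp(lse): its share of the softmax denominator.
    weights = [math.exp(lse - max_lse) for _, lse in partials]
    total = sum(weights)
    merged = sum(w * out for w, (out, _) in zip(weights, partials)) / total
    merged_lse = max_lse + math.log(total)
    return merged, merged_lse


scores = [0.5, -1.0, 2.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0]

# Attention over all keys at once.
full_out, full_lse = partial_attention(scores, values)

# The same keys split into two chunks, then merged through their LSEs.
merged_out, merged_lse = merge_partials([
    partial_attention(scores[:2], values[:2]),
    partial_attention(scores[2:], values[2:]),
])
print(full_out, merged_out)
```

Because each chunk carries its own LSE, the chunks can be computed independently and combined afterwards without revisiting the keys.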
Pre-built binary wheels require PyTorch 2.2.0
- Added components for model/sequence parallelism, as near-drop-in replacements for FairScale/Megatron Column&RowParallelLinear modules. They support fusing communication and computation for sequence parallelism, thus making the communication effectively free.
- Added kernels for training models with 2:4-sparsity. We introduced a very fast kernel for converting a matrix A into 2:4-sparse format, which can be used during training to dynamically sparsify weights, activations, etc. xFormers also provides an API that is compatible with `torch.compile`, see `xformers.ops.sparsify24`.
- Made selective activation checkpointing compatible with `torch.compile`.
- Triton kernels now require a GPU with compute capability 8.0 at least (A100 or newer). This is due to newer versions of triton not supporting older GPUs correctly
- Removed support for PyTorch version older than 2.1.0
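For readers unfamiliar with the 2:4-sparsity format mentioned in the list above: at most two of every four consecutive values are nonzero. The plain-Python sketch below illustrates the pattern only. Keeping the two largest-magnitude entries per group is an assumption made for this example, and the helper name is hypothetical. It does not describe how `xformers.ops.sparsify24` selects values.

```python
def sparsify_2_of_4(row):
    """Keep the 2 largest-magnitude entries in each group of 4, zero the rest."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two entries with the largest absolute value.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


dense = [0.1, -3.0, 2.0, 0.5, 1.0, 0.25, -4.0, 0.0]
print(sparsify_2_of_4(dense))
```

Every group of four in the result has exactly two zeros, which is the structure that lets hardware skip half of the multiplications.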
Pre-built binary wheels require PyTorch 2.1.1 (xFormers `0.0.23`) or PyTorch 2.1.2 (xFormers `0.0.23.post1`).
- fMHA: Fixed a bug in cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the BW pass. This would happen with MQA when one sequence has a query with `length%64 == 1`
- fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes, and now supports `BlockDiagonalCausalWithOffsetPaddedKeysMask`
- fMHA: Added `LocalAttentionFromBottomRightMask` (local)
- fMHA: Added `LowerTriangularFromBottomRightMask` (causal)
- fMHA: Added `LowerTriangularFromBottomRightLocalAttentionMask` (local + causal)
- Removed `xformers.triton.sum_strided`
- fMHA: Backward pass now works in PyTorch deterministic mode (although slower)
- fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to `memory_efficient_attention`, see the documentation for more details
- fMHA: Added experimental support for Local Attention biases to `memory_efficient_attention`
- Added an example of efficient LLaMa decoding using xformers operators
- Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
- Added an efficient rope implementation in triton, to be used in LLM decoding
- Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
- `xformers.info` now indicates the Flash-Attention version used
- fMHA: Removed `smallK` backend support for CPU. `memory_efficient_attention` only works for CUDA/GPU tensors now
- DEPRECATION: Many classes in `xformers.factory`, `xformers.triton` and `xformers.components` have been or will be deprecated soon (see tracking issue facebookresearch#848)
- fMHA: Updated flash-attention to v2, with massive performance improvements for both the forward pass and backward pass. This implementation is now used by default when it's available
- fMHA/cutlass: Fix potential race condition in the FW/BW passes
- fMHA/cutlass: Fix `attn_bias` stride overflow for very long sequences (>32k)
- `LowerTriangularMask` is now backward compatible with older xformers versions
- `memory_efficient_attention` now expects the `attn_bias` argument to have a head dimension
- `memory_efficient_attention` no longer broadcasts the batch/head dimensions of `attn_bias`. Please use `.expand` if you need to broadcast the bias
- Remove `causal_diagonal` argument from `BlockDiagonalCausalWithOffsetPaddedKeysMask`
- Binary wheels on pypi/conda now contain H100 kernels
- fMHA: Added backend specialized for decoding that does not use TensorCores - useful when not using multiquery
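The "from bottom right" masks named earlier in this list (local, causal, and local + causal) differ from a standard lower-triangular mask only when the number of queries and keys differ: the mask is shifted so that the last query lines up with the last key. The plain-Python sketch below builds such boolean masks for illustration. The helper name and the `window_left` / `window_right` parameters are hypothetical, and the alignment rule is stated as an assumption rather than taken from this changelog.

```python
def bottom_right_mask(num_queries, num_keys, window_left=None, window_right=0):
    """Boolean mask[q][k], True where query q may attend to key k.

    Keys are shifted by (num_queries - num_keys) so that the last key
    aligns with the last query. window_left=None means an unbounded
    look-back; window_right=0 makes the mask causal.
    """
    shift = num_queries - num_keys
    mask = []
    for q in range(num_queries):
        row = []
        for k in range(num_keys):
            pos = k + shift  # key position expressed in query coordinates
            allowed = pos <= q + window_right
            if window_left is not None:
                allowed = allowed and pos >= q - window_left
            row.append(allowed)
        mask.append(row)
    return mask


# Causal: 2 queries against 4 keys; the last query sees every key.
causal = bottom_right_mask(2, 4)
# Local + causal: each query sees itself and one key to its left.
local_causal = bottom_right_mask(2, 4, window_left=1)
for row in causal:
    print(row)
```

When `num_queries == num_keys` the shift is zero and the causal case reduces to the usual lower-triangular mask.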
NOTE: Binary wheels are now provided only for PyTorch 2 with cuda 11.8. It is still possible to use xFormers with older versions of PyTorch by building from source or using conda.
- fMHA/cutlass (backward): Massive performance improvements when `batch_size * num_heads` is low (10x+)
- fMHA/cutlass: Further performance improvements for both the forward & backward kernels
- fMHA (backward): Now dispatching to cutlass when `embed_dim>64`
- fMHA: Updated Flash-Attention to `v1.0.5`
- fMHA now runs on H100 (support is experimental)
- Display `nvcc` version used to compile `xformers` in `python -m xformers.info`
- Fixed performance regression with `nvcc>11.6` (facebookresearch#712)
- fMHA/cutlass: Fixed `nan` in the output when using a `torch.Tensor` with `-inf` prefixes as `attn_bias` (facebookresearch#722)
- fMHA/cutlass: Fixed `nan` in the output when the sequence length is larger than `2 ** 15` (facebookresearch#719)
- fMHA/cutlass: Significant performance improvements (up to 2x) for both the forward pass and backward pass
- fMHA/cutlass: The kernels are now deterministic
- fMHA/cutlass: Fixed backward pass correctness when using dropout (facebookresearch#724)
- Added `xformers.ops.index_select_cat` and `xformers.ops.scaled_index_add` - those are experimental functions that only work with a few shapes, and can be used to write efficient stochastic depth in transformer architectures for instance
- fMHA: `memory_efficient_attention` now accepts `torch.Tensor` as attention bias for any seqlen, although there are still requirements on the alignment of the bias tensor (see facebookresearch#683)
- fMHA: Fixed BW pass on Sm86/Sm89 GPUs when `K > 64` (RTX 3090, RTX 4090, A6000, ..) [facebookresearch#631]
- fMHA/CUTLASS: Added tensor attn bias support [facebookresearch#587] - contribution from @jfc4050
- fMHA/CUTLASS: Added tensor attn bias grad support [facebookresearch#587] - contribution from @jfc4050
- fMHA/CUTLASS: Added dropout support [facebookresearch#587] - contribution from @jfc4050
- fMHA: Added support for varying sequence lengths [facebookresearch#500]
- Updated triton dependency [facebookresearch#418]
- Strip lineinfo from binaries, reducing the binary size [facebookresearch#549]
- Added support for pip wheels [facebookresearch#588, facebookresearch#573, facebookresearch#534, facebookresearch#523, ...] big thanks to @AbdBarho!
- Fixed compatibility with Python 3.7 [facebookresearch#541] - thanks to @susumuota
- fMHA: Fixed strides for QKV gradients for cutlass attention [facebookresearch#535]
- fMHA: Stricter inputs validation to avoid CUDA errors for unsupported inputs [facebookresearch#592]
- fMHA/Flash-Attention: Updated to https://github.com/HazyResearch/flash-attention/commit/a1f49a2b92b6fa022379bbebafed9d7f5e96a675 with multiple changes from @TriDao that make the operator up to 20% faster
- fMHA/Flash-Attention: Fixed backward pass wrapper, where non-contiguous gradients could give the wrong result [facebookresearch#548]
- fMHA: Separate each operator into forward and backward operators. It's now possible to use any combination of forward+backward (for instance Triton forward and Flash-Attention backward) [facebookresearch#560]
- fMHA: Added Triton operator for forward pass from Flash-Attention authored by @TriDao, will be automatically used on A100 when compatible
- fMHA: Added `xformers.ops.memory_efficient_attention_forward`, `xformers.ops.memory_efficient_attention_forward_requires_grad`, `xformers.ops.memory_efficient_attention_backward` for power-users who write custom autograd functions [facebookresearch#560]
- fMHA: Support for custom scaling for the CUTLASS-based kernel [facebookresearch#530] - contribution from @comaniac
- fMHA/CUTLASS: The current CUDA stream is now used by the kernel [facebookresearch#491]
- fMHA/CUTLASS: Improve overall performance
- SwiGLU: Added `xformers.ops.SwiGLU` and its functional counterpart (`xformers.ops.swiglu`) [facebookresearch#490]
- fMHA: Possible to combine CUTLASS's forward with flash-attention's backward pass [facebookresearch#469] - improves performance on A100 for K = 128
- fMHA: Add custom `xformers.ops.unbind` operator to avoid a cat in the attention block [facebookresearch#458]
- fMHA: Added CUTLASS-based kernel for `xformers.ops.memory_efficient_attention`. This kernel is automatically selected depending on the inputs, and works on any GPU after P100 [facebookresearch#362]
- Removed duplicated biases in the FusedMLP layers [facebookresearch#317]
- Rotary embeddings respecting input types [facebookresearch#326]
- Poolformer style instantiating useless projection layers [facebookresearch#349]
- Fix layer position not being properly tracked, causing extra layernorms for programmatic xformers [facebookresearch#348]
- Pass use_triton flag to LayerNorm module [facebookresearch#336]
- Four blocksparsity layouts from DeepSpeed [facebookresearch#320]
- Support several initialization options [facebookresearch#312]
- Conv2DFeedforward feedforward part [facebookresearch#321]
- VisualAttention [facebookresearch#329]
- Automatic blocksparse for causal attention [facebookresearch#334]
- Better hierarchical transformer generation [facebookresearch#345]
- Fused operations with AOTAutograd/NVFuser, integration into MLP [facebookresearch#357]
- Refactor LRA code to use Pytorch Lightning [facebookresearch#343]
- Fix some torchscriptability [facebookresearch#246]
- Fix FourierMix being compatible with AMP [facebookresearch#258]
- Better asserts on QKV dimensions [facebookresearch#264]
- Better perfs for FusedMLP and FusedLinearLayer [facebookresearch#283]
- Deepnorm init missing self-attention [facebookresearch#284]
- Simplicial Embeddings [facebookresearch#259]
- Mem efficient attention, FW pass [facebookresearch#267]
- MHA benchmark
- MLP benchmark
- Move all triton kernels to triton v2 [facebookresearch#272]
- Mem efficient attention, BW pass [facebookresearch#281]
- Metaformer support [facebookresearch#294]
- Expose bias flag for feedforwards, same default as Timm [facebookresearch#220]
- Update eps value for layernorm, same default as torch [facebookresearch#221]
- PreNorm bugfix, only one input was normalized [facebookresearch#233]
- Fix bug where embedding dimensions that did not match model dim would lead to a crash [facebookresearch#244]
- Add DeepNet (DeepNorm) residual path and init [facebookresearch#227]
- Compositional Attention [facebookresearch#41]
- Experimental Ragged attention [facebookresearch#189]
- Mixture of Experts [facebookresearch#181]
- BlockSparseTensor [facebookresearch#202]
- Nd-tensor support for triton softmax [facebookresearch#210]
- Bugfix Favor, single feature map [facebookresearch#183]
- Sanity check blocksparse settings [facebookresearch#207]
- Fixed some picklability [facebookresearch#204]
- Much faster fused dropout [facebookresearch#164]
- Fused dropout repeatability [facebookresearch#173]
- Embedding weight tying option [facebookresearch#172]
- Dropout setting not properly passed in many attentions [facebookresearch#123]
- Fix self attention optimization not being triggered, broken residual path [facebookresearch#119]
- Improve speed by not using contiguous Tensors when not needed [facebookresearch#119]
- Attention mask wrapper [facebookresearch#113]
- ViT comparison benchmark [facebookresearch#117]
- Homogenizing the masks, additive or bool [facebookresearch#79][facebookresearch#85][facebookresearch#86]
- Fix causality flag not being respected [facebookresearch#103]
- Enabling FusedLayerNorm by default in the factory if Triton is available
- Fixing Favor with fp16
- Fixing Favor trainability
- Fused dropout/bias/activation layer [facebookresearch#58]
- Fused layernorm used by default in the factory [facebookresearch#92]
- Nystrom causal attention [facebookresearch#75]
- More robust blocksparse [facebookresearch#24]
- Rotary embeddings [facebookresearch#32]
- More flexible layernorm [facebookresearch#50]