Releases: facebookresearch/xformers
`v0.0.29.post2` - build for PyTorch 2.6.0
Pre-built binary wheels are available for PyTorch 2.6.0. Following PyTorch, we build wheels for CUDA 11.8, 12.4, and 12.6 only (we no longer build for CUDA 12.1).
xFormers now requires PyTorch >= 2.6
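A quick way to check this requirement after installing (a minimal sketch; `python -m xformers.info` prints a fuller report of the build and the available operators):

```python
import torch
import xformers

print("torch:", torch.__version__)        # expected >= 2.6.0 for this release
print("xformers:", xformers.__version__)  # e.g. 0.0.29.post2
```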
[v0.0.29.post1] Fix Flash2 on Windows
This fixes the issue reported in a comment on #1163.
FAv3 enabled by default, deprecated components removed
Pre-built binary wheels require PyTorch 2.5.1
Improved:
- [fMHA] Creating a `LowerTriangularMask` no longer creates a CUDA tensor
- [fMHA] Updated Flash-Attention to `v2.7.2.post1`
- [fMHA] Flash-Attention v3 will now be used by `memory_efficient_attention` by default when available, unless the operator is enforced with the `op` keyword argument (see the sketch after this list). Switching from Flash2 to Flash3 can make transformer trainings ~10% faster end-to-end on H100s
- [fMHA] Fixed a performance regression with the `cutlass` backend for the backward pass (#1176) - mostly used on older GPUs (e.g. V100)
- Fixed `swiglu` operator compatibility with torch.compile with PyTorch 2.6
- Fixed activation checkpointing of SwiGLU when AMP is enabled (#1152)
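A minimal sketch of the backend selection described above, assuming CUDA, fp16 inputs, and illustrative shapes: by default `memory_efficient_attention` picks the best available operator (Flash3 on H100s when present), and the `op` keyword argument pins it to a specific one such as Flash2.

```python
import torch
from xformers.ops import memory_efficient_attention, LowerTriangularMask
from xformers.ops import fmha

# [batch, seq_len, heads, head_dim] - illustrative shapes
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Default dispatch: the fastest available backend is chosen automatically.
out_auto = memory_efficient_attention(q, k, v, attn_bias=LowerTriangularMask())

# Enforce the Flash2 forward/backward operators via the `op` keyword argument.
out_flash2 = memory_efficient_attention(
    q, k, v,
    attn_bias=LowerTriangularMask(),
    op=(fmha.flash.FwOp, fmha.flash.BwOp),
)
```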
Removed:
- Following PyTorch, xFormers no longer builds binaries for conda. Pip is now the only recommended way to get xFormers
- Removed unmaintained/deprecated components in `xformers.components.*` (see #848)
`v0.0.28.post3` - build for PyTorch 2.5.1
[0.0.28.post3] - 2024-10-30
Pre-built binary wheels require PyTorch 2.5.1
`v0.0.28.post2` - build for PyTorch 2.5.0
[0.0.28.post2] - 2024-10-18
Pre-built binary wheels require PyTorch 2.5.0
`v0.0.28.post1` - fixing upload for CUDA 12.4 wheels
[0.0.28.post1] - 2024-09-13
Properly upload wheels for CUDA 12.4
FAv3, profiler update & AMD
Pre-built binary wheels require PyTorch 2.4.1
Added
- Added wheels for CUDA 12.4
- Added conda builds for Python 3.11
- Added wheels for ROCm 6.1
Improved
- Profiler: Fix computation of FLOPS for the attention when using xFormers
- Profiler: Fix MFU/HFU calculation when multiple dtypes are used
- Profiler: Trace analysis to compute MFU & HFU is now much faster
- fMHA/splitK: Fixed `nan` in the output when using a `torch.Tensor` bias where a lot of consecutive keys are masked with `-inf` (see the sketch after this list)
- Update Flash-Attention version to `v2.6.3` when building from scratch
- When using the most recent version of Flash-Attention, it is no longer possible to mix it with the cutlass backend. In other words, it is no longer possible to use the cutlass Fw with the flash Bw.
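A minimal sketch, with illustrative shapes, of the kind of additive `torch.Tensor` bias the splitK fix above refers to: a decode-style query attending to keys where a long run of consecutive positions is masked with `-inf`.

```python
import torch
from xformers.ops import memory_efficient_attention

B, H, Mq, Mk, K = 2, 8, 1, 128, 64  # illustrative sizes
q = torch.randn(B, Mq, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, Mk, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, Mk, H, K, device="cuda", dtype=torch.float16)

# Additive bias of shape [B, H, Mq, Mk]; keys 64..127 are masked out with -inf.
bias = torch.zeros(B, H, Mq, Mk, device="cuda", dtype=torch.float16)
bias[..., 64:] = float("-inf")

out = memory_efficient_attention(q, k, v, attn_bias=bias)
```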
Removed
- fMHA: Removed `decoder` and `small_k` backends
- profiler: Removed `DetectSlowOpsProfiler` profiler
- Removed compatibility with PyTorch < 2.4
- Removed conda builds for Python 3.9
- Removed Windows pip wheels for CUDA 12.1 and 11.8
torch.compile support, bug fixes & more
Pre-built binary wheels require PyTorch 2.4.0
Added
- fMHA: `PagedBlockDiagonalGappyKeysMask`
- fMHA: heterogeneous queries in `triton_splitk`
- fMHA: support for paged attention in flash
- fMHA: Added backwards pass for `merge_attentions`
- fMHA: Added `torch.compile` support for 3 biases (`LowerTriangularMask`, `LowerTriangularMaskWithTensorBias` and `BlockDiagonalMask`) - some might require PyTorch 2.4
- fMHA: Added `torch.compile` support in `memory_efficient_attention` when passing the flash operator explicitly (e.g. `memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp))`) - see the sketch after this list
- fMHA: `memory_efficient_attention` now expects its `attn_bias` argument to be on the same device as the other input tensors. Previously, it would convert the bias to the right device.
- fMHA: `AttentionBias` subclasses are now constructed by default on the `cuda` device if available - they used to be created on the CPU device
- 2:4 sparsity: Added `xformers.ops.sp24.sparsify24_ste` for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
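A minimal sketch of the `torch.compile` support called out above, assuming CUDA, fp16 inputs, and illustrative shapes: the attention call passes the flash operator explicitly, uses a `LowerTriangularMask` bias, and both forward and backward run through the compiled function.

```python
import torch
from xformers.ops import memory_efficient_attention, LowerTriangularMask
from xformers.ops import fmha

def attention(q, k, v):
    # Flash operator passed explicitly, causal bias via LowerTriangularMask.
    return memory_efficient_attention(
        q, k, v,
        attn_bias=LowerTriangularMask(),
        op=(fmha.flash.FwOp, fmha.flash.BwOp),
    )

compiled_attention = torch.compile(attention)

# [batch, seq_len, heads, head_dim] - illustrative shapes
q, k, v = (
    torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16, requires_grad=True)
    for _ in range(3)
)
out = compiled_attention(q, k, v)
out.sum().backward()  # backward also runs through the compiled graph
```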
Improved
- fMHA: Fixed out-of-bounds reading for Split-K triton implementation
- Profiler: fix bug with modules that take a single tuple as argument
- Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory
Removed
- Removed support for PyTorch version older than 2.2.0
[v0.0.27] torch.compile support, bug fixes & more
Added
- fMHA: `PagedBlockDiagonalGappyKeysMask`
- fMHA: heterogeneous queries in `triton_splitk`
- fMHA: support for paged attention in flash
- fMHA: Added backwards pass for `merge_attentions`
- fMHA: Added `torch.compile` support for 3 biases (`LowerTriangularMask`, `LowerTriangularMaskWithTensorBias` and `BlockDiagonalMask`) - some might require PyTorch 2.4
- fMHA: Added `torch.compile` support in `memory_efficient_attention` when passing the flash operator explicitly (e.g. `memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp))`)
- fMHA: `memory_efficient_attention` now expects its `attn_bias` argument to be on the same device as the other input tensors. Previously, it would convert the bias to the right device.
- fMHA: `AttentionBias` subclasses are now constructed by default on the `cuda` device if available - they used to be created on the CPU device
- 2:4 sparsity: Added `xformers.ops.sp24.sparsify24_ste` for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
Improved
- fMHA: Fixed out-of-bounds reading for Split-K triton implementation
- Profiler: fix bug with modules that take a single tuple as argument
- Profiler: Added manual trigger for a profiling step, by creating a `trigger` file in the profiling directory
Removed
- Removed support for PyTorch version older than 2.2.0