
Releases: NVIDIA/TransformerEngine

v1.12

18 Nov 22:52

Release Notes – Release 1.12

Key Features and Enhancements

  • [PyTorch] Added a rotary_base argument for RoPE instead of hard-coding the value to 10000 (see the sketch after this list).
  • [PyTorch] Added support for the pool argument in the make_graphed_callables API.
  • [PyTorch] Made miscellaneous minor improvements to mitigate CPU overhead.
  • [PyTorch] Expanded fused RoPE kernel support to include context parallelism and the “thd” qkv-format.
  • [PyTorch] Made flash-attn an optional dependency.
  • [JAX] Added support for sliding window attention.
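
A minimal sketch of the new rotary_base argument, assuming the RotaryPositionEmbedding helper under transformer_engine.pytorch.attention and the rotary_pos_emb forward argument of TransformerLayer; treat the exact shapes and defaults as illustrative rather than authoritative.

```python
# Hedged sketch: build RoPE frequencies with a non-default base and feed them
# to a TransformerLayer. Import path and keyword names are assumptions based
# on this release note, not a verified reference.
import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch.attention import RotaryPositionEmbedding

head_dim = 64
rope = RotaryPositionEmbedding(head_dim, rotary_base=500000)  # previously fixed at 10000
freqs = rope(max_seq_len=2048)  # per-position rotary frequencies

layer = te.TransformerLayer(hidden_size=1024, ffn_hidden_size=4096, num_attention_heads=16)
x = torch.randn(2048, 2, 1024, device="cuda")  # default [seq, batch, hidden] layout
y = layer(x, rotary_pos_emb=freqs)
```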

Fixed Issues

  • [PyTorch/C] Fixed window size calculation when using the cuDNN attention backend.
  • [PyTorch] Fixed miscellaneous bugs in the flash-attn version 3 backend.
  • [PyTorch] Fixed an issue using the flash-attn backend with context parallelism.
  • [PyTorch] Fixed a numerical error when using FP8 with activation recompute.
  • [PyTorch] Fixed an issue in the backward pass of the GroupedLinear class when weights do not require gradients.
  • [JAX] Fixed a numerical bug in the cuDNN attention backend when using context parallelism.

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

v1.11

08 Oct 21:27

Release Notes – Release 1.11

Key Features and Enhancements

  • [PyTorch] Added DTensor support for optimizers.
  • [PyTorch] Added a context parallel implementation with QKV all-to-all collectives.
  • [PyTorch] Added support for CPU offloading when using FP8 attention.
  • [PyTorch] Implemented padding and unpadding modules for FP8 that improve end-to-end performance of MoE models by ~2%.
  • [C/PyTorch] Added support for permutation operations for MoE and exposed them in the C API.
  • [PyTorch] Added support for RoPE when using FP8 attention.
  • [PyTorch] Added support for FlashAttention-3.
  • [JAX] Implemented context parallel fused attention using all-gather and reduce-scatter collectives.

Fixed Issues

  • [PyTorch] Fixed a crash in the fused Adam optimizer when master parameters are not set.
  • [PyTorch] Fixed a crash when using activation recompute with Python 3.10.
  • [PyTorch] Made miscellaneous fixes in the logic for selecting the correct attention backend.

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

v1.10

11 Sep 21:40

Release Notes – Release 1.10

Key Features and Enhancements

  • [PyTorch] Added an option to use keyword arguments with CUDA graphs.
  • [PyTorch] Implemented a new load-balanced offloading algorithm that makes full use of the CPU/GPU interconnect bandwidth.
  • [PyTorch] Added support for multi-latent attention.
  • [PyTorch] Added additional documentation, scripts, and benchmarks for the attention backend.
  • [PyTorch] Added a context-parallel implementation with KV all-gather for causal attention.
  • [PyTorch] Added support for data type casting in the fused Adam kernel.
  • [PyTorch] Added arguments for cumulative and maximum sequence lengths to the TransformerLayer and MultiheadAttention APIs.
  • [PyTorch] Added support for the padding mask in the unfused backend for dot-product attention.
  • [PyTorch] Expanded operation support in the fusion API (transformer_engine.pytorch.ops); see the sketch after this list.
  • [PyTorch] Made several improvements to reduce CPU overhead during execution.
  • [PaddlePaddle] Added an option to run dot-product attention deterministically.
  • [JAX] Added support for non-deterministic algorithms in the cuDNN flash attention backend for improved performance.
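
A minimal sketch of the operation-based fusion API, assuming transformer_engine.pytorch.ops exposes a Sequential container together with LayerNorm and Linear ops; the exact operation set and constructor arguments are assumptions, not a verified reference.

```python
# Hedged sketch: compose basic ops so Transformer Engine can fuse adjacent
# operations into a single kernel where a fusion is available.
import torch
from transformer_engine.pytorch import ops

block = ops.Sequential(
    ops.LayerNorm(1024),
    ops.Linear(1024, 4096),  # LayerNorm + Linear may be fused where a kernel exists
)
x = torch.randn(32, 1024, device="cuda")
y = block(x)
```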

Fixed Issues

  • [PyTorch] Fixed miscellaneous bugs in communication-GEMM overlap with userbuffers.
  • [PyTorch] Removed an additional copy of weights stored when using CPU offloading.
  • [PyTorch] Fixed a crash when running non-causal training with context parallelism.
  • [PyTorch] Fixed the calculation of tensor parallel size when using MQA/GQA.
  • [PyTorch] Fixed a crash when using context parallelism with the THD format.
  • [PyTorch] Fixed a crash in CUDA graphs when skipping warm-up iterations.
  • [PyTorch] Fixed a bug in TransformerLayer for the cross-attention case where arguments were incorrectly propagated to DotProductAttention.
  • [C] Hid arbitrary symbols exposed globally in the shared object in order to avoid symbol conflict errors, which could cause a crash during library loading and imports.

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

v1.9

16 Aug 16:27

Release Notes – Release 1.9

Key Features and Enhancements

  • [PyTorch] Added support for sliding window attention in the cuDNN backend (see the sketch after this list).
  • [PyTorch] Added an experimental torch.nn.Sequential style API for automatic operation-based fusions.
  • [C/PyTorch] Added support for bottom-right aligned diagonal causal mask.
  • [C/PyTorch] Added support for grouped GEMM for MoE training.
  • [JAX] Added support for THD attention format.
  • [PaddlePaddle] Added support for CUDA graphs.
  • [PaddlePaddle] Added support for PaddlePaddle versions >= 2.6.1.
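
A minimal sketch of sliding window attention, assuming DotProductAttention accepts a window_size argument as a (left, right) pair of window sizes; the exact argument name and convention are assumptions based on this note.

```python
# Hedged sketch: restrict causal attention to a local window of 128 past
# tokens. Which backend (cuDNN fused, flash-attn, unfused) actually serves the
# call depends on the installed versions and environment settings.
import torch
import transformer_engine.pytorch as te

attn = te.DotProductAttention(
    num_attention_heads=16,
    kv_channels=64,
    attn_mask_type="causal",
    window_size=(128, 0),  # look back at most 128 tokens, none ahead
)
q = torch.randn(512, 2, 16, 64, device="cuda", dtype=torch.bfloat16)  # [s, b, h, d]
k, v = torch.randn_like(q), torch.randn_like(q)
out = attn(q, k, v)
```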

Fixed Issues

  • [PyTorch] Fixed incorrect outputs when handling non-contiguous input tensors.
  • [PyTorch] Fixed a hang in the initialize_ub function during multi-node runs, along with miscellaneous improvements in communication-GEMM overlap with userbuffers.
  • [PyTorch] Fixed a convergence issue when using CPU offloading.
  • [PyTorch] Fixed a crash that occurred when using MoE and an expert received 0 tokens.
  • [JAX] Fixed a crash with newer JAX versions, which restrict the output format of HLO lowering.
  • [PaddlePaddle] Fixed a crash when using the standalone column parallel linear API.
  • Fixed a numerical bug in the QGeLU activation.
  • Fixed a compilation bug in the core library with CUDA 12.1.
  • Fixed a bug selecting tuned RMSNorm kernels.
  • Fixed performance overheads by reducing the number of calls to the CUDA driver.

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

v1.8

25 Jul 00:24

Release Notes – Release 1.8

Key Features and Enhancements

  • [PyTorch] Added a new argument, softmax_scale, to the DotProductAttention API.
  • [PyTorch] Extended Transformer Engine’s PyTorch build to always compile with tensor parallelism (TP) communication overlap support and to remove the MPI dependency. Also exposed the initialize_ub and destroy_ub APIs for communication-GEMM overlap configuration.
  • [PyTorch] Improved documentation for the DotProductAttention API, including benchmarks and end-to-end test scripts.
  • [PyTorch] Incorporated the Fused Adam and Fused SGD optimizers into Transformer Engine; they previously had to be installed from https://github.com/NVIDIA/apex (see the sketch after this list).
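
A minimal sketch of the bundled fused optimizer, assuming a FusedAdam class under transformer_engine.pytorch.optimizers; the import path is an assumption based on this note, and the optimizer otherwise follows the familiar torch.optim interface.

```python
# Hedged sketch: train a Transformer Engine Linear layer with the bundled
# FusedAdam instead of installing the optimizer from NVIDIA Apex.
import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch.optimizers import FusedAdam  # assumed import path

model = te.Linear(1024, 1024)
opt = FusedAdam(model.parameters(), lr=1e-4)

x = torch.randn(32, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```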

Fixed Issues

  • [PyTorch] Made internal changes to reduce CPU overhead.
  • [PyTorch] Fixed a crash that occurred when using TorchDynamo with the checkpoint API.
  • [PyTorch] Fixed an issue with loading an FP8 checkpoint when using FP8 attention.

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

v1.7

14 Jun 17:58

Release Notes – Release 1.7

Key Features and Enhancements

  • [JAX] Added support for SwiGLU, gated/non-gated ReLU, Quick GeLU, and squared ReLU activations.
  • [PyTorch] Added support for attention bias and various QKV formats when using context parallelism.
  • [PyTorch] Expanded the Linear API to handle zero input tokens for MoE-like use cases.
  • [PyTorch] Added support for upstream AMP (torch.amp.autocast) in the checkpoint API (see the sketch after this list).
  • [PyTorch] Added the squared ReLU activation.
  • [PyTorch] Updated flash-attention support to version 2.5.8.
  • [PaddlePaddle] Added support for gradient accumulation fusion.
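
A minimal sketch of activation recompute under upstream AMP, assuming transformer_engine.pytorch.checkpoint as the activation-checkpointing entry point; the exact keyword set of te.checkpoint is not spelled out here and should be checked against the API docs.

```python
# Hedged sketch: run a TE module under torch.amp.autocast while checkpointing
# its activations, so the forward pass is recomputed during backward.
import torch
import transformer_engine.pytorch as te

layer = te.LayerNormMLP(hidden_size=1024, ffn_hidden_size=4096)
x = torch.randn(128, 2, 1024, device="cuda", requires_grad=True)

with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    y = te.checkpoint(layer, x)  # activations recomputed in the backward pass
y.float().sum().backward()
```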

Fixed Issues

  • [PyTorch] Fixed an uninitialized TP group error that could occur when training with certain tensor parallel configurations.
  • [PyTorch] Fixed a bug that occurred when loading a checkpoint with calibrated high-precision weights.
  • [PyTorch] Improved the documentation for the attention mask.
  • [JAX] Fixed a bug with mismatching shapes of activations and their corresponding sharding constraints.
  • [JAX] Fixed an internal bug which caused an incorrect shape to be passed for the LayerNorm gradient.

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

v1.6

13 May 16:36

Release Notes – Release 1.6

Key Features and Enhancements

  • [PyTorch] Added a new make_graphed_callables API call for NVIDIA® CUDA® graph capture, including FP8 support.
  • [PyTorch] Added beta support for two boolean arguments in the DelayedScaling FP8 recipe (fp8_dpa and fp8_mha) to support FP8 attention, as sketched below. Note that the API exposure of this feature may change in future releases.
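
A minimal sketch of enabling FP8 attention through the recipe, assuming the standard DelayedScaling recipe from transformer_engine.common.recipe and the fp8_autocast context manager; since the fp8_dpa/fp8_mha flags are beta, their exposure may differ in later releases.

```python
# Hedged sketch: request FP8 dot-product attention (but not the full FP8 MHA
# path) via the DelayedScaling recipe flags named in this release note.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

recipe = DelayedScaling(fp8_dpa=True, fp8_mha=False)
layer = te.TransformerLayer(hidden_size=1024, ffn_hidden_size=4096, num_attention_heads=16)
x = torch.randn(128, 2, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
```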

Fixed Issues

  • [PyTorch] Fixed a numerical issue with storing weights in FP8 via the fp8_model_init API call.
  • [PyTorch] Fixed a bug in which PyTorch modules stored unnecessary activations for the backward pass and thus used excessive memory when training with frozen weights.
  • [JAX] Fixed a bug that caused an incorrect shape to be passed for the LayerNorm gradient.

Known Issues in This Release

These issues are unchanged from the previous release.

FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (Dao-AILab/flash-attention#358). You can work around this issue by setting the environment variable MAX_JOBS=1 during Transformer Engine installation.

[PyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). In order for Transformer Engine to keep consistent behavior between versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention 2.1 or later is installed.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

v1.5

17 Apr 17:41

Release Notes – Release 1.5

Key Features and Enhancements

  • [PyTorch] Added support for non-reentrant mode for activation recompute in the checkpoint API.
  • [PyTorch] Added support for rectangular matrices in the unfused softmax backend in order to support speculative decoding.
  • [PyTorch] Added the inference_params argument to the DotProductAttention API to support KV caching (see the sketch after this list).
  • [JAX] Added the DotProductAttention API.
  • [JAX] Expanded RoPE support using the rotary_pos_emb_group_method argument.
  • [PaddlePaddle] Added support for RMSNorm.
  • [PaddlePaddle] Added support for RoPE.
  • [PaddlePaddle] Added support for SwiGLU.
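
A minimal sketch of KV caching through the new inference_params argument; the InferenceParams helper and its constructor arguments are assumptions here, and only the inference_params keyword itself comes from this note.

```python
# Hedged sketch: keep a KV cache across decode steps by passing an
# InferenceParams object into the attention call.
import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch.attention import InferenceParams  # assumed import path

attn = te.DotProductAttention(num_attention_heads=16, kv_channels=64)
cache = InferenceParams(max_batch_size=2, max_sequence_length=2048)  # assumed signature

q = torch.randn(1, 2, 16, 64, device="cuda", dtype=torch.bfloat16)  # one new token
k, v = torch.randn_like(q), torch.randn_like(q)
out = attn(q, k, v, inference_params=cache)
```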

Fixed Issues

  • [pyTorch] Fixed a numerical issue with storing weights in FP8 via the fp8_model_init API.

Known Issues in This Release

  • FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (Dao-AILab/flash-attention#358). You can work around this issue by setting the environment variable MAX_JOBS=1 during Transformer Engine installation.
  • [PyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). In order for Transformer Engine to keep consistent behavior between versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention 2.1 or later is installed.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

  • [JAX] The arguments num_heads, dropout_rate, output_layernorm, apply_residual_connection_post_layernorm, and fuse_qkv are deprecated in the MultiHeadAttention API. They are replaced respectively with num_attention_heads, attention_dropout, input_layernorm, return_layernorm_output, and fused_qkv_params.

Miscellaneous Changes

There are no miscellaneous changes in this release.

v1.4

18 Mar 17:14

Release Notes – Release 1.4

Key Features and Enhancements

  • [C/PyTorch] Added support for the QuickGELU activation.
  • [C/PyTorch] Added a fused RoPE implementation for improved speedup.
  • [C/PyTorch] Added support for zero-centered gamma in RMSNorm (see the sketch after this list).
  • [C/PyTorch] Added support for ALiBi slopes to all attention backends.
  • [docs/PyTorch] Added a tutorial on accelerating HF Llama models with Transformer Engine.
  • [JAX] Added support for sequence parallelism.
  • [JAX] Added support for RoPE.
  • [JAX] Increased execution speed in GQA.
  • [PaddlePaddle] Added support for grouped query attention (GQA).
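
A minimal sketch of the zero-centered gamma option; with zero_centered_gamma=True the learnable weight is stored as (gamma - 1) and initialized to zero, and the kernel adds one back during normalization. The module name and keyword follow the Transformer Engine PyTorch API; exact defaults are left to the docs.

```python
# Hedged sketch: RMSNorm with a zero-centered gamma parameterization, which
# can be friendlier to low-precision training than a weight initialized at 1.
import torch
import transformer_engine.pytorch as te

norm = te.RMSNorm(1024, zero_centered_gamma=True)
x = torch.randn(32, 1024, device="cuda")
y = norm(x)
print(norm.weight.mean())  # starts near zero under this parameterization
```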

Fixed Issues

  • [PyTorch] Fixed an issue where uninitialized/unused module buffers resulted in increased memory usage with the fp8_model_init API call.
  • [PyTorch] Fixed an issue in MultiheadAttention where the attention type was not properly passed down into granular API calls.
  • [PyTorch] Fixed an issue that caused Transformer Engine to crash when used with PyTorch versions >= 2.0 and < 2.1.
  • [PyTorch] Fixed a convergence issue when using FP8 with activation recompute.
  • [PyTorch] Fixed a numerical bug associated with the use of pipeline parallelism.

Known Issues in This Release

  • FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (Dao-AILab/flash-attention#358). You can work around this issue either by setting the environment variable MAX_JOBS=1 during Transformer Engine installation or by installing FlashAttention v1 (e.g. with the command pip install flash-attn==1.0.9) before attempting to install Transformer Engine.
  • [PyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). For Transformer Engine to keep consistent behavior between versions and backends, FlashAttention is disabled for the use case “cross-attention with causal masking” when FlashAttention 2.1 or later is installed.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

Miscellaneous Changes

FlashAttention v1 is no longer supported in Transformer Engine. Support for it was dropped in version 1.3. The minimum required FlashAttention version is v2.0.6.

v1.3

26 Feb 22:11

Release Notes – Release 1.3

Key Features and Enhancements

  • [PyTorch] Added support for deferred parameter initialization via the device="meta" parameter (see the sketch after this list) in several Transformer Engine modules: Linear, LayerNorm, RMSNorm, LayerNormLinear, LayerNormMLP, MultiheadAttention, and TransformerLayer.
  • [PyTorch] Added support for CPU offloading of weights and activations saved for the backward pass, for additional memory savings.
  • [PyTorch] Added an attn_input_format parameter to TransformerLayer to control the layout of the QKV tensor.
  • [PyTorch] Added support for non-tensor values of the forward parameter when using the checkpoint API call.
  • [PaddlePaddle] Added support for sequence parallelism.
  • [PaddlePaddle] Optimized memory usage for pipeline parallel training.
  • [JAX] Added support for grouped query attention (GQA).
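
A minimal sketch of deferred parameter initialization; device="meta" creates the module without allocating real parameter storage, and the follow-up materialization step shown here (to_empty plus reset_parameters) is an assumption about how the deferred parameters are later realized.

```python
# Hedged sketch: build a large layer on the meta device, then materialize and
# initialize its weights only when GPU memory should actually be allocated.
import transformer_engine.pytorch as te

layer = te.Linear(16384, 16384, device="meta")
print(layer.weight.device)              # meta: no storage allocated yet

layer = layer.to_empty(device="cuda")   # allocate uninitialized storage on the GPU
layer.reset_parameters()                # then initialize the weights (assumed helper)
```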

Fixed Issues

  • [PyTorch] Fixed an issue in LayerNormLinear and Linear where unused copies of weight and bias tensors were not deleted when the Q, K, and V tensors are fused.
  • [PyTorch] Fixed faulty usage of pipeline parallelism with the FusedAttention backend.
  • [PyTorch] Fixed an issue where attention_type was not correctly passed from the MultiheadAttention call to the DotProductAttention call.
  • [PyTorch] Fixed the fused DPA backend reporting spurious NaN errors during the backward pass.
  • [PyTorch] Fixed crashes when running with PyTorch v2.0.1.
  • [PyTorch] Fixed incorrect computation of statistics when training with FP8 in recent versions of PyTorch. For details see #600.
  • [JAX] Fixed crashes when training with FP8 + FSDP.

Known Issues in This Release

  • FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (Dao-AILab/flash-attention#358). You can work around this issue by setting the environment variable MAX_JOBS=1 during Transformer Engine installation.
  • [PyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). In order for Transformer Engine to keep consistent behavior between versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention 2.1 or later is installed.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

Miscellaneous Changes

FlashAttention v1 is no longer supported in Transformer Engine. The minimum required version is v2.0.6.