Transformer-engine

Latest version: v2.1.0

2.0

Key Features and Enhancements

- [C] Added MXFP8 support in functions for casting, GEMMs, normalization, activations.
- [C] Added generic API for quantized tensors, including generic quantize and dequantize functions.
- [C] Exposed cuDNN LayerNorm and RMSNorm kernels.
- [pyTorch] Added MXFP8 recipe.
- [pyTorch] Added MXFP8 support in Linear, LayerNormLinear, LayerNormMLP, and TransformerLayer modules, and in the operation-based API.
- [pyTorch] Changed the default quantization scheme from FP8 to MXFP8 for Blackwell GPUs.
- [pyTorch] Added a custom tensor class for MXFP8 data.
- [pyTorch] Reduced CPU overhead in FP8/MXFP8 execution.
- [pyTorch] Enabled efficient handling of FP8 parameters with PyTorch FSDP2.
- [pyTorch] Expanded the support matrix for Sliding Window Attention.
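As a rough illustration of the MXFP8 idea behind several of the items above, the sketch below computes one shared power-of-two scale per block of values. The block size of 32, the E4M3 maximum of 448, and the rule used to pick the exponent are assumptions based on the general microscaling (MX) format description, not on these notes. `mx_block_exponents` and `scale_blocks` are hypothetical helpers, not part of the Transformer Engine API, and the sketch does not emulate FP8 rounding.

```python
import math

FP8_E4M3_MAX = 448.0  # assumed largest finite E4M3 magnitude
BLOCK_SIZE = 32       # assumed MX block size


def mx_block_exponents(values, block_size=BLOCK_SIZE, fp8_max=FP8_E4M3_MAX):
    """Return one exponent e per block so that every value / 2**e fits in fp8_max."""
    exponents = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        if amax == 0.0:
            exponents.append(0)  # all-zero block: any scale works
            continue
        e = math.ceil(math.log2(amax / fp8_max))
        # Guard against floating-point error in log2 near exact powers of two.
        while amax / (2.0 ** e) > fp8_max:
            e += 1
        exponents.append(e)
    return exponents


def scale_blocks(values, exponents, block_size=BLOCK_SIZE):
    """Divide each value by its block's scale (the step before a cast to FP8)."""
    return [v / (2.0 ** exponents[i // block_size]) for i, v in enumerate(values)]
```

Because each block carries its own scale, a block of small values and a block of large values can both use most of the FP8 range, which is the main contrast with a single per-tensor scale.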

Fixed Issues

- [pyTorch] Fixed bugs in capturing CUDA Graphs for MoE models.
- [pyTorch] Fixed errors with the FP8 state when loading HuggingFace checkpoints.

Known Issues in This Release

- [pyTorch] Overlapping tensor-parallel communication with Userbuffers is not supported with MXFP8.
- [pyTorch] When running linear modules with MXFP8, the memory footprint and tensor-parallel communication volume are larger than necessary.
- [pyTorch] Userbuffers support in the operation-based API is disabled.

Breaking Changes in This Release

- [C] Updated minimum requirements to CUDA 12.1 and cuDNN 9.3.
- [PaddlePaddle] Removed PaddlePaddle integration.
- [pyTorch] Changed the default quantization from FP8 to MXFP8 for Blackwell GPUs.
- [pyTorch] Removed support for exporting ONNX models. Support for ONNX export will be re-enabled in a future release.
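The new minimum requirements can be checked before upgrading. The following is a minimal sketch that compares version strings against the CUDA 12.1 and cuDNN 9.3 minimums listed above. `parse_version` and `meets_minimum` are hypothetical helpers; how the version strings are obtained on a given system is left open.

```python
MIN_CUDA = (12, 1)   # minimum CUDA version stated for release 2.0
MIN_CUDNN = (9, 3)   # minimum cuDNN version stated for release 2.0


def parse_version(text):
    """Turn '12.1' or '9.3.0' into a tuple of ints so comparison is numeric."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(cuda_version, cudnn_version):
    """Return True if both version strings satisfy the 2.0 minimums."""
    return (parse_version(cuda_version) >= MIN_CUDA
            and parse_version(cudnn_version) >= MIN_CUDNN)
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "9.10" as older than "9.3".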

Deprecated Features

There are no deprecated features in this release.

1.13

Key Features and Enhancements

- [C/PyTorch/Jax] Added support for THD layout for MQA/GQA.
- [Jax] Expanded FFI (Foreign Function Interface) support to include quantization, transpose, layernorms, fused-attention, and CUDA graphs; fixed miscellaneous bugs in the existing FFI implementations.
- [Jax] Added support for Ring attention for context parallelism.
- [PyTorch] Expanded support for the Sequential/Operations Based API to include activations, communication overlap, normalizations, and other fusions.
- [PyTorch] Made miscellaneous fixes to reduce CPU overhead during execution.
- [PyTorch] Leveraged cuDNN 9.6+ to reduce memory usage for THD input format to attention.
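For readers unfamiliar with the THD layout mentioned above, here is a small sketch of the packing idea. It assumes THD means (total tokens, heads, head dimension), with sequences of different lengths concatenated along the first dimension and tracked by cumulative sequence lengths. The helper names are hypothetical and the tokens are plain Python values rather than tensors.

```python
from itertools import accumulate


def cu_seqlens(seq_lens):
    """Cumulative sequence lengths: the boundaries of each sequence in the packed dim."""
    return [0] + list(accumulate(seq_lens))


def pack_thd(sequences):
    """Concatenate variable-length sequences into one flat token list, without padding."""
    packed = [tok for seq in sequences for tok in seq]
    return packed, cu_seqlens([len(seq) for seq in sequences])


def unpack_thd(packed, boundaries):
    """Recover the original sequences from the packed list and its boundaries."""
    return [packed[a:b] for a, b in zip(boundaries, boundaries[1:])]
```

Packing avoids spending memory and compute on padding tokens, which is why the layout matters for batches with uneven sequence lengths.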

Fixed Issues

- [PyTorch] Fixed a crash that could occur when using FlashAttention with context parallelism.
- [C/Jax] Adopted 64-bit offsets to fix overflow for large tensors in the cuDNN attention back end.
- [C/Jax] Fixed build when using clang compiler to build JAX native extensions.
- [PyTorch] Fixed a crash when importing transformer-engine in CPU-only systems.
- [PyTorch] Fixed a crash when using context parallelism with RoPE.
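The 64-bit offset fix above addresses a general hazard: flat element offsets into a large tensor can exceed what a signed 32-bit integer holds. The sketch below shows the arithmetic only. The tensor shape used in the usage note is a made-up example, and the helpers are hypothetical, not part of Transformer Engine or cuDNN.

```python
from math import prod

INT32_MAX = 2**31 - 1


def needs_64bit_offsets(shape):
    """True if the last element's flat offset does not fit in a signed 32-bit int."""
    return prod(shape) - 1 > INT32_MAX


def wrap_int32(offset):
    """Emulate the value a signed 32-bit offset would hold after two's-complement overflow."""
    return (offset + 2**31) % 2**32 - 2**31
```

For a hypothetical tensor of shape (64, 32, 32768, 128), the element count is 2**33, so `needs_64bit_offsets` returns True and the last offset, if stored in 32 bits, would wrap to a negative value.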

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

- Transformer Engine support for the PaddlePaddle framework is deprecated, and will be fully removed in version 2.0.
- Support for exporting Transformer Engine modules via ONNX is deprecated, and will be removed in version 2.0. This feature will be supported again in a later minor release of version 2.

1.12

Key Features and Enhancements

- [pyTorch] Added rotary_base argument for RoPE instead of hard-coding the value to 10000.
- [pyTorch] Added support for the pool argument in the make_graphed_callables API.
- [pyTorch] Made miscellaneous minor improvements to mitigate CPU overhead.
- [pyTorch] Expanded fused RoPE kernel support to include context parallelism and the “thd” QKV format.
- [pyTorch] Made flash-attn an optional dependency.
- [JAX] Added support for sliding window attention.
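To show what a configurable rotary base changes, the sketch below computes RoPE inverse frequencies and applies the rotation to one vector in plain Python. It follows the common formulation in which frequency i equals base**(-2i/dim). Pairing consecutive elements is one of several conventions and is an assumption here, as are the helper names, which are not the Transformer Engine API.

```python
import math


def rope_inv_freq(dim, rotary_base=10000.0):
    """Inverse frequencies rotary_base**(-2i/dim) for i in range(dim // 2)."""
    return [rotary_base ** (-(2 * i) / dim) for i in range(dim // 2)]


def rope_rotate(vec, position, rotary_base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position * inv_freq."""
    out = []
    for i, freq in enumerate(rope_inv_freq(len(vec), rotary_base)):
        angle = position * freq
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(angle) - y * math.sin(angle),
                x * math.sin(angle) + y * math.cos(angle)]
    return out
```

A larger base lowers every frequency except the first, so the rotation angle grows more slowly with position. This is why the base is often raised for long-context models.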

Fixed Issues

- [pyTorch/C] Fixed window size calculation when using cuDNN attention backend.
- [pyTorch] Fixed miscellaneous bugs in the flash-attn version 3 backend.
- [pyTorch] Fixed an issue using the flash-attn backend with Context Parallelism.
- [pyTorch] Fixed a numerical error when using FP8 with activation recompute.
- [pyTorch] Fixed an issue in the backward pass of the GroupedLinear class when weights don’t require gradient.
- [JAX] Fixed a numerical bug in the cuDNN attention backend when using Context Parallelism.

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

1.11

Key Features and Enhancements

- [pyTorch] Added dtensor support for optimizers.
- [pyTorch] Added context parallel implementation with QKV all-to-all collectives.
- [pyTorch] Added support for CPU offloading when using FP8 attention.
- [pyTorch] Implemented padding and unpadding modules for FP8 that improve e2e performance of MoE models by ~2%.
- [C/pyTorch] Added support for permutation operations for MoE and exposed them in the C API.
- [pyTorch] Added support for RoPE when using FP8 attention.
- [pyTorch] Added support for FlashAttention-3.
- [JAX] Implemented context parallel fused attention using allgather and reduce-scatter collectives.
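The padding and unpadding item above can be pictured as follows: each expert's chunk of rows is rounded up to an aligned size before the FP8 GEMM and trimmed back afterwards. This is a minimal sketch on lists of rows. The alignment of 16 is an assumption, and `padded_splits`, `pad_rows`, and `unpad_rows` are hypothetical names, not the modules added in this release.

```python
def padded_splits(splits, align=16):
    """Round each expert's row count up to a multiple of `align` (0 stays 0)."""
    return [-(-n // align) * align for n in splits]


def pad_rows(rows, splits, align=16):
    """Append zero rows after each expert's chunk so chunk sizes match padded_splits."""
    width = len(rows[0]) if rows else 0
    out, start = [], 0
    for n, n_pad in zip(splits, padded_splits(splits, align)):
        out.extend(rows[start:start + n])
        out.extend([[0.0] * width for _ in range(n_pad - n)])
        start += n
    return out


def unpad_rows(rows, splits, align=16):
    """Drop the zero rows added by pad_rows, restoring the original layout."""
    out, start = [], 0
    for n, n_pad in zip(splits, padded_splits(splits, align)):
        out.extend(rows[start:start + n])
        start += n_pad
    return out
```

An expert that receives zero tokens contributes zero rows before and after padding, so the round trip leaves the data unchanged.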

Fixed Issues

- [pyTorch] Fixed a crash in fused adam optimizer when master parameters are not set.
- [pyTorch] Fixed a crash when using activation recompute with Python 3.10.
- [pyTorch] Made miscellaneous fixes in the logic to select the correct attention backend.

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

1.10

Key Features and Enhancements

- [pyTorch] Added an option to use keyword arguments with CUDA graphs.
- [pyTorch] Implemented a new load-balanced offloading algorithm to utilize the CPU/GPU interconnect bandwidth to the maximum extent.
- [pyTorch] Added support for multi-latent attention.
- [pyTorch] Added additional documentation, scripts, and benchmarks for the attention backend.
- [pyTorch] Added context-parallel implementation with KV allgather for causal attention.
- [pyTorch] Added support for data type casting in the fused Adam kernel.
- [pyTorch] Added arguments for cumulative and maximum sequence lengths to the TransformerLayer and MultiheadAttention APIs.
- [pyTorch] Added support for padding mask in unfused backend for dot product attention.
- [pyTorch] Expanded operation support in the fusion API (transformer_engine.pytorch.ops).
- [pyTorch] Made several improvements to reduce CPU overhead during execution.
- [PaddlePaddle] Added an option to run dot product attention deterministically.
- [JAX] Added support for non-deterministic algorithms in the cuDNN flash attention backend for improved performance.
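The padding-mask and sequence-length items above are closely related: a padding mask can be derived from per-sequence lengths and a maximum length, and the lengths can be read back from the mask. The sketch below assumes that True marks a padded (masked-out) position. That convention and the helper names are assumptions for illustration, not the documented API.

```python
def padding_mask(seq_lens, max_seqlen=None):
    """One boolean row per sequence; True marks a padded position (assumed convention)."""
    if max_seqlen is None:
        max_seqlen = max(seq_lens)  # requires at least one sequence
    return [[pos >= n for pos in range(max_seqlen)] for n in seq_lens]


def seqlens_from_mask(mask):
    """Recover each sequence length by counting unmasked positions."""
    return [row.count(False) for row in mask]
```

Passing lengths instead of a dense mask lets a backend skip padded positions entirely rather than computing and then discarding them.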

Fixed Issues

- [pyTorch] Fixed miscellaneous bugs in communication-GEMM overlap with Userbuffers.
- [pyTorch] Removed an additional copy of weights stored when using CPU offloading.
- [pyTorch] Fixed a crash when running non-causal training with context parallelism.
- [pyTorch] Fixed the calculation of tensor parallel size when using MQA/GQA.
- [pyTorch] Fixed a crash when using context parallelism with the THD format.
- [pyTorch] Fixed a crash in CUDA graphs when skipping warm-up iterations.
- [pyTorch] Fixed a bug in TransformerLayer for the cross attention case where arguments were incorrectly propagated to DotProductAttention.
- [C] Hid arbitrary symbols exposed globally in the shared object in order to avoid symbol conflict errors, which could cause a crash during library loading and imports.

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

1.9

Key Features and Enhancements

* [PyTorch] Added support for sliding window attention in the cuDNN backend.
* [PyTorch] Added an experimental torch.nn.Sequential style API for automatic operation based fusions.
* [C/PyTorch] Added support for bottom-right aligned diagonal causal mask.
* [C/PyTorch] Added support for grouped GEMM for MoE training.
* [JAX] Added support for THD attention format.
* [PaddlePaddle] Added support for CUDA graphs.
* [PaddlePaddle] Added support for PaddlePaddle versions >= 2.6.1.
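The sliding window and bottom-right aligned causal mask items above can be expressed with one small mask builder. In this sketch True means the query may attend to the key, a window entry of -1 means unbounded on that side, and bottom-right alignment shifts the diagonal by `skv - sq`. These conventions and the function name are assumptions chosen for illustration.

```python
def attention_mask(sq, skv, window=(-1, 0), bottom_right=False):
    """Return sq x skv booleans; True means query i may attend to key j."""
    left, right = window
    shift = skv - sq if bottom_right else 0
    mask = []
    for i in range(sq):
        diag = i + shift  # key index on the (possibly shifted) diagonal for query i
        row = []
        for j in range(skv):
            ok_left = left < 0 or j >= diag - left
            ok_right = right < 0 or j <= diag + right
            row.append(ok_left and ok_right)
        mask.append(row)
    return mask
```

With the default window of (-1, 0) this is a plain causal mask. The two alignments differ only when the query and key lengths differ, for example when decoding with a key/value cache.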

Fixed Issues

* [PyTorch] Fixed incorrect outputs when handling non-contiguous input tensors.
* [PyTorch] Fixed a hang in the initialize_ub function during multi-node runs, along with miscellaneous improvements in communication-GEMM overlap with userbuffers.
* [PyTorch] Fixed convergence when using CPU offloading.
* [PyTorch] Fixed a crash that occurred when using MoE, when an expert receives 0 tokens.
* [JAX] Fixed a crash in newer JAX versions which restricted the output format of HLO lowering.
* [PaddlePaddle] Fixed a crash when using the standalone column parallel linear API.
* Fixed a numerical bug in the QGeLU activation.
* Fixed a compilation bug in the core library with CUDA 12.1.
* Fixed a bug selecting tuned RMSNorm kernels.
* Fixed performance overheads by reducing the number of calls to the CUDA driver.

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.
