Transformer Engine

1.5

Key Features and Enhancements

- [pyTorch] Added support for non-reentrant mode for activation recompute in the `checkpoint` API (see the sketch after this list).
- [pyTorch] Added support for rectangular matrices in the unfused softmax backend in order to support speculative decoding.
- [pyTorch] Added the `inference_params` argument to the `DotProductAttention` API to support kv-caching.
- [JAX] Added the `DotProductAttention` API.
- [JAX] Expanded RoPE support using the `rotary_pos_emb_group_method` argument.
- [PaddlePaddle] Added support for RMSNorm.
- [PaddlePaddle] Added support for RoPE.
- [PaddlePaddle] Added support for SwiGLU.
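
As a rough illustration of the first item above, the non-reentrant recompute mode might be used as in the sketch below. This is a minimal sketch rather than the library's documented example: the `use_reentrant=False` keyword and the module/shape choices are assumptions modeled on `torch.utils.checkpoint`, so check the `transformer_engine.pytorch.checkpoint` documentation for the exact signature.

```python
# Minimal sketch of activation recompute with the checkpoint API (assumptions:
# the non-reentrant mode is selected with use_reentrant=False, and the layer
# sizes below are purely illustrative).
import torch
import transformer_engine.pytorch as te

layer = te.TransformerLayer(
    hidden_size=1024,
    ffn_hidden_size=4096,
    num_attention_heads=16,
).cuda()

# [sequence, batch, hidden] input, the default layout for TransformerLayer.
x = torch.randn(128, 2, 1024, device="cuda", requires_grad=True)

# Recompute the layer's activations in the backward pass instead of storing them.
y = te.checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```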

Fixed Issues

- [pyTorch] Fixed a numerical issue with storing weights in FP8 via the `fp8_model_init` API.

Known Issues in This Release

- FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue by setting the environment variable `MAX_JOBS=1` during Transformer Engine installation.
- [pyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). To keep Transformer Engine behavior consistent between versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention v2.1 or later is installed.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

- [JAX] The arguments `num_heads`, `dropout_rate`, `output_layernorm`, `apply_residual_connection_post_layernorm`, and `fuse_qkv` are deprecated in the `MultiHeadAttention` API. They are replaced respectively with `num_attention_heads`, `attention_dropout`, `input_layernorm`, `return_layernorm_output`, and `fused_qkv_params`, as shown in the sketch below.
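
A constructor call using the new argument names might look like the following minimal sketch. The `transformer_engine.jax.flax` import path and the `head_dim` field are assumptions here; only the renamed keywords come from the deprecation note above, so verify required fields against the JAX API documentation.

```python
# Minimal sketch of MultiHeadAttention with the new (non-deprecated) argument
# names; head_dim and the import path are assumptions for illustration only.
from transformer_engine.jax.flax import MultiHeadAttention

attn = MultiHeadAttention(
    head_dim=64,
    num_attention_heads=16,         # replaces the deprecated num_heads
    attention_dropout=0.1,          # replaces the deprecated dropout_rate
    input_layernorm=True,           # replaces the deprecated output_layernorm
    return_layernorm_output=False,  # replaces apply_residual_connection_post_layernorm
    fused_qkv_params=True,          # replaces the deprecated fuse_qkv
)
```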

Miscellaneous Changes

There are no miscellaneous changes in this release.

1.4

Key Features and Enhancements

- [C/pyTorch] Added support for QuickGELU activation.
- [C/pyTorch] Added a fused RoPE implementation for improved performance.
- [C/pyTorch] Added support for zero-centered gamma in `RMSNorm` (see the sketch after this list).
- [C/pyTorch] Added support for ALiBi slopes to all attention backends.
- [docs/pyTorch] Added a tutorial on accelerating HF Llama models with Transformer Engine.
- [JAX] Added support for sequence parallelism.
- [JAX] Added support for RoPE.
- [JAX] Increased execution speed in GQA.
- [PaddlePaddle] Added support for grouped query attention (GQA).
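
For the zero-centered gamma item above, usage might look like the minimal sketch below. The `zero_centered_gamma` keyword mirrors the existing option on `LayerNorm` and is an assumption here, as are the tensor shapes; consult the `transformer_engine.pytorch.RMSNorm` documentation for the exact signature.

```python
# Minimal sketch of RMSNorm with zero-centered gamma (keyword name assumed to
# match the existing LayerNorm option). With this option the learnable weight
# is stored centered around zero and applied as (1 + gamma).
import torch
import transformer_engine.pytorch as te

hidden_size = 1024
norm = te.RMSNorm(hidden_size, eps=1e-5, zero_centered_gamma=True).cuda()

x = torch.randn(32, hidden_size, device="cuda")
y = norm(x)
```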

Fixed Issues

- [pyTorch] Fixed an issue where uninitialized/unused module buffers resulted in increased memory usage with the `fp8_model_init` API call.
- [pyTorch] Fixed an issue in `MultiheadAttention` where the attention type was not properly passed down into granular API calls.
- [pyTorch] Fixed an issue that caused Transformer Engine to crash when used with pyTorch version >= 2.0 and < 2.1.
- [pyTorch] Fixed a convergence issue when using FP8 with activation recompute.
- [pyTorch] Fixed a numerical bug associated with use of pipeline parallelism.

Known Issues in This Release

- FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue either by setting the environment variable `MAX_JOBS=1` during Transformer Engine installation or by installing FlashAttention v1 (e.g. with the command `pip install flash-attn==1.0.9`) before attempting to install Transformer Engine.
- [pyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). To keep Transformer Engine behavior consistent between versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention v2.1 or later is installed.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

Miscellaneous Changes

FlashAttention v1 is no longer supported in Transformer Engine; support for it was dropped in version 1.3. The minimum required FlashAttention version is v2.0.6.

1.3

Key Features and Enhancements

- [pyTorch] Added support for deferred parameter initialization in several Transformer Engine modules via the `device="meta"` parameter (a usage sketch follows this list):
  - `Linear`
  - `LayerNorm`
  - `RMSNorm`
  - `LayerNormLinear`
  - `LayerNormMLP`
  - `MultiheadAttention`
  - `TransformerLayer`
- [pyTorch] Added support for CPU offloading of weight and activation tensors saved for the backward pass, for additional memory savings.
- [pyTorch] Added an `attn_input_format` parameter to `TransformerLayer` to specify the layout of the QKV tensor.
- [pyTorch] Added support for non-tensor values of the forward parameter when using the `checkpoint` API call.
- [PaddlePaddle] Added support for sequence parallelism.
- [PaddlePaddle] Optimized memory usage for pipeline parallel training.
- [JAX] Added support for grouped query attention (GQA).
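
As a rough illustration of the deferred-initialization item above, a module can be constructed on the meta device and materialized later. The materialization step below follows the standard PyTorch meta-device workflow (`to_empty` plus loading or re-initializing weights) and is an assumption rather than a Transformer-Engine-specific API.

```python
# Minimal sketch of deferred parameter initialization via device="meta".
import torch
import transformer_engine.pytorch as te

# Parameters are created on the meta device: shapes and dtypes only, no storage.
linear = te.Linear(4096, 4096, device="meta")
assert linear.weight.is_meta

# Materialize on the target device later. After to_empty() the values are
# uninitialized, so load them from a checkpoint or re-initialize before use.
linear = linear.to_empty(device="cuda")
```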

Fixed Issues

- [pyTorch] In `LayerNormLinear` and `Linear`, unused copies of weight and bias tensors were not deleted for the case when Q, K, and V tensors are fused.
- [pyTorch] Pipeline parallelism did not work correctly with the FusedAttention backend.
- [pyTorch] `attention_type` was not correctly passed from the `MultiheadAttention` call to the `DotProductAttention` call.
- [pyTorch] The fused DPA backend reported spurious NaN errors during the backward pass.
- [pyTorch] Crashes when running with PyTorch v2.0.1.
- [pyTorch] Statistics could be computed incorrectly when training with FP8 in recent versions of pyTorch. For details see https://github.com/NVIDIA/TransformerEngine/issues/600.
- [JAX] Crashes when training in FP8 + FSDP.

Known Issues in This Release

- FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue by setting the environment variable `MAX_JOBS=1` during Transformer Engine installation.
- [pyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). To keep Transformer Engine behavior consistent between versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention v2.1 or later is installed.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

Miscellaneous Changes

FlashAttention v1 is no longer supported in Transformer Engine. The minimum required FlashAttention version is v2.0.6.

1.2.1

Fixed Issues

- Statistics could be computed incorrectly when training with FP8 in recent versions of pyTorch. For details see https://github.com/NVIDIA/TransformerEngine/issues/600.

1.2

1.2.0

Key Features and Enhancements

- [pyTorch] Sliding window support is added for DotProductAttention.
- [pyTorch] Performance of DotProductAttention is increased on Hopper GPUs by utilizing cuDNN.
- [pyTorch] Support for the Falcon architecture is added in TransformerLayer via the new option `parallel_attention_mlp` (see the sketch after this list).
- [pyTorch] Checkpointing logic when using `fp8_model_init` is improved.
- [JAX] Support is added for controlling the SM margin in LayerNorm and RMSNorm kernels via the environment variables `NVTE_FWD_LAYERNORM_SM_MARGIN` and `NVTE_BWD_LAYERNORM_SM_MARGIN`.
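
For the `parallel_attention_mlp` item above, construction might look like the minimal sketch below. Only the option name comes from the release note; the Falcon-like sizes and input layout are illustrative assumptions.

```python
# Minimal sketch of a Falcon-style layer: parallel_attention_mlp=True runs the
# attention and MLP branches on the same layer input instead of sequentially.
import torch
import transformer_engine.pytorch as te

layer = te.TransformerLayer(
    hidden_size=4544,            # Falcon-7B-like width, for illustration
    ffn_hidden_size=4 * 4544,
    num_attention_heads=71,
    parallel_attention_mlp=True,
).cuda()

x = torch.randn(128, 2, 4544, device="cuda")  # [sequence, batch, hidden]
y = layer(x)
```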

Fixed Issues

- Weight gradient could be computed incorrectly in some cases when FP8 execution and sequence parallelism were used together.
- Statistics were computed incorrectly during FP8 calibration.
- Using `torch.compile` on the DotProductAttention module caused a crash.
- Rotary embeddings did not operate correctly during pipeline-parallel inference.
- An incorrect mask type was used by the decoder in encoder-decoder architectures.
- Exporting Transformer Engine modules to ONNX in recent versions of pyTorch did not work correctly.

Known Issues in This Release

- FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue either by setting the environment variable `MAX_JOBS=1` during Transformer Engine installation, or by installing FlashAttention v1 (e.g. by running `pip install flash-attn==1.0.9`) before attempting to install Transformer Engine.
- [pyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention. (See https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference.) To keep Transformer Engine behavior consistent between versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention v2.1 or later is installed.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.
