Transformer-engine

Latest version: v1.11.0



1.1

1.1.0

Key Features and Enhancements

* [pyTorch] Memory usage is reduced when using the `fp8_model_init` API during inference.
* [pyTorch] Memory usage is reduced when using the `LayerNormLinear`, `LayerNormMLP`, and `TransformerLayer` APIs.
* [JAX] Transformer Engine now uses JAX's new Custom Partitioning mechanism to parallelize custom ops.
* [JAX] The attention operation’s performance is improved when using cuDNN version 8.9.6 or greater.
* [C/C++] Transformer Engine can now be built as a subproject.

Fixed Issues

* Fixed an issue where, in some cases, passing non-contiguous tensors as Q, K, or V to `DotProductAttention` would result in the error `Exception: The provided qkv memory layout is not supported!`.

Known Issues in This Release

* FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). To work around this issue, either set the `MAX_JOBS=1` environment variable during Transformer Engine installation or install FlashAttention v1 (e.g. with `pip install flash-attn==1.0.9`) before attempting to install Transformer Engine.
* [pyTorch] FlashAttention v2.1 has changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). For Transformer Engine to preserve consistent behavior between versions and backends, FlashAttention is disabled for this use case (i.e. cross-attention with causal masking) when FlashAttention version 2.1+ is installed.
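
To make the behavior change concrete, here is a small illustrative sketch in plain Python (not Transformer Engine or FlashAttention code) of the two mask alignments described in the linked FlashAttention note. Before v2.1 the causal mask is aligned to the top-left corner of the attention matrix; from v2.1 on it is aligned to the bottom-right corner when the query and key sequence lengths differ. The `causal_mask` helper and its `align` parameter are hypothetical names used only for this illustration.

```python
def causal_mask(seqlen_q, seqlen_k, align="bottom_right"):
    """Return a seqlen_q x seqlen_k mask as nested lists (1 = attend, 0 = masked)."""
    if align not in ("top_left", "bottom_right"):
        raise ValueError("align must be 'top_left' or 'bottom_right'")
    # Bottom-right alignment shifts the diagonal so that the last query
    # position can attend to every key position.
    offset = seqlen_k - seqlen_q if align == "bottom_right" else 0
    return [
        [1 if j <= i + offset else 0 for j in range(seqlen_k)]
        for i in range(seqlen_q)
    ]


# Cross-attention shape: 2 query positions, 5 key positions.
top_left = causal_mask(2, 5, align="top_left")
bottom_right = causal_mask(2, 5, align="bottom_right")

for name, mask in (("top-left", top_left), ("bottom-right", bottom_right)):
    print(name)
    for row in mask:
        print(" ", row)
```

When the two sequence lengths are equal (self-attention), both alignments produce the same mask, which is why only cross-attention with causal masking is affected.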

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

1.0

1.0.0

Key Features and Enhancements

* [pyTorch] Expanded the support for different layouts in `DotProductAttention`.
* [pyTorch] Added support for packed input for the FlashAttention backend of `DotProductAttention`.
* [pyTorch] Improved support for the KV cache during inference via the new `InferenceParams` class.
* [pyTorch] Improved support for parallel state handling for model parallelism via the new `CUDARNGStatesTracker` class.
* [pyTorch] Added experimental support for the FP8 Tensor type and a new context manager, `fp8_model_init`. When enabled, Transformer Engine modules created inside the `fp8_model_init` region hold only FP8 copies of their parameters, as opposed to the default behavior where both higher-precision and FP8 copies are present. This may result in lower memory consumption and is especially useful for scenarios like:
  * full model training using an optimizer with master weights, where the high-precision copies of the weights are already present in the optimizer.
  * inference, where only the FP8 copies of the parameters are used.
  * LoRA-like fine-tuning, where the main parameters of the model do not change.
* [JAX] Added the ability to set the dropout rate for the activation output in `LayerNormMLP`.
* [Paddle] Added documentation.
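
As an illustration of the `fp8_model_init` context manager mentioned above, the following is a minimal sketch for the inference scenario. It assumes Transformer Engine 1.0 or later with the PyTorch extension installed and an FP8-capable GPU. The helper names `build_fp8_only_linear` and `run_inference` are hypothetical and exist only for this example. The imports are done lazily inside the functions so that the snippet can be loaded on a machine without Transformer Engine.

```python
def build_fp8_only_linear(in_features, out_features):
    """Create a Transformer Engine Linear layer that keeps only FP8 weights."""
    # Lazy import: Transformer Engine requires a CUDA-enabled PyTorch install.
    import transformer_engine.pytorch as te

    # Modules created inside this region hold only FP8 copies of their
    # parameters (no higher-precision copy is kept alongside them).
    with te.fp8_model_init(enabled=True):
        layer = te.Linear(in_features, out_features)
    return layer


def run_inference(layer, inp):
    """Run a forward pass with FP8 execution enabled and gradients disabled."""
    import torch
    import transformer_engine.pytorch as te

    with torch.no_grad(), te.fp8_autocast(enabled=True):
        return layer(inp)
```

A possible call, assuming a CUDA device is available, would be `run_inference(build_fp8_only_linear(768, 768), torch.randn(32, 768, device="cuda"))`. FP8 execution places alignment constraints on tensor dimensions, so sizes that are multiples of 16 are a safe choice here.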

Fixed Issues

* [pyTorch] Multiple fixes for activation recomputation when using FP8.
* [pyTorch] Multiple fixes specific to the usage of Transformer Engine by Megatron-LM and NeMo.
* [pyTorch] Fixed a crash occurring when trying to use `LayerNormLinear` with the `return_layernorm_output` option set.
* [pyTorch] Fixes to the ONNX export of the attention layer.
* [pyTorch] Fixed a crash happening when using RoPE.
* [JAX] Fixed a crash occurring in some cases when using cross attention with FSDP.
* [JAX] Fixed the wrong handling of the FP8 scaling factor.

Known Issues in This Release

* FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). To work around this issue, either set the `MAX_JOBS=1` environment variable during Transformer Engine installation or install FlashAttention v1 (e.g. with `pip install flash-attn==1.0.9`) before attempting to install Transformer Engine.
* [pyTorch] In some cases, passing non-contiguous tensors as Q, K, or V to `DotProductAttention` may result in the error `Exception: The provided qkv memory layout is not supported!`. This will be fixed in a future release. In the meantime, the workaround is to call `.contiguous()` on those tensors before passing them to `DotProductAttention`.
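
The `.contiguous()` workaround above can be sketched as follows. The `ensure_contiguous` helper and the `demo` function are hypothetical names introduced for this example, and the tensor shapes are arbitrary. The sketch stops at producing contiguous tensors; the call into `DotProductAttention` itself is not shown because its construction depends on the model configuration.

```python
def ensure_contiguous(*tensors):
    """Return the given tensors, each made contiguous in memory.

    In PyTorch, Tensor.contiguous() returns the tensor itself when it is
    already contiguous, so this only copies data where it is needed.
    """
    return tuple(t.contiguous() for t in tensors)


def demo():
    """Show the helper on a tensor made non-contiguous by a transpose."""
    import torch  # lazy import so the helper above has no hard dependency

    # transpose() swaps strides without moving data, so q is non-contiguous.
    q = torch.randn(4, 8, 2, 16).transpose(0, 1)
    k = torch.randn(8, 4, 2, 16)
    v = torch.randn(8, 4, 2, 16)
    was_contiguous = q.is_contiguous()

    q, k, v = ensure_contiguous(q, k, v)
    # q, k, and v can now be passed to DotProductAttention.
    return was_contiguous, all(t.is_contiguous() for t in (q, k, v))
```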

Breaking Changes in This Release

* The experimental support for TensorFlow has been removed.
* [pyTorch] The deprecated `TransformerLayer` arguments `attention_softmax_in_fp32` and `apply_query_key_layer_scaling` were removed.
* [pyTorch] The deprecated argument `skip_weight_param_allocation` in the `Linear` and `LayerNormLinear` APIs has been removed. Consequently, the `weight` and `bias` arguments in the `forward` method of those APIs have also been removed.
* [pyTorch] Support for loading old/deprecated checkpoint formats where the extra states for FP8 are not serialized into `BytesIO` or `torch.Tensor` objects has been removed.
* [JAX] The deprecated modules and functions `DenseGeneral`, `LayerNorm`, `LayerNormDenseGeneral`, `LayerNormMLP`, `TransformerEngineBase`, `extend_logical_axis_rules`, `MultiHeadAttention`, `RelativePositionBiases`, `TransformerLayer`, and `TransformerLayerType` have been removed from `transformer_engine.jax` and must now be imported only from `transformer_engine.jax.flax`.

Deprecated Features

There are no deprecated features in this release.

