xFormers

Latest version: v0.0.26.post1

0.0.25

0.0.24

0.0.23

Pre-built binary wheels require PyTorch 2.1.1 (xFormers `0.0.23`) or PyTorch 2.1.2 (xFormers `0.0.23.post1`).
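
A mismatched wheel typically shows up as missing or non-importable C++/CUDA extensions. A quick sanity check is to print both versions and confirm they form a supported pair; the snippet below is a minimal sketch assuming both packages import cleanly (`python -m xformers.info` prints a more detailed report).

```python
# Minimal sanity check: prebuilt xFormers wheels are compiled against one
# specific PyTorch release, so the two versions printed here should form a
# supported pair (e.g. xFormers 0.0.23 with PyTorch 2.1.1,
# xFormers 0.0.23.post1 with PyTorch 2.1.2).
import torch
import xformers

print("PyTorch :", torch.__version__)
print("xFormers:", xformers.__version__)
```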
Fixed
- fMHA: Fixed a bug in the cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the backward pass. This would happen with MQA when one sequence has a query with `length % 64 == 1`
- fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes, and now supports `BlockDiagonalCausalWithOffsetPaddedKeysMask`
Added
- fMHA: Added `LocalAttentionFromBottomRightMask` (local)
- fMHA: Added `LowerTriangularFromBottomRightMask` (causal)
- fMHA: Added `LowerTriangularFromBottomRightLocalAttentionMask` (local + causal)
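
These bias classes are passed as the `attn_bias` argument of `memory_efficient_attention`. Below is a minimal sketch assuming a CUDA device and fp16 inputs; the `window_left`/`window_right` arguments shown for the local mask are illustrative, so check the `fmha.attn_bias` documentation for the exact constructor signatures.

```python
# A minimal sketch of the new bias classes listed above (assumes a CUDA device
# and fp16 inputs; the window sizes are illustrative values).
import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import (
    LocalAttentionFromBottomRightMask,
    LowerTriangularFromBottomRightMask,
)

B, M, H, K = 2, 1024, 8, 64  # batch, sequence length, heads, head dim
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal mask aligned to the bottom-right corner of the attention matrix
# (useful when the query is shorter than the key, e.g. incremental decoding).
out_causal = memory_efficient_attention(
    q, k, v, attn_bias=LowerTriangularFromBottomRightMask()
)

# Sliding-window ("local") attention: each query attends to a window of keys.
out_local = memory_efficient_attention(
    q, k, v,
    attn_bias=LocalAttentionFromBottomRightMask(window_left=128, window_right=0),
)
```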
Removed
- Removed `xformers.triton.sum_strided`

0.0.22

Fixed
- fMHA: Backward pass now works in PyTorch deterministic mode (although slower)
Added
- fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to `memory_efficient_attention`, as illustrated in the sketch after this list; see the documentation for more details
- fMHA: Added experimental support for Local Attention biases to `memory_efficient_attention`
- Added an example of efficient [LLaMa decoding](https://github.com/facebookresearch/xformers/tree/main/examples/llama_inference) using xformers operators
- Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
- Added an efficient RoPE (rotary position embedding) implementation in Triton, to be used in LLM decoding
- Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
- `xformers.info` now indicates the Flash-Attention version used
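
A minimal sketch of the 5-dimensional input format for grouped-query attention, assuming the layout `[batch, seq_len, G, H, head_dim]` with `G` key/value head groups and `H` query heads per group (`G = 1` corresponds to multi-query attention); keys and values keep a single head per group and are `.expand`ed so no memory is duplicated. Refer to the `memory_efficient_attention` documentation for the authoritative description.

```python
# Grouped-query attention via 5-D inputs (assumed layout: [B, M, G, H, K],
# i.e. batch, sequence length, K/V head groups, query heads per group, head dim).
import torch
from xformers.ops import memory_efficient_attention

B, M, G, H, K = 2, 1024, 2, 4, 64
q = torch.randn(B, M, G, H, K, device="cuda", dtype=torch.float16)
# K/V store one head per group; expand shares it across the H query heads
# without copying memory (the expanded dimension has stride 0).
k = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, H, K)
v = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, H, K)

out = memory_efficient_attention(q, k, v)  # output shape: [B, M, G, H, K]
```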
Removed
- fMHA: Removed `smallK` backend support for CPU. `memory_efficient_attention` only works for CUDA/GPU tensors now
- **DEPRECATION**: Many classes in `xformers.factory`, `xformers.triton` and `xformers.components` have been or will be deprecated soon (see tracking issue facebookresearch/xformers#848)

0.0.21

Improved
- fMHA: Updated [flash-attention](https://github.com/Dao-AILab/flash-attention) to v2, with massive performance improvements for both the forward pass and backward pass. This implementation is now used by default when it's available
Bug fixes
- fMHA/cutlass: Fix potential race condition in the FW/BW passes
- fMHA/cutlass: Fix `attn_bias` stride overflow for very long sequences (>32k)
- `LowerTriangularMask` is now backward compatible with older xformers versions
Breaking changes
- `memory_efficient_attention` now expects the `attn_bias` argument to have a head dimension
- `memory_efficient_attention` no longer broadcasts the batch/head dimensions of `attn_bias`. Please use `.expand` if you need to broadcast the bias, as shown in the sketch after this list
- Removed the `causal_diagonal` argument from `BlockDiagonalCausalWithOffsetPaddedKeysMask`
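
A minimal migration sketch for a dense additive bias, assuming a CUDA device and fp16 inputs: a 2-D bias of shape `[M_q, M_kv]` that was previously broadcast implicitly now needs explicit batch and head dimensions, which `.expand` adds without copying memory.

```python
# Dense additive attn_bias after the breaking change: it must carry explicit
# batch and head dimensions ([B, H, M_q, M_kv]); broadcasting is no longer
# implicit, so use .expand (zero-copy) to add the missing dims.
import torch
from xformers.ops import memory_efficient_attention

B, H, M, K = 2, 8, 1024, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

bias = torch.randn(M, M, device="cuda", dtype=torch.float16)  # old-style 2-D bias
bias = bias[None, None].expand(B, H, M, M)                    # new: [B, H, M_q, M_kv]

out = memory_efficient_attention(q, k, v, attn_bias=bias)
```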
Added
- Binary wheels on pypi/conda now contain H100 kernels
- fMHA: Added a backend specialized for decoding that does not use TensorCores - useful when not using multiquery

**NOTE**: Binary wheels are now provided only for PyTorch 2 with CUDA 11.8. It is still possible to use xFormers with older versions of PyTorch by building from source or using conda.

0.0.20

Improved
- fMHA/cutlass (backward): Massive performance improvements when `batch_size * num_heads` is low (10x+)
- fMHA/cutlass: Further performance improvements for both the forward & backward kernels
- fMHA (backward): Now dispatching to cutlass when `embed_dim>64`
- fMHA: Updated Flash-Attention to `v1.0.5`
Added
- fMHA now runs on H100 (support is experimental)
