Natten

Latest version: v0.17.5

0.17.5

* Added support for even-sized kernels!
  * NATTEN now allows any kernel size greater than 1 in fused ops.
  * Only available in Fused NA (both the CUTLASS 2.X kernels and Flex) for now.
  * NOTE: any even kernel size `2r` forces each token to attend to `r` tokens on the left,
    itself, and `r - 1` tokens on the right (in non-corner cases); see the sketch after this list.
* Added Flex Attention as a backend.
  * You can now use Flex Attention instead of FNA through NATTEN directly.
  * Just import `use_flex_attention()` from `natten`, call it, and enjoy potentially significant
    speedups on newer architectures (see the example after this list).
  * With support for additional KV tokens.
  * NOTE: we've been observing some instabilities with Flex Attention when using torch 2.6. We'll
    try to raise the issue with the PyTorch team, but please proceed with caution.
* Better precision in fused ops with additional KV.
* Torch 2.6 support.
* Dropped support for CTK < 12.0 and torch < 2.5.
* Dropped deprecated ops (`natten.functional.natten*d{qk,qkrpb,av}`).
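
A worked example of the even-kernel note above, as a plain-Python sketch (this helper is illustrative, not a NATTEN API); the edge handling shown assumes the window is shifted back inside the sequence so that exactly `kernel_size` tokens are always attended to:

```python
# Illustrative helper (not part of NATTEN): indices attended to by query `i` in a
# 1-D sequence of length `length` for an even kernel size 2*r.
def even_kernel_window(i: int, kernel_size: int, length: int) -> list[int]:
    r = kernel_size // 2
    start = i - r
    # Assumed edge handling: shift the window back inside the sequence so that
    # exactly `kernel_size` tokens are attended to in corner cases as well.
    start = max(0, min(start, length - kernel_size))
    return list(range(start, start + kernel_size))


print(even_kernel_window(10, 4, 32))  # r = 2 -> [8, 9, 10, 11]: 2 left, self, 1 right
print(even_kernel_window(0, 4, 32))   # corner case -> [0, 1, 2, 3]
```

Enabling the Flex Attention backend is a single call, exactly as described above:

```python
from natten import use_flex_attention

# Route NATTEN's fused neighborhood attention through Flex Attention instead of FNA.
use_flex_attention()
```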

0.17.4

* Support for additional KV tokens in FNA (requires xFormers)
  * Adds experimental support for additional KV tokens (attend to the local neighborhood, plus some
    additional context) to the FNA interfaces, with training support.
  * The attention branch between Q and the additional KVs runs with xFormers, which targets FAv2.
    Eventually we'd want this branch to use PyTorch's SDPA directly, but as of now there is no
    SDPA interface that returns logsumexp along with the output, which makes this impossible.
  * The reduction is done in pure torch, and if possible will be fused into a single op with torch
    compile (a sketch of this merge appears after this list).
  * In theory, any number of different attentions can be merged this way, but the interface only
    allows one additional KV set for now.
* Bug fixes in the FLOP counter (with fvcore)
  * Bugs introduced since the FNA backward pass was added have been fixed.
  * Finally added unit tests.
* Better FLOP counting support
  * Renamed instances of FLOP counting with fvcore to MACs, since that's what fvcore reports.
  * Added experimental support for torch's native FLOP counter.
* Better documentation
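
A minimal sketch of the pure-torch reduction mentioned above, assuming each attention branch returns its output along with its logsumexp (LSE) over its own key set (shapes and the helper name are illustrative, not NATTEN's interface):

```python
import torch


def merge_attention_branches(out_a, lse_a, out_b, lse_b):
    # Assumed shapes: out_*: [B, tokens, heads, dim], lse_*: [B, tokens, heads].
    # Softmax over the union of two disjoint key sets is a convex combination of
    # the per-branch outputs, weighted by exp(lse_branch - lse_total).
    lse = torch.logaddexp(lse_a, lse_b)
    out = (out_a * torch.exp(lse_a - lse).unsqueeze(-1)
           + out_b * torch.exp(lse_b - lse).unsqueeze(-1))
    return out, lse
```

Because the merged result carries its own LSE, the same two-way merge can be applied repeatedly, which is why in principle any number of attention branches could be combined this way.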

0.17.3

* Bug fix for torch < 2.4
* The 0.17.2 release is directly replaced by 0.17.3.

0.17.2

* Enable KV parallelism by default
  * No realistic use case benefits from disabling KV parallelism, because disabling it virtually kills
    occupancy in any small-batch/few-head case. Most packages should be using it by default, the same
    way PyTorch's deterministic mode is disabled by default. Users will still get a warning if PyTorch's
    or NATTEN's deterministic mode is enabled. (167)
* Bug fixes
  * Fix rare DDP issue (167).
  * Fix inconsistencies in docs. (167)
* QoL
  * Switch from `torch.cuda.amp` to `torch.amp`, since the former is deprecated (168); see the example
    after this list.
* Binaries for torch 2.5.
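
For reference, a minimal before/after of the deprecated autocast spelling (the model and input here are placeholders):

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
x = torch.randn(4, 8, device="cuda")

# Deprecated spelling that 0.17.2 moves away from:
#   with torch.cuda.amp.autocast():
#       out = model(x)

# Replacement using the device-agnostic torch.amp API:
with torch.amp.autocast("cuda"):
    out = model(x)
```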

0.17.1

* Fixed interface for Python 3.8 and 3.9

0.17.0

* [Fused neighborhood attention](https://github.com/SHI-Labs/NATTEN/tree/main/docs/fna) (FNA) kernels
  * 1D, 2D and 3D Neighborhood Attention are supported,
  * Causal neighborhood attention is implemented,
  * Window (kernel) size, dilation, and causality can be defined *per-axis*,
  * All GPU architectures since Maxwell (SM50) are supported,
    * SM50 up to SM70 are SIMT-only, but support both FP16 and FP32,
    * SM70 and SM75 target Tensor Cores in FP16, and SIMT-style in FP32,
    * SM80 and above target Tensor Cores in FP16, BF16, and FP32.
  * NATTEN [Auto-tuner](https://github.com/SHI-Labs/NATTEN/blob/main/docs/fna/autotuner.md),
  * Memory preferences and [KV parallelism](https://github.com/SHI-Labs/NATTEN/blob/main/docs/fna/kv-parallelism.md) modes,
  * Relative positional biases are only supported in the forward pass (inference).
  * Memory layout in FNA is different from existing kernels (`[B, *, heads, dim]` instead of `[B, heads, *, dim]`).
    * Eventually this layout can skip the permute/explicit reshape step in the attention module following
      the QKV projection (see the sketch after this list).
    * For more, refer to [Fused vs unfused NA](docs/fna/fused-vs-unfused.md).
* Naive kernels now implement and allow causal masking,
* Naive kernels (CPU and CUDA) now allow varying parameters (window size, dilation, causal) across axes,
* Major bug fix in Volta GEMM kernels
  * The epilogue was different for Volta, and it slipped through unit tests,
  * Tests are now more aggressive, and the issue has been fixed.
* Memory alignment bug in half-precision RPB gradient kernels fixed
  * See [#97](https://github.com/SHI-Labs/NATTEN/issues/97).
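
To make the layout note above concrete, here is a small sketch with placeholder shapes and a hypothetical QKV projection; the point is that FNA's `[B, *, heads, dim]` layout is what the projection already produces, so the permute required by the older `[B, heads, *, dim]` layout can be skipped:

```python
import torch

B, H, W, heads, dim = 2, 14, 14, 4, 32

# Hypothetical QKV projection output over a [B, H, W, C] feature map.
qkv = torch.randn(B, H, W, 3 * heads * dim)

# FNA layout: [B, H, W, heads, dim] -- exactly what the projection yields, no permute needed.
q, k, v = qkv.view(B, H, W, 3, heads, dim).unbind(dim=3)

# Existing (unfused) kernels expect [B, heads, H, W, dim], which requires an explicit permute.
q_unfused = q.permute(0, 3, 1, 2, 4).contiguous()
```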
