Highlights
- Fast Metal GPU FFTs
  - On average ~30x faster than CPU
  - [More benchmarks](https://github.com/ml-explore/mlx/pull/1102)
- `mx.distributed` with `all_sum` and `all_gather`
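
As a rough illustration of what the two collectives compute, the sketch below simulates them in plain Python over a list of per-process values. It is not MLX code and performs no communication. `simulate_all_sum` and `simulate_all_gather` are hypothetical helper names, and the sketch assumes that `all_sum` returns the elementwise sum across processes while `all_gather` concatenates each process's array in rank order.

```python
def simulate_all_sum(per_rank):
    """Every rank receives the elementwise sum of all ranks' arrays."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def simulate_all_gather(per_rank):
    """Every rank receives all ranks' arrays concatenated in rank order."""
    gathered = [v for arr in per_rank for v in arr]
    return [list(gathered) for _ in per_rank]


# Three simulated processes, each holding a two-element array.
per_rank = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]

summed = simulate_all_sum(per_rank)
gathered = simulate_all_gather(per_rank)

print(summed[0])    # [111.0, 222.0]
print(gathered[0])  # [1.0, 2.0, 10.0, 20.0, 100.0, 200.0]
```

Both results are identical on every rank, which is what distinguishes these collectives from a reduce or gather that delivers to a single root process.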
Core
- Add DLPack device support via `__dlpack_device__`
- Fast GPU FFTs [benchmarks](https://github.com/ml-explore/mlx/pull/1102)
- Add docs for `mx.distributed`
- Add `mx.view` op
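
To make the idea of a view op concrete, here is a small standard-library sketch of reinterpreting the same bytes under a different element type. It assumes `mx.view` behaves like a bit-level reinterpretation (as NumPy's `ndarray.view` does) and not like a value conversion. The helpers `view_f32_as_u32` and `view_f32_as_u8` are hypothetical names for illustration and are not part of MLX.

```python
import struct


def view_f32_as_u32(values):
    """Reinterpret little-endian float32 bytes as uint32 values (no conversion)."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return list(struct.unpack(f"<{len(values)}I", raw))


def view_f32_as_u8(values):
    """Reinterpret float32 bytes as uint8; the element count grows by 4x."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return list(raw)


bits = view_f32_as_u32([1.0, -2.0])
raw_bytes = view_f32_as_u8([1.0, -2.0])

print([hex(b) for b in bits])  # ['0x3f800000', '0xc0000000']
print(len(raw_bytes))          # 8
```

Note that viewing as a narrower type changes the number of elements, whereas a cast would keep the count and change the values.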
NN
- Add `softmin`, `hardshrink`, and `hardtanh` activations
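
For reference, the commonly used definitions of these three activations can be sketched in plain Python. This is not the MLX implementation; the function names, the list-based signatures, and the default parameters (`lambd=0.5`, `min_val=-1.0`, `max_val=1.0`) are assumptions borrowed from the usual conventions in other frameworks.

```python
import math


def softmin(xs):
    """Softmax of the negated inputs: smaller values get larger weights."""
    m = min(xs)  # shift by the minimum for numerical stability
    exps = [math.exp(-(x - m)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hardshrink(xs, lambd=0.5):
    """Zero out values whose magnitude is at most lambd; pass the rest through."""
    return [x if abs(x) > lambd else 0.0 for x in xs]


def hardtanh(xs, min_val=-1.0, max_val=1.0):
    """Clamp each value to the interval [min_val, max_val]."""
    return [max(min_val, min(max_val, x)) for x in xs]


weights = softmin([0.0, 1.0, 2.0])
shrunk = hardshrink([-1.0, 0.2, 0.7])
clamped = hardtanh([-3.0, 0.5, 2.0])

print(shrunk)   # [-1.0, 0.0, 0.7]
print(clamped)  # [-1.0, 0.5, 1.0]
```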
Bugfixes
- Fix broadcast bug in bitwise ops
- Allow more buffers for JIT compilation
- Fix matvec vector stride bug
- Fix multi-block sort stride management
- Stabilize the cumprod gradient at 0
- Fix race condition in scan