### Highlights

* Speed improvements
  * Up to 6x faster CPU indexing [benchmarks](https://github.com/ml-explore/mlx/pull/1450)
  * Faster Metal compiled kernels for strided inputs [benchmarks](https://github.com/ml-explore/mlx/pull/1486)
  * Faster generation with fused-attention kernel [benchmarks](https://github.com/ml-explore/mlx/pull/1497) (see the sketch after this list)
* Gradient for grouped convolutions
* Due to Python 3.8's end of life, we no longer test with it on CI
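
The fused-attention speedup presumably shows up through the existing `mx.fast.scaled_dot_product_attention` entry point; here is a minimal sketch of a single-token decode step (the batch, head, and context sizes are illustrative, not from the release notes):

```python
import math

import mlx.core as mx

# One query token attending over a cached context, as in autoregressive decoding.
B, n_heads, head_dim, ctx_len = 1, 8, 64, 512
q = mx.random.normal((B, n_heads, 1, head_dim))
k = mx.random.normal((B, n_heads, ctx_len, head_dim))
v = mx.random.normal((B, n_heads, ctx_len, head_dim))

# For supported shapes and dtypes this call should hit the fused kernel.
out = mx.fast.scaled_dot_product_attention(q, k, v, scale=1.0 / math.sqrt(head_dim))
mx.eval(out)
print(out.shape)  # (1, 8, 1, 64)
```
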
### Core

* New features (examples in the sketch after this list)
  * Gradient for grouped convolutions
  * `mx.roll`
  * `mx.random.permutation`
  * `mx.real` and `mx.imag`
* Performance
  * Up to 6x faster CPU indexing [benchmarks](https://github.com/ml-explore/mlx/pull/1450)
  * Faster CPU sort [benchmarks](https://github.com/ml-explore/mlx/pull/1453)
  * Faster Metal compiled kernels for strided inputs [benchmarks](https://github.com/ml-explore/mlx/pull/1486)
  * Faster generation with fused-attention kernel [benchmarks](https://github.com/ml-explore/mlx/pull/1497)
  * Bulk eval in safetensors to avoid unnecessary serialization of work
* Misc
  * Bump to nanobind 2.2
  * Move testing to Python 3.9 due to Python 3.8's end of life
  * Make the GPU device more thread safe
  * Fix the submodule stubs for better IDE support
  * CI-generated docs that will never be stale
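
A quick, non-authoritative tour of the new ops named above; the calls follow the usual NumPy-style conventions, so double-check argument names against the API docs:

```python
import mlx.core as mx

a = mx.arange(6)

# mx.roll: circularly shift elements, over the flattened array or a given axis.
print(mx.roll(a, 2))              # [4, 5, 0, 1, 2, 3]

# mx.random.permutation: shuffle an array along an axis, or permute a range.
print(mx.random.permutation(a))   # a random shuffle of 0..5
print(mx.random.permutation(6))   # same, starting from an integer

# mx.real / mx.imag: real and imaginary parts of a complex array.
z = mx.array([1 + 2j, 3 - 4j])
print(mx.real(z))                 # [1, 3]
print(mx.imag(z))                 # [2, -4]
```
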
### NN

* Add support for grouped 1D convolutions to the `nn` API (see the sketch after this list)
* Add some missing type annotations
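
A minimal sketch of the grouped path through the layer API, assuming `nn.Conv1d` takes a `groups` keyword that mirrors `mx.conv1d` (the channel counts and the toy loss are illustrative):

```python
import mlx.core as mx
import mlx.nn as nn

# 16 input channels split into 4 groups, each convolved with its own filters.
# MLX convolutions are channels-last: inputs are (N, L, C).
conv = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, groups=4)

x = mx.random.normal((8, 100, 16))
print(conv(x).shape)  # (8, 98, 32)

# Gradients through grouped convolutions work as well (see the Core changes above).
def loss_fn(model, inputs):
    return model(inputs).sum()

loss, grads = nn.value_and_grad(conv, loss_fn)(conv, x)
```
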
### Bugfixes

* Fix and speed up row-reduce with few rows
* Fix a segfault in the normalization primitives on unexpected inputs
* Fix complex power on the GPU
* Fix freeing deep unevaluated graphs [details](https://github.com/ml-explore/mlx/pull/1462)
* Fix race with `array::is_available`
* Consistently handle softmax with all `-inf` inputs
* Fix streams in affine quantize
* Fix CPU compile preamble for some Linux machines
* Stream safety in CPU compilation
* Fix CPU compile segfault at program shutdown