### Highlights
* Support for 3- and 6-bit quantization: [benchmarks](https://github.com/ml-explore/mlx/pull/1613) (a short sketch follows this list)
* Much faster memory-efficient attention for head dimensions 64 and 80: [benchmarks](https://github.com/ml-explore/mlx/pull/1610)
* Much faster SDPA inference kernel for longer sequences: [benchmarks](https://github.com/ml-explore/mlx/pull/1597)
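As a quick illustration of the new bit widths, here is a minimal sketch using the `mx.quantize` / `mx.quantized_matmul` API with `bits=3`; the shapes and group size are arbitrary choices for the example.

```python
import mlx.core as mx

# Quantize a weight matrix to 3 bits (bits=6 works the same way).
w = mx.random.normal((512, 1024))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=3)

# Matmul directly against the quantized weights.
x = mx.random.normal((8, 1024))
y = mx.quantized_matmul(
    x, w_q, scales, biases, transpose=True, group_size=64, bits=3
)

# Round-trip to inspect the quantization error.
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=3)
mx.eval(y, w_hat)
```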
### Core
* Add a `contiguous` op (C++ only) and its primitive
* Limit BFS width when traversing the graph to reduce memory consumption during `eval`
* Fast CPU quantization
* Faster indexing math in several kernels: unary, binary, ternary, copy, compiled, and reduce (a toy example follows this list)
* Improve thread dispatch for a few kernels: conv, split-k GEMM, and custom kernels
* More buffer donation with no-ops to reduce memory use
* Use `CMAKE_OSX_DEPLOYMENT_TARGET` to pick Metal version
* Dispatch Metal bf16 type at runtime when using the JIT
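The indexing and dispatch improvements above are internal, so there is no new API to call; for reference, a toy function like the following exercises the affected kernel paths (shapes are arbitrary).

```python
import mlx.core as mx

# exp hits the unary kernel, * the binary kernel, and sum the reduce
# kernel; wrapping in mx.compile routes the fused graph through the
# compiled kernel path.
@mx.compile
def fused(x, y):
    return (mx.exp(x) * y).sum(axis=-1)

x = mx.random.normal((1024, 1024))
y = mx.random.normal((1024, 1024))
mx.eval(fused(x, y))
```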
### NN
* `nn.AvgPool3d` and `nn.MaxPool3d`
* Support `groups` in `nn.Conv2d` (both additions are shown in the sketch below)
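A minimal sketch of both additions; the shapes are illustrative, and note that MLX uses a channels-last layout, so 3D pooling takes `(N, D, H, W, C)` inputs and `nn.Conv2d` takes `(N, H, W, C)`.

```python
import mlx.core as mx
import mlx.nn as nn

# 3D pooling over (N, D, H, W, C) inputs.
x = mx.random.normal((4, 8, 16, 16, 3))
pool_avg = nn.AvgPool3d(kernel_size=2, stride=2)
pool_max = nn.MaxPool3d(kernel_size=2, stride=2)
y = pool_max(pool_avg(x))

# Grouped 2D convolution: in_channels and out_channels must both be
# divisible by groups.
x2 = mx.random.normal((4, 32, 32, 16))
conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, groups=4)
y2 = conv(x2)
mx.eval(y, y2)
```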
### Bug fixes
* Fix per-example mask and docs in SDPA (see the sketch after this list)
* Fix an FFT synchronization bug (use the dispatch method everywhere)
* Throw for invalid `*fft{2,n}` cases
* Fix an out-of-bounds access in `qmv`
* Fix donation in SDPA to reduce memory use
* Allocate safetensors header on the heap to avoid stack overflow
* Fix a memory leak involving sibling arrays
* Fix a `view` segfault for scalar inputs
* Fix `vmap` over `concatenate`
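As a sketch of the per-example mask fix: `mx.fast.scaled_dot_product_attention` accepts an additive mask with a leading batch dimension, applied per example. The shapes below are arbitrary.

```python
import mlx.core as mx

B, H, L, D = 2, 8, 128, 64
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))

# One additive mask per batch element, broadcast over the head axis.
mask = mx.random.normal((B, 1, L, L))
out = mx.fast.scaled_dot_product_attention(q, k, v, scale=D**-0.5, mask=mask)
mx.eval(out)
```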