Highlights
* Small-size build that JIT-compiles kernels and omits the CPU backend, resulting in a binary under 4MB
  - Series of PRs [1](https://github.com/ml-explore/mlx/pull/1105), [2](https://github.com/ml-explore/mlx/pull/1123), [3](https://github.com/ml-explore/mlx/pull/1091), [4](https://github.com/ml-explore/mlx/pull/1132), [5](https://github.com/ml-explore/mlx/pull/1139)
* `mx.gather_qmm`, a quantized equivalent of `mx.gather_mm`, which speeds up MoE inference by ~2x
  - [Some numbers](https://github.com/ml-explore/mlx-examples/pull/782)
* Grouped 2D convolutions
  - [Some numbers](https://github.com/ml-explore/mlx/pull/1129)
Core
* `mx.conjugate`
* `mx.conv3d` and `nn.Conv3d`
* List-based indexing
* Started `mx.distributed`, which uses MPI (if installed) for communication across machines
  - `mx.distributed.init`
  - `mx.distributed.all_gather`
  - `mx.distributed.all_reduce_sum`
* Support for conversion to and from DLPack
* `mx.linalg.cholesky` on CPU
* `mx.quantized_matmul` sped up for vector-matrix products
* `mx.trace`
* `mx.block_masked_mm` now supports floating-point masks
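
A minimal sketch of three of the additions above (list-based indexing, `mx.trace`, and `mx.conjugate`), assuming they follow the NumPy conventions for the same operations. The NumPy fallback import is an assumption added only so the snippet runs on machines without MLX; it is not part of the release.

```python
try:
    import mlx.core as mx  # MLX, if installed
except ImportError:
    # Assumption for illustration only: NumPy exposes the same call names
    # for the operations used below, so it stands in when MLX is missing.
    import numpy as mx

a = mx.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

# List-based indexing: select rows 0 and 2 with a plain Python list.
rows = a[[0, 2]]

# Trace: sum of the main diagonal (1 + 5 + 9).
tr = mx.trace(a)

# Complex conjugate: flips the sign of the imaginary part.
z = mx.array([1 + 2j, 3 - 4j])
zc = mx.conjugate(z)

print(rows.tolist())
print(tr.item())
print(zc.tolist())
```

With MLX installed, the results are lazy `mx.array` values that are evaluated when converted with `tolist()` or `item()`.
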
Fixes
* Error messaging in eval
* Added some missing docs
* Scatter index bug
* The extensions example now compiles and runs
* CPU copy bug with many dimensions