## Highlights
* Block-sparse matrix multiplies speed up mixture-of-experts (MoE) models by more than 2x
* [some numbers](https://github.com/ml-explore/mlx/pull/1058)
* Improved quantization algorithm should work well for all networks
* [see evaluations](https://github.com/ml-explore/mlx/pull/1061)
* Improved GPU command submission speeds up training and inference
* [some numbers](https://github.com/ml-explore/mlx/pull/1085)
## Core
* Bitwise ops added:
- `mx.bitwise_[or|and|xor]`, `mx.[left|right]_shift`, operator overloads
* Added `groups` support to `Conv1d`
* Added `mx.metal.device_info` to query device properties for better-informed memory limits
* Added resettable memory stats
* Added `mlx.optimizers.clip_grad_norm` and `mlx.utils.tree_reduce`
* Added `mx.arctan2`
* Unary ops now accept array-like inputs, e.g. `mx.sqrt(2)`
## Bugfixes
* Fixed the output shape of slice updates
* Fixed a bug in quantization that used slightly incorrect scales and biases
* Fixed memory leak for multi-output primitives encountered with gradient checkpointing
* Fixed conversion from other frameworks for all datatypes
* Fixed index overflow for matmul with large batch size
* Fixed initialization ordering that occasionally caused segfaults