Highlights:
- Fast partial RoPE (used by Phi-2)
- Fast gradients for RoPE, RMSNorm, and LayerNorm
- Up to 7x faster, [benchmarks](https://github.com/ml-explore/mlx/pull/883#issue-2204137982)
Core
- More overhead reductions
- Partial fast RoPE (fast Phi-2)
- Better buffer donation for copy
- Type hierarchy and issubdtype
- Fast VJPs for RoPE, RMSNorm, and LayerNorm
NN
- `Module.set_dtype`
- Chaining in `nn.Module` (`model.freeze().update(…)`)
Bugfixes
- Fix set item bugs
- Fix scatter vjp
- Check shape integer overlow on array construction
- Fix bug with module attributes
- Fix two bugs for odd shaped QMV
- Fix GPU sort for large sizes
- Fix bug in negative padding for convolutions
- Fix bug in multi-stream race condition for graph evaluation
- Fix random normal generation for half precision