What's Changed
- Support for bfloat16
- Optimizations for top_k > 1
- Support for fully-sharded data parallelism
- Support for tensor model parallelism when expert_parallel_world_size > num_experts
- Optimizations for activation memory
- Support for activation quantization (thanks dblalock!)
- Optimizations for SM90 (Hopper)
- Numerous bug fixes, cleanups, and small optimizations
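
To illustrate the tensor model parallelism item above, here is a minimal sketch of how ranks might be divided once expert_parallel_world_size exceeds num_experts. The function name `sharding_plan` and the returned keys are hypothetical, not MegaBlocks APIs, and the split rule is an assumption made for illustration rather than a description of the library's implementation.

```python
def sharding_plan(expert_parallel_world_size: int, num_experts: int) -> dict:
    """Hypothetical helper: split ranks between expert and tensor parallelism.

    Assumption: experts are first spread across ranks; any ranks left over
    once every expert has its own rank are used to shard each expert's weights.
    """
    # Number of ways the set of experts is partitioned (at most one per expert).
    expert_sharding_degree = min(expert_parallel_world_size, num_experts)
    if num_experts % expert_sharding_degree != 0:
        raise ValueError("num_experts must be divisible by the expert sharding degree")
    if expert_parallel_world_size % expert_sharding_degree != 0:
        raise ValueError("world size must be divisible by the expert sharding degree")

    # Number of ranks that jointly hold one expert's weights (tensor parallelism).
    tensor_sharding_degree = expert_parallel_world_size // expert_sharding_degree

    return {
        "expert_sharding_degree": expert_sharding_degree,
        "experts_per_rank": num_experts // expert_sharding_degree,
        "tensor_sharding_degree": tensor_sharding_degree,
    }


if __name__ == "__main__":
    # 16 ranks, 8 experts: each expert's weights are split across 2 ranks.
    print(sharding_plan(expert_parallel_world_size=16, num_experts=8))
    # 4 ranks, 8 experts: 2 whole experts per rank, no tensor sharding.
    print(sharding_plan(expert_parallel_world_size=4, num_experts=8))
```

Under these assumptions, tensor sharding only comes into play when there are more ranks than experts; otherwise each rank holds one or more whole experts.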
New Contributors
* vchiley made their first contribution in https://github.com/stanford-futuredata/megablocks/pull/9
* deepakn94 made their first contribution in https://github.com/stanford-futuredata/megablocks/pull/16
* b-chu made their first contribution in https://github.com/stanford-futuredata/megablocks/pull/19
**Full Changelog**: https://github.com/stanford-futuredata/megablocks/compare/v0.1...v0.3.2