1. Enhance communication features: overlap all-to-all (a2a) with computation, support different granularities of group creation, etc.
2. Add a single-threaded CPU implementation for correctness checks and reference.
3. Refine the JIT compiler interface for more flexible use: jit::inject_source and jit::jit_execute.
4. Enhance examples: fp64 support, CUDA AMP, checkpointing, etc.
5. Support execution inside torch.distributed.pipeline.
How to set up:
python3 -m pip install --user https://github.com/microsoft/tutel/archive/refs/tags/v0.1.4.tar.gz
Contributors: yzygitzh, ghostplant, EricWangCN