Features:

* Flash Checkpoint asynchronously persists checkpoints to storage.
* Flash Checkpoint recovers from failures using the checkpoint held in memory.
* Flash Checkpoint supports DDP, FSDP, DeepSpeed, and Megatron.
* Node detection supports NPUs.

Examples:

* An example of training nanoGPT with DeepSpeed.
* An example of saving and loading a sharded FSDP checkpoint.

0.2.2
Features:

* dlrover-run can run in any distributed job that sets NODE_RANK and DLROVER_MASTER_ADDR in the environment.
* DLRover can asynchronously save checkpoints to storage, blocking training only briefly.

BugFix:

* Fix a bug in loading FSDP checkpoints.

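
The asynchronous checkpoint saving listed under the 0.2.2 features can be illustrated with a minimal, generic sketch. This is not DLRover's implementation or API: `AsyncCheckpointer`, `save`, `wait`, and `demo` are hypothetical names, and the sketch uses only the Python standard library. The idea shown is that training blocks only while the state is copied in memory, and a background thread then writes the copy to storage.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper for illustration; not part of DLRover."""

    def __init__(self):
        self._thread = None

    def save(self, state, path):
        # Wait for any previous write so that two writes never overlap.
        self.wait()
        # Blocking part: take an in-memory snapshot of the state.
        snapshot = copy.deepcopy(state)
        # Non-blocking part: persist the snapshot in a background thread.
        self._thread = threading.Thread(
            target=self._persist, args=(snapshot, path)
        )
        self._thread.start()

    @staticmethod
    def _persist(snapshot, path):
        # Write to a temporary file, then rename it into place, so a crash
        # mid-write does not leave a truncated checkpoint at `path`.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, path)

    def wait(self):
        # Block until the pending write, if any, has finished.
        if self._thread is not None:
            self._thread.join()
            self._thread = None


def demo():
    state = {"step": 1, "weights": [0.1, 0.2]}
    checkpointer = AsyncCheckpointer()
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    checkpointer.save(state, path)
    state["step"] = 2  # Training continues; the snapshot is unaffected.
    checkpointer.wait()
    with open(path, "rb") as f:
        return pickle.load(f)
```

Calling `demo()` returns the state as it was when `save` was called, even though the live state was modified while the write was in flight.
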
0.2.1
* Autotune the batch size without restarting the job.
* Automatically detect stragglers (slow workers).
* TFPlus: TFPlus 0.1.0 has been released; see details at https://github.com/intelligent-machine-learning/dlrover/tree/master/tfplus
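
As a rough illustration of what detecting a straggler can mean, the sketch below flags workers whose step time is far above the median across workers. It is an assumption for illustration only, not DLRover's actual detection logic: `find_stragglers`, the `slowdown` factor of 2.0, and the worker names are all hypothetical.

```python
from statistics import median


def find_stragglers(step_times, slowdown=2.0):
    """Return the workers whose step time exceeds `slowdown` x the median.

    `step_times` maps a worker name to its measured seconds per step.
    The threshold is a hypothetical choice, not a DLRover default.
    """
    typical = median(step_times.values())
    return sorted(
        worker
        for worker, seconds in step_times.items()
        if seconds > slowdown * typical
    )


# Example measurements in seconds per training step (made-up values).
measured = {"worker-0": 1.0, "worker-1": 1.1, "worker-2": 3.5}
stragglers = find_stragglers(measured)
```

The median is used rather than the mean so that a single very slow worker does not inflate the baseline it is compared against.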