DLRover

Latest version: v0.3.6


0.3.6

Features:
- Flash checkpoint provides `FlashCkptTrainer` to support the HuggingFace `transformers.Trainer`.
- Flash checkpoint supports loading the Megatron-LM checkpoint from memory.
- Flash checkpoint supports saving and loading FSDP checkpoints with a full state dict.
- The job master can sort node ranks by each node's access switch.
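Topology-aware rank assignment places nodes behind the same access switch at adjacent ranks, so collectives between neighbors stay on high-bandwidth links. A minimal sketch of the idea in plain Python; the function and data shapes are illustrative, not DLRover's actual API:

```python
def sort_ranks_by_switch(nodes):
    """Assign ranks so nodes under the same access switch are neighbors.

    `nodes` is a list of (node_id, switch_id) pairs. These names are
    illustrative; DLRover's internal representation differs.
    """
    # Sort by switch first so nodes behind the same switch receive
    # contiguous ranks, then by node id for a stable order.
    ordered = sorted(nodes, key=lambda n: (n[1], n[0]))
    return {node_id: rank for rank, (node_id, _) in enumerate(ordered)}
```

For example, `sort_ranks_by_switch([("n3", "sw1"), ("n1", "sw0"), ("n2", "sw1")])` gives `n1` rank 0 and keeps `n2` and `n3` adjacent at ranks 1 and 2.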

BugFix:
- Fix the segmentation fault when restarting the training process.

0.3.5

Features:
- Flash checkpoint supports saving and loading Megatron-LM MoE models. (#1042)
- APIs to extend the node-check module to support different accelerator chips. (#1023)
- Automatically mark a node as unschedulable if it fails. (#1025)

BugFix:
- Fix the DDP MNIST example to save and load checkpoints. (#1051)
- Fix the DDP checkpoint name. (#1034)

0.3.4

Features:

- Flash checkpoint enables saving and loading Megatron-LM models from multiple ranks in parallel.
- `dlrover-run --auto-config` automatically configures the number of nodes and the number of processes per node.
- Users can customize the storage APIs to save checkpoints to different file systems.
- A deletion strategy to clean up old checkpoint files.

BugFix:
- Fix the bug where the shared memory does not exist after the size of the checkpoint changes.
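The fix amounts to recreating the shared-memory block whenever the incoming checkpoint no longer fits in the existing one. A minimal sketch using the standard library's `multiprocessing.shared_memory`; the bookkeeping and names are illustrative, and DLRover's internal handling differs in detail:

```python
from multiprocessing import shared_memory

def write_checkpoint_to_shm(name, data, existing=None):
    """Place serialized checkpoint bytes into shared memory,
    recreating the block when the new checkpoint no longer fits.

    `existing` is the SharedMemory object from the previous save,
    if any. Illustrative only.
    """
    shm = existing
    if shm is not None and shm.size < len(data):
        # The old block is too small: release it and allocate a new one.
        shm.close()
        shm.unlink()
        shm = None
    if shm is None:
        shm = shared_memory.SharedMemory(name=name, create=True, size=len(data))
    shm.buf[: len(data)] = data
    return shm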

0.3.3

Features:
- Support Python > 3.10.
- Support restarting the training process on Ascend NPU.
- Support asynchronously saving the checkpoint of the distributed optimizer of Megatron-LM to the storage.

BugFix:
- Fix the checkpoint shard inconsistency across ranks.
- Fix asynchronously saving the Megatron-LM checkpoint for multi-node, multi-GPU jobs.
- Fix loading the Megatron-LM checkpoint.

0.3.1

Feature:
- Users can use flash checkpoint with `torchrun` or `python -m torch.distributed.launch`.

BugFix:
- Fix the bug where the DLRover master cannot print the error message of the faulty node in a Kubeflow PyTorchJob.

0.3.0

Features:

- Flash Checkpoint asynchronously persists checkpoints to storage.
- Flash Checkpoint recovers the checkpoint from memory after a failure.
- Flash Checkpoint supports DDP/FSDP/DeepSpeed/Megatron-LM.
- Node detection supports NPUs.

Examples

- An example of training nanoGPT with DeepSpeed.
- An example of saving and loading a sharded FSDP checkpoint.

