Highlights
We are excited to introduce many new features in NVIDIA Resiliency Extension v0.2.
* In-process restart – Provides a mechanism, exposed as a Python function wrapper, to restart training without killing the running process. Compared to a traditional scheduler-level restart, restarting within the same process removes the overheads associated with launching a new scheduler job, starting a container, initializing a new Python interpreter, loading dependencies, and creating a new CUDA context. A usage sketch follows this list.
* Asynchronous checkpoint - Provides core utilities to run checkpointing routines in the background. It uses torch.multiprocessing to fork a temporary process that carries out the asynchronous checkpoint save. The application can check on the save in a non-blocking manner and specify a user-defined finalization step to run once all ranks finish their background checkpoint saving. A conceptual sketch follows this list.
* Local checkpoint - Provides a mechanism to create a checkpoint in local host memory. Local checkpointing is implemented via the Python LocalCheckpointManager class, which operates on a TensorAwareStateDict wrapper; the wrapper encapsulates the operations necessary for efficient replication and data transfers. An illustrative sketch follows this list.
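A minimal sketch of in-process restart is shown below. The module path and decorator name (`nvidia_resiliency_ext.inprocess.Wrapper`) and its argument-free use are assumptions here; consult the package documentation for the full set of configuration options.

```python
# Minimal sketch: wrapping the training loop for in-process restart.
# The decorator/module names are assumed; see the package docs for details.
import torch
import nvidia_resiliency_ext.inprocess as inprocess


@inprocess.Wrapper()
def train():
    # Distributed state is (re)initialized inside the wrapped function:
    # on a fault, the wrapper re-invokes this function within the same
    # Python process, so no new scheduler job, container, interpreter,
    # or CUDA context needs to be created.
    torch.distributed.init_process_group(backend="nccl")
    try:
        for step in range(1000):
            ...  # forward, backward, optimizer step
    finally:
        torch.distributed.destroy_process_group()


if __name__ == "__main__":
    train()
```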
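The next sketch illustrates the background-save pattern described for asynchronous checkpointing. It is written directly against torch.multiprocessing to show the forked save process, the non-blocking completion check, and the user-defined finalization step; it is a conceptual illustration, not the extension's API.

```python
# Conceptual sketch of asynchronous checkpointing: a temporary process forked
# via torch.multiprocessing writes the checkpoint while training continues,
# and the main process polls for completion before finalizing.
import torch
import torch.multiprocessing as mp


def _save_worker(state_dict, path):
    # Runs in the forked process; assumes state_dict already resides in
    # host (CPU) memory, so no CUDA work happens in the child process.
    torch.save(state_dict, path)


def start_async_save(state_dict, path):
    # Fork a temporary process to perform the save in the background.
    proc = mp.get_context("fork").Process(target=_save_worker, args=(state_dict, path))
    proc.start()
    return proc


def maybe_finalize(proc, finalize_fn):
    # Non-blocking check: run the user-defined finalization step only once
    # the background save has finished.
    if proc is not None and not proc.is_alive():
        proc.join()
        finalize_fn()
        return None
    return proc
```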
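Finally, a sketch of local checkpointing with the LocalCheckpointManager class operating on a TensorAwareStateDict wrapper, as named above. The import paths, constructor arguments, and method names below are assumptions for illustration, not the verified API; refer to the package documentation for the exact interface.

```python
import torch

# Assumed import paths and class names for illustration only.
from nvidia_resiliency_ext.checkpointing.local.basic_state_dict import (
    BasicTensorAwareStateDict,  # assumed concrete TensorAwareStateDict wrapper
)
from nvidia_resiliency_ext.checkpointing.local.ckpt_managers.local_manager import (
    LocalCheckpointManager,  # assumed import path
)

model = torch.nn.Linear(8, 8)
manager = LocalCheckpointManager("/tmp/local_ckpt")  # assumed constructor

# Wrap the state dict so the manager can replicate and transfer tensors
# efficiently, then save it locally at the current iteration (assumed call).
manager.save(BasicTensorAwareStateDict(model.state_dict()), iteration=100)

# On restart, locate and load the most recent local checkpoint (assumed calls).
latest = manager.find_latest()
if latest != -1:
    ta_state_dict, ckpt_id = manager.load()
```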
Known Issues & Limitations
* For in-process restart - If a hang occurs, the presence of SHARP raises an exception, which triggers both an in-job restart and an in-process restart. With the current version, SHARP must be disabled to use in-process restart, which requires the following environment variables (a workaround sketch follows this list):
  * NCCL_NVLS_ENABLE=0 to disable SHARP.
  * NCCL_NET_PLUGIN="none" if the NCCL version is older than 2.24.1, to avoid duplicate NCCL net plugin initialization.
* In-process and in-job restart work with PyTorch versions 24.07, 24.08, 24.09, and 24.10, but not with 24.11, due to a known NCCL issue.
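A minimal sketch of the SHARP workaround, assuming the variables are set from the training script itself (they can equally be exported by the job launcher); the values are the ones listed above.

```python
# Workaround sketch: set the NCCL environment variables before any NCCL
# communicator is created (i.e. before torch.distributed initialization).
import os

os.environ["NCCL_NVLS_ENABLE"] = "0"    # disable SHARP
os.environ["NCCL_NET_PLUGIN"] = "none"  # only needed when NCCL < 2.24.1
```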
Contributors
grzegorz-k-karch jbieniusiewi j-szulc mikolajblaz sbak skierat srogawski-nvidia szmigacz yzhautouskay