Async Checkpointing
The asynchronous checkpointing feature in the NVIDIA Resiliency Extension provides core utilities to offload checkpointing routines to the background. It leverages torch.multiprocessing to either fork a temporary process or spawn a persistent process for efficient, non-blocking checkpointing.
Applications can monitor asynchronous checkpoint progress in a non-blocking manner and define a custom finalization step that runs once all ranks complete their background checkpoint saving.
This repository includes an implementation of asynchronous checkpointing utilities for both torch.save and torch.distributed.save_state_dict. Our modified torch.distributed.save_state_dict interface is integrated with an optimized backend, FileSystemWriterAsync, which:
• Runs in the async checkpoint process and creates parallel child processes for intra-node parallelism, avoiding GIL contention.
• Minimizes metadata communication overhead by caching metadata, ensuring efficient checkpoint saving.
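The general pattern described above (offload the save to a background process, poll it without blocking, then run a finalization step) can be sketched in a few lines. This is only an illustration and not the NVIDIA Resiliency Extension API: the names AsyncSaveHandle, _save_worker, and maybe_finalize are hypothetical. The sketch also assumes a single rank and a POSIX system where the "fork" start method is available, and it uses the standard-library multiprocessing and pickle modules in place of torch.multiprocessing and torch.save.

```python
import multiprocessing as mp
import os
import pickle
import tempfile
import time


def _save_worker(state, path):
    """Background process body: write to a temp file, then rename atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


class AsyncSaveHandle:
    """Hypothetical handle for one in-flight background save (single rank only)."""

    def __init__(self, state, path, finalize_fn=None):
        self.path = path
        self._finalize_fn = finalize_fn
        self._finalized = False
        # Assumption: "fork" is available (POSIX). This mirrors the
        # temporary-forked-process mode, not a persistent spawned worker.
        ctx = mp.get_context("fork")
        self._proc = ctx.Process(target=_save_worker, args=(state, path))
        self._proc.start()

    def maybe_finalize(self, blocking=False):
        """Run finalize_fn once the save is done; return True when finalized."""
        if self._finalized:
            return True
        if blocking:
            self._proc.join()
        if self._proc.is_alive():
            return False  # still saving; the caller keeps working
        self._proc.join()  # reap the finished process
        if self._proc.exitcode != 0:
            raise RuntimeError(f"background save failed: exit {self._proc.exitcode}")
        if self._finalize_fn is not None:
            self._finalize_fn(self.path)
        self._finalized = True
        return True


if __name__ == "__main__":
    ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    handle = AsyncSaveHandle(
        {"step": 10, "weights": [0.1, 0.2]},
        ckpt_path,
        finalize_fn=lambda p: print("finalized", p),
    )
    # The "training loop" continues while the save runs in the background.
    while not handle.maybe_finalize():
        time.sleep(0.01)
```

In a multi-rank job, the finalization step would additionally need to wait until every rank reports that its background save has finished (for example through a collective), which this single-process sketch leaves out.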
Contents: