[Feature] Enable activation checkpoint offloading #722
Modulus Pull Request
Description
Add support for offloading the activation checkpoints themselves to the CPU to further alleviate device memory pressure. With checkpointing alone, the input tensors saved for each checkpointed segment still live in device memory; this change moves them to host memory and copies them back during the backward pass.
To enable, add the following config override to the run command: `+model.checkpoint_offloading=true`
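For context, PyTorch exposes the underlying mechanism as a public hook: `torch.autograd.graph.save_on_cpu` stores the tensors autograd saves for backward in (optionally pinned) host memory and copies them back during backward. Below is a minimal, self-contained sketch of that mechanism only; it is not this PR's implementation, and the model and sizes are made up:

```python
# Minimal illustration of offloading saved activations to pinned host
# memory via the public torch.autograd.graph.save_on_cpu hook.
# NOTE: this is a sketch, not this PR's implementation; the model,
# sizes, and batch shape below are hypothetical.
import torch
from torch.autograd.graph import save_on_cpu

device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in for a processor stack (e.g., 15 layers as in the benchmark).
model = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024) for _ in range(15)]
).to(device)
x = torch.randn(64, 1024, device=device, requires_grad=True)

# Inside this context, every tensor autograd would normally keep on the
# device for backward is stored in pinned CPU memory instead, trading
# peak device memory for extra D2H/H2D copies.
with save_on_cpu(pin_memory=True):
    y = model(x)

y.sum().backward()  # saved tensors are copied back to the device here
```

The extra copies are where the throughput cost reported below comes from; pinned memory keeps them on the asynchronous DMA path.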
Results
Using the ahmed_body dataset with 15 checkpoint segments and 15 processor layers (one checkpoint per processor layer), enabling this option reduces peak device memory usage from 5.1 GB to 2.63 GB, at the cost of running at roughly 0.63x the baseline speed on DGX-A100 (0.84x on GH200).
With 1 checkpoint/processor layer only:
With 1 checkpoint/processor and offloading:
Checklist
Todo:
This is a plain PyTorch-based implementation and definitely has room for improvement. By adding double-buffering to overlap the data transfers (D2H and H2D) with computation, we could largely hide the copy overhead; a sketch of the idea follows below. We can address this in a future update.
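As a rough illustration of that double-buffering direction, the sketch below starts the D2H copy of a checkpoint on a dedicated CUDA stream so it can overlap with subsequent computation, and prefetches it back the same way. All helper names are hypothetical and this is not code from this PR:

```python
# Hedged sketch of overlapping checkpoint offload/reload with compute
# using a side CUDA stream. Hypothetical helpers; requires a CUDA device.
import torch

_copy_stream = torch.cuda.Stream()


def offload_async(t: torch.Tensor) -> torch.Tensor:
    """Start a non-blocking D2H copy of `t`; the returned pinned host
    buffer is valid once the work queued on _copy_stream completes."""
    host = torch.empty(t.shape, dtype=t.dtype, pin_memory=True)
    # The copy stream must observe all producers of `t` first.
    _copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(_copy_stream):
        host.copy_(t, non_blocking=True)
        # Keep `t`'s memory from being reused by the caching allocator
        # until the copy enqueued on _copy_stream has finished.
        t.record_stream(_copy_stream)
    return host


def reload_async(host: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Prefetch a pinned host buffer back to the device on the copy
    stream. Consumers must wait on _copy_stream before reading it."""
    with torch.cuda.stream(_copy_stream):
        return host.to(device, non_blocking=True)
```

With hooks around each checkpoint segment, segment i's backward could call `reload_async` for segment i-1's checkpoint, so the H2D copy of the next needed checkpoint overlaps with the current segment's recomputation; the consumer would issue `torch.cuda.current_stream().wait_stream(_copy_stream)` before touching the prefetched tensor.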