
[Feature] Enable activation checkpoint offloading #722

Merged: 5 commits into NVIDIA:main on Nov 25, 2024

Conversation

@chang-l (Contributor) commented Nov 21, 2024

Modulus Pull Request

Description

Add support for offloading the activation checkpoints themselves to CPU memory, further alleviating device memory pressure.

To enable:

```
+model.checkpoint_offloading=true
```
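For reference, a minimal plain-PyTorch sketch of the mechanism (placeholder names, not the code in this PR): `torch.utils.checkpoint.checkpoint` drops intra-segment activations and recomputes them during backward, while the stock `torch.autograd.graph.save_on_cpu` hooks move the tensors autograd still saves (each segment's input) to pinned host memory until backward needs them.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical sketch: `processor_layers` and `x` are placeholders, not
# names from this PR. Checkpointing limits what autograd saves to each
# segment's input; save_on_cpu keeps those saved tensors in pinned host
# memory and copies them back to the GPU when backward unpacks them.
def forward_with_offloaded_checkpoints(processor_layers, x):
    with torch.autograd.graph.save_on_cpu(pin_memory=True):
        for layer in processor_layers:  # one checkpoint per processor layer
            x = checkpoint(layer, x, use_reentrant=True)
    return x
```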

Results

Using the ahmed_body dataset with 15 checkpoint segments and 15 processor layers (one checkpoint per processor layer), enabling this option reduces peak device memory usage from 5.1 GB to 2.63 GB, at the cost of throughput: training runs at ~0.63X of baseline speed on DGX-A100 (~0.84X on GH200).
With 1 checkpoint/processor layer only:
[screenshot: device memory profile]

With 1 checkpoint/processor layer and offloading:
[screenshot: device memory profile]

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Todo:

This is a plain PyTorch-based implementation and definitely has room for improvement. By adding double buffering to overlap the data transfers (D2H and H2D) with computation, we could largely hide the copy overhead; a rough sketch of the overlap idea follows. We can address this in a future update.
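The core of that idea, sketched with hypothetical names (not code from this PR): enqueue each checkpoint's D2H copy on a side CUDA stream so it runs concurrently with the next segment's forward compute on the default stream. A production version would likewise prefetch the H2D copies during backward and recycle a small pool of staging buffers.

```python
import torch

def forward_with_overlapped_offload(segments, x):
    # Hypothetical sketch: overlap device-to-host offload with compute.
    copy_stream = torch.cuda.Stream()
    offloaded = []
    for segment in segments:
        # The side stream must see x fully produced before reading it.
        copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(copy_stream):
            buf = torch.empty(x.size(), dtype=x.dtype, pin_memory=True)
            buf.copy_(x, non_blocking=True)  # async D2H on the side stream...
            offloaded.append(buf)
        x.record_stream(copy_stream)  # don't recycle x while the copy reads it
        x = segment(x)                # ...overlaps this forward compute
    torch.cuda.current_stream().wait_stream(copy_stream)
    return x, offloaded
```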

@chang-l changed the title from "Add act offloading option" to "[Feature] Enable activation checkpoint offloading" Nov 21, 2024
@mnabian self-requested a review November 21, 2024 17:46
@mnabian (Collaborator) commented Nov 21, 2024

LGTM. Please update the changelog.

@mnabian (Collaborator) commented Nov 21, 2024

/blossom-ci

@mnabian (Collaborator) commented Nov 25, 2024

/blossom-ci

@mnabian (Collaborator) commented Nov 25, 2024

/blossom-ci

@mnabian (Collaborator) commented Nov 25, 2024

/blossom-ci

@mnabian merged commit a5d3b5b into NVIDIA:main Nov 25, 2024
1 check passed