[Feature] Enable activation checkpoint offloading #722
Modulus Pull Request
Description
Add support for offloading the activation checkpoints themselves to the CPU to further alleviate device memory pressure. With checkpointing alone, the input tensors saved for each checkpointed segment still live in device memory; this change moves them to host memory and copies them back during the backward pass.
To enable, add the following config override to the run command: `+model.checkpoint_offloading=true`
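For context, PyTorch exposes the underlying mechanism as a public hook: `torch.autograd.graph.save_on_cpu` stores the tensors autograd saves for backward in (optionally pinned) host memory and copies them back during backward. Below is a minimal, self-contained sketch of that mechanism only; it is not this PR's implementation, and the model and sizes are made up:

```python
# Minimal illustration of offloading saved activations to pinned host
# memory via the public torch.autograd.graph.save_on_cpu hook.
# NOTE: this is a sketch, not this PR's implementation; the model,
# sizes, and batch shape below are hypothetical.
import torch
from torch.autograd.graph import save_on_cpu

device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in for a processor stack (e.g., 15 layers as in the benchmark).
model = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024) for _ in range(15)]
).to(device)
x = torch.randn(64, 1024, device=device, requires_grad=True)

# Inside this context, every tensor autograd would normally keep on the
# device for backward is stored in pinned CPU memory instead, trading
# peak device memory for extra D2H/H2D copies.
with save_on_cpu(pin_memory=True):
    y = model(x)

y.sum().backward()  # saved tensors are copied back to the device here
```

The extra copies are where the throughput cost reported below comes from; pinned memory keeps them on the asynchronous DMA path.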
Results
Using the ahmed_body dataset with 15 checkpoint segments and 15 processor layers (one checkpoint per processor layer), enabling this option reduces peak device memory usage from 5.1 GB to 2.63 GB, at the cost of running at roughly 0.63x the baseline speed on DGX-A100 (0.84x on GH200).
With 1 checkpoint/processor layer only:
With 1 checkpoint/processor and offloading:
Checklist
Todo:
This is a plain PyTorch-based implementation and definitely has room for improvement. By adding double-buffering to overlap the data transfers (D2H and H2D) with computation, we could largely hide the copy overhead; a sketch of the idea follows below. We can address this in a future update.
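As a rough illustration of that double-buffering direction, the sketch below starts the D2H copy of a checkpoint on a dedicated CUDA stream so it can overlap with subsequent computation, and prefetches it back the same way. All helper names are hypothetical and this is not code from this PR:

```python
# Hedged sketch of overlapping checkpoint offload/reload with compute
# using a side CUDA stream. Hypothetical helpers; requires a CUDA device.
import torch

_copy_stream = torch.cuda.Stream()


def offload_async(t: torch.Tensor) -> torch.Tensor:
    """Start a non-blocking D2H copy of `t`; the returned pinned host
    buffer is valid once the work queued on _copy_stream completes."""
    host = torch.empty(t.shape, dtype=t.dtype, pin_memory=True)
    # The copy stream must observe all producers of `t` first.
    _copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(_copy_stream):
        host.copy_(t, non_blocking=True)
        # Keep `t`'s memory from being reused by the caching allocator
        # until the copy enqueued on _copy_stream has finished.
        t.record_stream(_copy_stream)
    return host


def reload_async(host: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Prefetch a pinned host buffer back to the device on the copy
    stream. Consumers must wait on _copy_stream before reading it."""
    with torch.cuda.stream(_copy_stream):
        return host.to(device, non_blocking=True)
```

With hooks around each checkpoint segment, segment i's backward could call `reload_async` for segment i-1's checkpoint, so the H2D copy of the next needed checkpoint overlaps with the current segment's recomputation; the consumer would issue `torch.cuda.current_stream().wait_stream(_copy_stream)` before touching the prefetched tensor.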