
Stage3: Use new torch grad accumulation hooks API #6773

Open · wants to merge 2 commits into master
Conversation

deepcharm (Contributor) commented:
  • This commit addresses DeepSpeed issue #6718.
  • The existing code used a hook on the grad_acc node to reduce param grads.
    Constructs such as param.data = replicated_tensor.data, used in
    allgather_params(..), are compiled into param.set(), so the hook assigned
    to the grad_acc node is never called. This is a known torch issue:
    pytorch/pytorch#139742.
  • The above caused accuracy issues, which previously could be worked around
    only by disabling torch compile whenever activation checkpointing is used.
  • This commit provides a clean solution: the hook on the grad_acc node is
    replaced with a hook registered via the new, robust API on the param itself,
    param.register_post_accumulate_grad_hook(..) (see the sketch after the
    diff below).

```diff
- self._grad_acc_hooks.append(grad_acc.register_hook(reduce_partition_and_remove_grads))
- self.grad_accs.append(grad_acc)
+ self._grad_acc_hooks.append(
+     param.register_post_accumulate_grad_hook(reduce_partition_and_remove_grads))
```
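For context, here is a minimal, self-contained sketch (not DeepSpeed code) contrasting the two hook styles. It assumes PyTorch 2.1+ for the new API and uses the standard expand_as trick to reach the AccumulateGrad node; the print lambdas stand in for reduce_partition_and_remove_grads:

```python
import torch

param = torch.randn(4, requires_grad=True)

# Old style: hook the AccumulateGrad node reached through a throwaway
# expand. DeepSpeed also kept a reference to grad_acc (self.grad_accs)
# so the node is not garbage collected. If param's storage is later
# swapped via param.set() -- which, per this PR, is what torch compile
# turns `param.data = replicated_tensor.data` into -- the node is
# replaced and this hook silently stops firing.
grad_acc = param.expand_as(param).grad_fn.next_functions[0][0]
handle_old = grad_acc.register_hook(lambda *args: print("grad_acc node hook fired"))

# New style (torch >= 2.1): hook the tensor itself. It fires after the
# gradient has been fully accumulated into param.grad, and it survives
# storage swaps because it is tied to the tensor, not to a graph node.
handle_new = param.register_post_accumulate_grad_hook(
    lambda p: print("post-accumulate hook fired, grad sum:", p.grad.sum().item()))

(param * 2).sum().backward()  # both hooks fire here
```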
A Contributor commented:
Which pytorch version introduced this API? How should we handle older versions?
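For reference, register_post_accumulate_grad_hook was introduced in PyTorch 2.1. Below is a hedged sketch of one possible fallback for older versions, not a committed design; the helper name and the list arguments are illustrative, not DeepSpeed APIs:

```python
import torch

def register_grad_ready_hook(param, hook, hook_handles, grad_acc_refs):
    """Fire hook(param) once param's gradient is fully accumulated.

    hook_handles / grad_acc_refs are caller-owned lists (mirroring
    self._grad_acc_hooks / self.grad_accs in the diff above) that keep
    the handles and AccumulateGrad nodes alive.
    """
    if hasattr(param, "register_post_accumulate_grad_hook"):
        # torch >= 2.1: robust to param.set(); the hook receives the param.
        hook_handles.append(param.register_post_accumulate_grad_hook(hook))
    else:
        # Older torch: fall back to the AccumulateGrad-node hook and keep
        # a reference so the node is not garbage collected.
        grad_acc = param.expand_as(param).grad_fn.next_functions[0][0]
        hook_handles.append(grad_acc.register_hook(lambda *args: hook(param)))
        grad_acc_refs.append(grad_acc)
```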
