[BUG] 0.9.0 release version got param_gather_handle error with 3d parallel #1292
I found it's because the chained bucketing order does not match the forward-pass order.
Can you share a small reproduction script?
@deepakn94 Hi Deepak, good to see you! In this scenario, I added custom layers at the end of the transformer layer block (after this line):

    class TransformerLayer(MegatronModule, BaseTransformerLayer):
        def __init__(
            self,
            config: TransformerConfig,
            submodules: TransformerLayerSubmodules,
            layer_number: int = 1,
            hidden_dropout: float = None,
        ):
            ...
            # [Module 8: MLP block]
            # TODO how to set the gpt_layer_spec.py when we have moe_frequency > 1,
            # where MLP and MoE layer both appear alternately?
            self.mlp = build_module(submodules.mlp, config=self.config)
            if hasattr(self.mlp, 'set_layer_number'):
                self.mlp.set_layer_number(self.layer_number)

            # [Module 9: BiasDropoutFusion]
            self.mlp_bda = build_module(submodules.mlp_bda)

            # @jcasper how should we handle nvfuser?
            # Set bias+dropout+add fusion grad_enable execution handler.
            # TORCH_MAJOR = int(torch.__version__.split('.')[0])
            # TORCH_MINOR = int(torch.__version__.split('.')[1])
            # use_nvfuser = TORCH_MAJOR > 1 or (TORCH_MAJOR == 1 and TORCH_MINOR >= 10)
            # self.bias_dropout_add_exec_handler = nullcontext if use_nvfuser else torch.enable_grad
            self.bias_dropout_add_exec_handler = torch.enable_grad

            ## here, custom layers are added at the end of __init__
            self.attn_out_rmsnorm = ...
            self.fc2_rmsnorm = ...

However, these layers are not forwarded in this declaration order, so this dispatch logic breaks:

    # If current bucket's param AG has not been dispatched, dispatch it now (e.g., first
    # AG bucket in first model chunk if ddp_config.align_param_gather is False).
    if not self.param_gather_dispatched:
        self.start_param_sync()

    if self.param_gather_handle is not None:
        self.param_gather_handle.wait()
        self.param_gather_handle = None

    # Dispatch next bucket's asynchronous param AG.
    if self.next_param_gather_bucket_group is not None and not skip_next_bucket_dispatch:
        self.next_param_gather_bucket_group.start_param_sync()

So I fixed the above code snippet like this:

    if self.param_gather_handle is not None:
        self.param_gather_handle.wait()
        self.param_gather_handle = None

    # Dispatch next bucket's asynchronous param AG.
    if (
        self.next_param_gather_bucket_group is not None and not skip_next_bucket_dispatch
    ) and (
        not self.next_param_gather_bucket_group.param_gather_dispatched
    ):
        self.next_param_gather_bucket_group.start_param_sync()

If my explanation lacks information, please reply again or email me, thank you!
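To make the ordering problem concrete, here is a minimal, self-contained sketch (illustrative only, not the actual Megatron-Core bucket-group code; the Bucket class, the names, and the bare assert are made up) of how finishing one bucket eagerly dispatches the next bucket in declaration order, and why the extra param_gather_dispatched check matters when the forward pass reaches a later-declared bucket first:

    class Bucket:
        """Stand-in for a param all-gather bucket group (illustrative only)."""

        def __init__(self, name):
            self.name = name
            self.param_gather_dispatched = False
            self.param_gather_handle = None
            self.next_param_gather_bucket_group = None

        def start_param_sync(self):
            # The real code launches an asynchronous all-gather here; dispatching
            # the same bucket twice is the inconsistency the patch guards against.
            assert not self.param_gather_dispatched, f"{self.name} dispatched twice"
            self.param_gather_dispatched = True
            self.param_gather_handle = object()  # stand-in for an async work handle

        def finish_param_sync(self):
            # Mirrors the patched logic above.
            if not self.param_gather_dispatched:
                self.start_param_sync()
            self.param_gather_handle = None  # the real code waits on the handle first
            nxt = self.next_param_gather_bucket_group
            # Patched condition: skip a bucket the forward pass already dispatched.
            if nxt is not None and not nxt.param_gather_dispatched:
                nxt.start_param_sync()

    # Declaration order: mlp first, then the custom norm; forward uses the norm first.
    mlp, norm = Bucket("mlp"), Bucket("custom_norm")
    mlp.next_param_gather_bucket_group = norm
    norm.finish_param_sync()  # the custom layer's bucket is reached first in forward
    mlp.finish_param_sync()   # without the extra check, this would dispatch norm a second time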
Thank you for providing this patch! I’ve tested it, and it indeed allows training to proceed. However, I’ve observed an issue with checkpointing: after saving a checkpoint, the loss immediately diverges. This suggests that the checkpointing logic is also affected by the mismatch in parameter declaration and usage order. As a temporary workaround, I’ve adjusted the parameter declaration order to align with the forward pass usage order. Let me know if there’s a more robust solution in progress or if additional details from my setup would help with debugging.
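For illustration, here is a tiny PyTorch-only sketch of that reordering (made-up module names, plain LayerNorm standing in for the custom RMSNorms; not Megatron code): parameters are registered in declaration order, so declaring each norm right after the module it follows keeps parameter order, and hence the bucket chain built from it, aligned with the forward pass.

    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self):
            super().__init__()
            self.attn = nn.Linear(8, 8)
            self.attn_out_norm = nn.LayerNorm(8)  # declared right after the module it follows
            self.mlp = nn.Linear(8, 8)
            self.fc2_norm = nn.LayerNorm(8)       # likewise for the norm on the fc2 output

        def forward(self, x):
            x = self.attn_out_norm(self.attn(x))
            return self.fc2_norm(self.mlp(x))

    # named_parameters() yields parameters in declaration order, which now matches forward order.
    print([name for name, _ in Block().named_parameters()])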
@fanzhongyi Thank you for the test! To be honest, I didn't check checkpoint loading. I'll test this too. Thank you so much :)
@deepakn94 May I ask your opinion, sir?
It's a 4-node experiment using the distributed optimizer, with overlap param gather and overlap grad all-reduce set to True, and tp=2, pp=4.
I don't know why the next linear (fc2) layer's next_param_gather_bucket_group has an async param_gather context manager ...?
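For reference, the setup described above roughly corresponds to a DDP configuration like the sketch below (the import path and field names are assumptions based on Megatron-Core's DistributedDataParallelConfig; tp=2 and pp=4 are passed to the parallel-state initialization elsewhere):

    # Sketch of the reported setup; field names are assumed, not verified against
    # the exact 0.9.0 release.
    from megatron.core.distributed import DistributedDataParallelConfig

    ddp_config = DistributedDataParallelConfig(
        use_distributed_optimizer=True,
        overlap_grad_reduce=True,   # overlap grad reduction with the backward pass
        overlap_param_gather=True,  # overlap param all-gather with the forward pass
    )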