In the Megatron-LM repo (https://github.com/NVIDIA/Megatron-LM/blob/v3.0.2/megatron/mpu/initialize.py#L62), there are three places where process groups are created through torch.distributed.new_group.
If I set os.environ["NCCL_SHARP_DISABLE"] = "1" after the data parallel group is created, the expected result is that the data parallel pg will allocate SHARP resources, while the model parallel pg and the tensor parallel pg will not.
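To make the intended ordering concrete, here is a minimal sketch (the rank lists are placeholders, not the actual groups built in megatron/mpu/initialize.py):

```python
import os
import torch.distributed as dist

# Minimal sketch of the intended ordering; assumes launch via torchrun
# with at least 2 ranks. The rank lists below are placeholders.
dist.init_process_group(backend="nccl")

# 1) Data parallel group, created while SHARP is still enabled:
#    expected to allocate SHARP resources.
data_parallel_group = dist.new_group(ranks=[0, 1])

# 2) Disable SHARP before creating the remaining groups.
os.environ["NCCL_SHARP_DISABLE"] = "1"

# 3) Model parallel / tensor parallel groups, created afterwards:
#    expected NOT to allocate SHARP resources.
model_parallel_group = dist.new_group(ranks=[0, 1])
tensor_parallel_group = dist.new_group(ranks=[0, 1])
```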
But according to https://github.com/Mellanox/nccl-rdma-sharp-plugins/blob/master/src/sharp_plugin.c#L252 and my experiment, the debug log reports "SHARP: Set to disable on this communicator" and none of the process groups allocate SHARP resources, which is not in line with my expectations.
Could you check this problem?