Using SHARP fails: sharp_coll_comm_init returns an error. #151
Comments
Do your IB network switches support SHARP?
Yes, my IB network switch does support SHARP, and the HCA card is CX7. The sharp_am service is already enabled on the opensm master:
[root@C27L1 ~]# ps aux |grep opensm
Mar 17 22:07:23 C27L1 sharp_am[18222]: Package: sharp-rc3
@shanleo1986 I have the same issue. Have you solved it? Also, nccl-test runs normally with SHARP, but I get this error when running the Megatron-LM GPT-3 example.
@liuxingbo12138 try add
Has anyone run IB SHARP successfully? I hope to get some help. Thanks.
@AddyLaddy Hi AddyLaddy,
Yes, we use NCCL IB SHARP on 8x CX-7 NICs concurrently on our internal machines, but our network uses a rail-optimized topology for the first layers.
Hi developers,
I have set up the SHARP environment, and the SHARP plugin loads successfully.
However, when sharp_coll_comm_init runs, it returns an error, so NCCL ends up using P2P NET instead.
Could you help me analyze this issue? Thank you!
The following is the error log:
[C25L18:0:24972 - context.c:702] INFO job (ID: 1201360188720575732) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[C25L18:0:24972 - context.c:895] INFO sharp_job_id:12 resv_key: tree_type:LLT tree_idx:0 treeID:0x1 caps:0x26 quota:(osts:23 user_data_per_ost:1024 max_groups:23 max_qps:1 max_group_channels:1)
[C25L18:0:24972 - context.c:899] INFO sharp_job_id:12 tree_type:SAT tree_idx:1 treeID:0x40 caps:0x36
C25L19:19373:19491 [3] NCCL INFO Sharp rank 1/2 initialized on mlx5_5:1
C25L18:24972:25066 [3] NCCL INFO Sharp rank 0/2 initialized on mlx5_5:1
[C25L18:0:24972 - comm.c:374] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
[C25L19:1:19373 - comm.c:370] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
C25L19:19373:19491 [3] sharp_plugin.c:302 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)
C25L18:24972:25066 [3] sharp_plugin.c:302 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)
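For anyone comparing logs across nodes, here is a rough Python sketch that pulls the "Failed to lock SAT tree" lines out of a log like the one above and reports the tree ID and return code per host. The regular expression is inferred only from the two error lines in this issue, and the function name and the labels for the bracketed prefix fields (host, index, pid) are my own guesses, not anything defined by SHARP or NCCL.

```python
import re

# Pattern inferred from the two ERROR lines in this issue only.
# The meaning of the bracketed prefix fields (host, index, pid) is an assumption.
SAT_LOCK_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<idx>\d+):(?P<pid>\d+) - (?P<src>[\w.]+:\d+)\]\s+"
    r"ERROR Failed to lock SAT tree\(ID:(?P<tree_id>0x[0-9a-fA-F]+) "
    r"ret:(?P<ret>0x[0-9a-fA-F]+)\)"
)


def find_sat_lock_failures(log_text):
    """Return one dict per 'Failed to lock SAT tree' line found in log_text."""
    failures = []
    for line in log_text.splitlines():
        match = SAT_LOCK_RE.search(line)
        if match:
            failures.append({
                "host": match.group("host"),
                "source": match.group("src"),
                "tree_id": int(match.group("tree_id"), 16),
                "ret": int(match.group("ret"), 16),
            })
    return failures


# Sample lines copied from the log in this issue.
SAMPLE = """\
[C25L18:0:24972 - context.c:899] INFO sharp_job_id:12 tree_type:SAT tree_idx:1 treeID:0x40 caps:0x36
[C25L18:0:24972 - comm.c:374] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
[C25L19:1:19373 - comm.c:370] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
"""

if __name__ == "__main__":
    for failure in find_sat_lock_failures(SAMPLE):
        print(
            f"{failure['host']}: SAT tree 0x{failure['tree_id']:x} "
            f"lock failed, ret=0x{failure['ret']:x}"
        )
```

This only summarizes which hosts hit the lock failure and on which tree; it does not interpret what the return code means.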