Using SHARP failed which sharp_coll_comm_init running failed. #151

Open
shanleo2024 opened this issue Mar 18, 2024 · 9 comments

Comments

@shanleo2024

shanleo2024 commented Mar 18, 2024

Hi developers,
I have set up the SHARP environment, and the SHARP plugin loads successfully.
However, when sharp_coll_comm_init runs it returns an error, so NCCL ends up using P2P NET instead.
Could you help me analyze this issue? Thank you!

The following is the error log:
[C25L18:0:24972 - context.c:702] INFO job (ID: 1201360188720575732) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[C25L18:0:24972 - context.c:895] INFO sharp_job_id:12 resv_key: tree_type:LLT tree_idx:0 treeID:0x1 caps:0x26 quota:(osts:23 user_data_per_ost:1024 max_groups:23 max_qps:1 max_group_channels:1)
[C25L18:0:24972 - context.c:899] INFO sharp_job_id:12 tree_type:SAT tree_idx:1 treeID:0x40 caps:0x36
C25L19:19373:19491 [3] NCCL INFO Sharp rank 1/2 initialized on mlx5_5:1
C25L18:24972:25066 [3] NCCL INFO Sharp rank 0/2 initialized on mlx5_5:1
[C25L18:0:24972 - comm.c:374] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
[C25L19:1:19373 - comm.c:370] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)

C25L19:19373:19491 [3] sharp_plugin.c:302 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)
C25L18:24972:25066 [3] sharp_plugin.c:302 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)
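
When a log like the one above spans many ranks, it can help to summarize which hosts failed to lock which SAT tree. The following is only a rough sketch: the regular expression is written against the two `comm.c` error lines shown above and may need adjusting for other SHARP versions, and the function name `sat_lock_failures` is made up for this example.

```python
import re
from collections import defaultdict

# Pattern modelled on the error lines quoted in this issue, e.g.
# [C25L18:0:24972 - comm.c:374] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
SAT_LOCK_RE = re.compile(
    r"\[(?P<host>[^:\]]+):\d+:\d+ - comm\.c:\d+\] "
    r"ERROR Failed to lock SAT tree\(ID:(?P<tree>0x[0-9a-fA-F]+) "
    r"ret:(?P<ret>0x[0-9a-fA-F]+)\)"
)


def sat_lock_failures(log_text):
    """Map each SAT tree ID to the sorted list of hosts that failed to lock it."""
    failures = defaultdict(set)
    for match in SAT_LOCK_RE.finditer(log_text):
        failures[match["tree"]].add(match["host"])
    return {tree: sorted(hosts) for tree, hosts in failures.items()}


# The two error lines from the log in this issue.
sample = """\
[C25L18:0:24972 - comm.c:374] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
[C25L19:1:19373 - comm.c:370] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
"""

print(sat_lock_failures(sample))
```

For the log in this issue, both hosts (C25L18 and C25L19) report the same tree ID, 0x40.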

@AddyLaddy
Collaborator

Do your IB network switches support SHARP?
Have you enabled the SHARP feature in the UFM/OpenSM configuration?

@shanleo2024
Author

> Do your IB network switches support SHARP? Have you enabled the SHARP feature in the UFM/OpenSM configuration?

Yes, my IB network switches do support SHARP, and the HCA cards are CX7:
[root@C25L18 shanxs]# lspci | grep -i mell
03:00.0 Infiniband controller: Mellanox Technologies MT28861
23:00.0 Infiniband controller: Mellanox Technologies MT28861
44:00.0 Infiniband controller: Mellanox Technologies MT28861
64:00.0 Infiniband controller: Mellanox Technologies MT28861
[root@C25L18 shanxs]#

The sharp_am service is already enabled on the OpenSM master:

[root@C27L1 ~]# ps aux |grep opensm
root 8373 0.0 0.0 217848 1076 pts/0 R+ 16:59 0:00 grep --color=auto opensm
root 18074 0.2 0.0 6932984 24728 ? Sl Mar17 2:33 /usr/sbin/opensm --daemon
[root@C27L1 ~]# service sharp_am status
Redirecting to /bin/systemctl status sharp_am.service
● sharp_am.service - SHARP Aggregation Manager (sharp_am). Version: 3.0.0
Loaded: loaded (/etc/systemd/system/sharp_am.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/sharp_am.service.d
└─Service.conf
Active: active (running) since Sun 2024-03-17 22:07:23 UTC; 18h ago
Main PID: 18222 (sharp_am)
Tasks: 40 (limit: 26213)
Memory: 31.7M
CGroup: /system.slice/sharp_am.service
└─18222 /opt/hpc/software/mpi/hpcx/v2.12.0/sharp/bin/sharp_am -O -/opt/hpc/software/mpi/hpcx/v2.12.0/sharp/conf/sharp_am.cfg

Mar 17 22:07:23 C27L1 sharp_am[18222]: Package: sharp-rc3
Mar 17 22:07:23 C27L1 sharp_am[18222]: Version: 3.0.0
Mar 17 22:07:23 C27L1 sharp_am[18222]: Build Date: Jul 20 2022
Mar 17 22:07:23 C27L1 sharp_am[18222]: Last commit: cf51a32
Mar 17 22:07:23 C27L1 sharp_am[18222]: IBIS last commit: 3c41903
Mar 17 22:07:23 C27L1 sharp_am[18222]: Log verbosity: 2
Mar 17 22:07:23 C27L1 sharp_am[18222]: Syslog verbosity: 1
Mar 17 22:07:23 C27L1 sharp_am[18222]: Command line: /opt/hpc/software/mpi/hpcx/v2.12.0/sharp/bin/sharp_am -O -/opt/hpc/software/mpi/hpcx/v2.12.0/sharp/conf/sharp_am.cfg
Mar 17 22:07:24 C27L1 sharp_am[18222]: There is not a single tree that spans over all leafs.
Mar 17 22:07:24 C27L1 sharp_am[18222]: Built 2 trees.
[root@C27L1 ~]#
[root@C27L1 ~]#

@Lzhang-hub

@shanleo1986 I have the same issue. Have you solved it? Also, running nccl-test with SHARP works normally, but I get this error when running the Megatron-LM GPT-3 example.

@liuxingbo12138

> @shanleo1986 I have the same issue. Have you solved it? Also, running nccl-test with SHARP works normally, but I get this error when running the Megatron-LM GPT-3 example.

Me too. I used NGC to run Megatron-LM with SHARP and it failed. Did you resolve it?

@Lzhang-hub

@liuxingbo12138 Try adding use_sharp=True to the initialize_model_parallel call.
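
For anyone unsure where that flag goes, here is a minimal sketch. It assumes a Megatron-Core version whose `megatron.core.parallel_state.initialize_model_parallel` accepts a `use_sharp` argument, and that `torch.distributed` has already been initialized by the launcher. The helper names `build_parallel_kwargs` and `init_megatron_with_sharp` are hypothetical, and the parallel sizes are placeholders to replace with your own.

```python
def build_parallel_kwargs(tensor_parallel=1, pipeline_parallel=1, use_sharp=True):
    """Collect the keyword arguments to pass to initialize_model_parallel."""
    return {
        "tensor_model_parallel_size": tensor_parallel,      # placeholder value
        "pipeline_model_parallel_size": pipeline_parallel,  # placeholder value
        "use_sharp": use_sharp,  # assumption: supported by your Megatron-Core version
    }


def init_megatron_with_sharp(**overrides):
    """Call Megatron-Core's initialize_model_parallel with use_sharp set.

    Imported lazily so this file can be read and tested without Megatron-Core
    installed. Requires an initialized torch.distributed process group.
    """
    from megatron.core import parallel_state

    parallel_state.initialize_model_parallel(**build_parallel_kwargs(**overrides))


print(build_parallel_kwargs())
```

In a real training script you would call `init_megatron_with_sharp()` once, after `torch.distributed.init_process_group(...)`. If your Megatron-Core version rejects the `use_sharp` keyword, the flag is not available there and this suggestion does not apply as written.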

@shanleo2024
Author

Has anyone run IB SHARP successfully? I hope to get some help. Thanks.

@shanleo2024
Author

@AddyLaddy Hi AddyLaddy,
Do NCCL and NCCL IB SHARP currently support multiple IB HCAs?
On my setup, IB SHARP only works when using one HCA.
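
To reproduce the single-HCA case in isolation, one option is to restrict NCCL to a single device through environment variables. This is a sketch only: `mlx5_5` is taken from the log earlier in this thread and is an assumption about which HCA to pin, and the helper name `single_hca_env` is made up. `NCCL_COLLNET_ENABLE`, `NCCL_IB_HCA`, and `NCCL_DEBUG` are standard NCCL environment variables.

```python
import os


def single_hca_env(hca="mlx5_5", base=None):
    """Return a copy of the environment that pins NCCL to one IB HCA."""
    env = dict(os.environ if base is None else base)
    env["NCCL_COLLNET_ENABLE"] = "1"  # allow CollNet (SHARP) collectives
    env["NCCL_IB_HCA"] = f"={hca}"    # leading "=" requests an exact device-name match
    env["NCCL_DEBUG"] = "INFO"        # keep the SHARP init messages in the log
    return env


env = single_hca_env(base={})
print(env)
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the job. Comparing that run against one with `NCCL_IB_HCA` left unset should show whether the failure only appears when several HCAs are in use.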

@AddyLaddy
Collaborator

Yes, we use NCCL IB SHARP on 8x CX-7 NICs concurrently on our internal machines, but our network uses a rail-optimized topology for the first layers.
