[BUG] Can't find nccl when building from source #28

Open
KnowingNothing opened this issue Aug 3, 2024 · 5 comments

@KnowingNothing

Describe the bug

libnccl.so cannot be found when building from source. It seems flux builds only the static NCCL library, not the shared one, but reduce_scatter requires the shared NCCL library.

To Reproduce

run ./build.sh --arch 80

Expected behavior
The build should link successfully. Instead, linking fails: cannot find -lnccl

Environment

Linux hina 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0

8x A100 80GB PCIe cards

@KnowingNothing
Author

I tried to fix this with the following two changes.

Add

find_library(NCCL_LIB
             NAMES nccl_static
             PATHS ${PROJECT_SOURCE_DIR}/3rdparty/nccl/build/lib
             NO_DEFAULT_PATH)
if (NCCL_LIB)
  message(STATUS "Found nccl static lib in " ${PROJECT_SOURCE_DIR}/3rdparty/nccl/build/lib)
else()
  message(STATUS "Can't find nccl static lib in " ${PROJECT_SOURCE_DIR}/3rdparty/nccl/build/lib)
endif()
target_include_directories(${LIB_NAME} PRIVATE ${PROJECT_SOURCE_DIR}/3rdparty/nccl/build/include)
target_link_libraries(${LIB_NAME} PUBLIC ${NCCL_LIB})

to flux/src/reduce_scatter/CMakeLists.txt (after line 17)

Change include_dirs = [root_path / "include", root_path / "src"] to include_dirs = [root_path / "include", root_path / "src", root_path / "3rdparty/nccl/build/include"] in flux/setup.py (line 128).
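The setup.py change above could be written defensively, so that the NCCL include directory is added only when the bundled build has actually produced its headers. This is a minimal sketch, not the project's code: the helper name build_include_dirs is hypothetical, and it assumes the bundled NCCL build places nccl.h under 3rdparty/nccl/build/include.

```python
from pathlib import Path


def build_include_dirs(root_path: Path) -> list:
    """Return the include dirs, adding bundled NCCL headers only if present."""
    include_dirs = [root_path / "include", root_path / "src"]
    # Assumption: the bundled NCCL build writes nccl.h into this directory.
    nccl_include = root_path / "3rdparty" / "nccl" / "build" / "include"
    if (nccl_include / "nccl.h").is_file():
        include_dirs.append(nccl_include)
    else:
        # Warn instead of failing so a system-wide NCCL can still be used.
        print(f"warning: nccl.h not found in {nccl_include}")
    return include_dirs
```

Warning rather than raising leaves room for a system or conda NCCL install to supply the header instead.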

@wenlei-bao
Collaborator

cc @zheng-ningxin

@zheng-ningxin zheng-ningxin self-assigned this Aug 9, 2024
@wenlei-bao
Collaborator

@KnowingNothing Does this still apply, or not?

@wenlei-bao
Collaborator

@KnowingNothing does this still apply?

@Zhuohao-Li

I ran into the same issue and solved it with conda install -c nvidia nccl

Also check whether the NCCL path is included. It should be somewhere under /usr/local, but sometimes it is not there.

For me it was in either a venv environment or /usr/lib/
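To check where NCCL actually lives on a given machine, and whether a shared or only a static library is present, a small script like the following may help. It is a rough sketch: the directory list is an assumption based on the locations mentioned in this thread plus a few common Linux defaults, and the function names candidate_dirs and find_nccl are hypothetical.

```python
import os
import sys
from pathlib import Path


def candidate_dirs():
    """Directories that commonly hold NCCL (assumed list, adjust as needed)."""
    dirs = [
        Path("/usr/lib"),
        Path("/usr/lib/x86_64-linux-gnu"),
        Path("/usr/local/lib"),
        Path("/usr/local/cuda/lib64"),
        Path(sys.prefix) / "lib",  # active venv or conda environment
    ]
    conda_prefix = os.environ.get("CONDA_PREFIX")
    if conda_prefix:
        dirs.append(Path(conda_prefix) / "lib")
    for entry in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        if entry:
            dirs.append(Path(entry))
    return dirs


def find_nccl(dirs):
    """Group libnccl* files found in dirs into shared and static libraries."""
    found = {"shared": [], "static": []}
    for d in dirs:
        if not d.is_dir():
            continue
        for path in sorted(d.glob("libnccl*")):
            if ".so" in path.name:
                found["shared"].append(path)
            elif path.suffix == ".a":
                found["static"].append(path)
    return found


if __name__ == "__main__":
    for kind, paths in find_nccl(candidate_dirs()).items():
        print(kind, [str(p) for p in paths] or "none found")
```

If only static entries show up, that matches the situation described in the original report, where -lnccl cannot be resolved against a shared library.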
