Installation of the Library #54

i3s93 · 2024-07-25T19:05:17Z

I would like to use tools from this library in one of my projects, but I'm having some difficulties with the installation process on a Linux cluster.

I have extracted and set the library path to the shared object files for cuDSS following the directions given here. After installing CUDSS.jl, I tried to execute the following test:

using CUDA, CUDA.CUSPARSE, CUDSS, LinearAlgebra, SparseArrays
A = CuSparseMatrixCSR(sprand(100, 100, 0.1))
solver = CudssSolver(A, "G", 'F')

On the third line, I receive the following error message:

ERROR: UndefVarError: `libcudss` not defined

I'm not sure what I am doing wrong. I have also tried setting the environment variable JULIA_CUDSS_LIBRARY_PATH which is used to set the path for libcudss. Something is not being set properly. I'm using CUDA.jl (v5.4.3) and CUDSS.jl (v0.3.1) on Julia v1.9, if that helps.

amontoison · 2024-07-25T19:17:27Z

@i3s93 You don't need to install anything related to the source code of cuDSS.
We have an artifact system (CUDSS_jll.jl) that automatically downloads and installs cuDSS for users.

You just need

julia> ]
pkg> add CUDSS

It's explained in the README.md but I should add a note that it also installs the shared library.
You should be able to run any Julia example after that.

i3s93 · 2024-07-25T19:31:54Z

Thank you @amontoison for your rapid response. I actually started with the base installation in the README.md, but encountered the same error message. That is why I tried to manually set the path, but neither approach worked for me. Here is what I see on my end when I execute the code from my previous comment:

ERROR: UndefVarError: `libcudss` not defined
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/CUDA/Tl08O/lib/utils/call.jl:218 [inlined]
  [2] macro expansion
    @ ~/.julia/packages/CUDSS/2E89a/src/libcudss.jl:245 [inlined]
  [3] #31
    @ ~/.julia/packages/CUDA/Tl08O/lib/utils/call.jl:35 [inlined]
  [4] retry_reclaim(f::CUDSS.var"#31#32"{Base.RefValue{Ptr{CUDSS.cudssMatrix}}, Int64, Int64, Int32, CuArray{Int32, 1, CUDA.DeviceMemory}, CuPtr{Nothing}, CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{Float64, 1, CUDA.DeviceMemory}, DataType, DataType, String, Char, Char}, retry_if::CUDSS.var"#retry_if#49")
    @ CUDA ~/.julia/packages/CUDA/Tl08O/src/memory.jl:434
  [5] check
    @ ~/.julia/packages/CUDSS/2E89a/src/error.jl:45 [inlined]
  [6] cudssMatrixCreateCsr
    @ ~/.julia/packages/CUDA/Tl08O/lib/utils/call.jl:34 [inlined]
  [7] CudssMatrix(A::CuSparseMatrixCSR{Float64, Int32}, structure::String, view::Char; index::Char)
    @ CUDSS ~/.julia/packages/CUDSS/2E89a/src/helpers.jl:81
  [8] CudssMatrix
    @ ~/.julia/packages/CUDSS/2E89a/src/helpers.jl:78 [inlined]
  [9] _
    @ ~/.julia/packages/CUDSS/2E89a/src/interfaces.jl:40 [inlined]
 [10] CudssSolver(A::CuSparseMatrixCSR{Float64, Int32}, structure::String, view::Char)
    @ CUDSS ~/.julia/packages/CUDSS/2E89a/src/interfaces.jl:39
 [11] top-level scope
    @ REPL[3]:1

amontoison · 2024-07-25T19:58:28Z

Can you remove the environment variable JULIA_CUDSS_LIBRARY_PATH and try to recompile CUDSS.jl with:

force_recompile(package_name::String) = Base.compilecache(Base.identify_package(package_name))
force_recompile("CUDSS")
using CUDSS

amontoison · 2024-07-25T19:59:55Z

If it's still not working, what is your NVIDIA GPU and operating system / architecture?

i3s93 · 2024-07-25T20:20:37Z

I tried your solution, but I'm still seeing the same problem. I'm running with an NVIDIA A100 GPU with an AMD EPYC 7763 processor. The operating system is SUSE Linux Enterprise Server 15 SP4.

amontoison · 2024-07-26T04:47:39Z

Did you install CUDSS.jl on a node with a GPU initially?
I would try forcing Julia to reinstall the artifacts with:

rm -rf ~/.julia/artifacts/*

amontoison · 2024-07-26T04:53:22Z

Can you also display the output of:

julia> CUDSS_jll.host_platform
Linux x86_64 {cuda=none, cuda_local=false, cxxstring_abi=cxx11, julia_version=1.10.4, libc=glibc, libgfortran_version=5.0.0, libstdcxx_version=3.4.30}

On my laptop I don't have an NVIDIA GPU so the shared library of cuDSS is not installed.

Are the NVIDIA drivers installed on your computer?

i3s93 · 2024-07-26T21:41:22Z

Okay, I have removed the artifacts as you have suggested. When I installed the package, I was on a node with the A100. Here is the output you requested:

julia> CUDSS_jll.host_platform
Linux x86_64 {cuda=12.2, cuda_local=true, cxxstring_abi=cxx11, julia_version=1.9.4, libc=glibc, libgfortran_version=5.0.0, libstdcxx_version=3.4.30}

I still see the same error message.
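
A JLL package can load without error even when it found no binary for the host platform, in which case variables such as `libcudss` are never defined. The `cuda_local=true` tag in the output above may be what prevents an artifact from matching here. The sketch below reports this for any JLL module; it relies on `is_available()`, which JLLWrappers.jl generates for JLL packages, and on the `host_platform` field already queried in this thread. The helper name `jll_status` is hypothetical.

```julia
# Summarize whether a JLL package found a binary for this machine.
# `jll` is any JLL module, e.g. CUDSS_jll after `import CUDSS_jll`.
# `jll_status` is an illustrative helper, not part of CUDSS.jl or CUDA.jl.
function jll_status(jll::Module)
    available = jll.is_available()          # false when no artifact matches the platform
    platform  = string(jll.host_platform)   # includes tags such as cuda=... and cuda_local=...
    return (; available, platform)
end
```

Assuming CUDSS_jll can be imported in the active environment, `jll_status(CUDSS_jll)` returning `available = false` would be consistent with the `UndefVarError` for `libcudss`.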

i3s93 · 2024-07-26T21:46:46Z

Just to follow up, I was able to install and run the code from the package locally on a laptop with an NVIDIA GPU. So far, I have only been able to see this issue when I try to install the package on a remote cluster. I will reach out to the system administrators and see if something on their end is disrupting the installation.

carstenbauer · 2024-07-27T04:29:41Z

Are you using a module on the cluster to get Julia? (I.e. module load ...) If so, can you post the output of module show ...?

It seems that you're trying to use a local cuda. Assuming that wasn't your intention and own doing, it might be a global preference that is set when you load a Julia module.

Btw, which cluster is this?

i3s93 · 2024-07-28T19:52:28Z

@carstenbauer: This is on Perlmutter, if that helps. Here is the output of module list:

Currently Loaded Modules:
  1) craype-x86-milan     3) craype-network-ofi                      5) PrgEnv-gnu/8.5.0   7) cray-libsci/23.12.5   9) craype/2.7.30    11) perftools-base/23.12.0  13) craype-accel-nvidia80  15) julia/1.9.4
  2) libfabric/1.15.2.0   4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta   6) cray-dsmml/0.2.2   8) cray-mpich/8.1.28    10) gcc-native/12.3  12) cpe/23.12               14) gpu/1.0                16) cudatoolkit/12.2 (g)

  Where:
   g:  built for GPU

I can run any of my Julia CUDA codes fine without the CUDA modules, so the CUDA Toolkit is not necessary. I see the same error regardless of whether or not this module is loaded.

carstenbauer · 2024-07-29T13:45:38Z

@i3s93 I just tested this on Perlmutter.

If I use the julia module (module load julia) I can reproduce your error message.

However, if I

unset JULIA_LOAD_PATH (to get rid of the global Julia preferences set by the module)
and module unload cudatoolkit (not necessary but better to avoid potential conflicts),

your test above works without any issues in a clean Julia environment that just has CUDA and CUDSS in it.

JBlaschke · 2024-07-29T13:59:15Z

The environment in the global JULIA_LOAD_PATH is used to specify the CUDA version (to stop Julia from installing a version of the CUDA runtime that is incompatible with the system) and the MPI configuration. I suspect the latter has no effect here.

@i3s93 did unsetting JULIA_LOAD_PATH cause pkg> add CUDSS to install a newer version of CUDA?
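
To see which preferences a module-provided load path injects, one option is to scan every environment on the load path for a preferences file. This is a minimal sketch using only Base and the TOML standard library; the function name `find_preferences` is hypothetical, and it only looks at `LocalPreferences.toml` / `JuliaLocalPreferences.toml` files, ignoring any `[preferences]` table inside a `Project.toml`.

```julia
using TOML  # standard library

# Collect the preferences set for `pkg` in every environment on the load path.
# Returns a vector of `file => settings` pairs (empty if nothing is set).
# `find_preferences` is an illustrative helper, not an existing API.
function find_preferences(pkg::AbstractString; load_path = Base.load_path())
    hits = Pair{String, Any}[]
    for entry in load_path
        # Entries are usually project files, but can also be plain directories.
        dir = isdir(entry) ? entry : dirname(entry)
        for name in ("LocalPreferences.toml", "JuliaLocalPreferences.toml")
            file = joinpath(dir, name)
            isfile(file) || continue
            prefs = TOML.parsefile(file)
            haskey(prefs, pkg) && push!(hits, file => prefs[pkg])
        end
    end
    return hits
end

# Print where (if anywhere) CUDA_Runtime_jll preferences come from.
for (file, prefs) in find_preferences("CUDA_Runtime_jll")
    println(file, " => ", prefs)
end
```

Running this once with the module's JULIA_LOAD_PATH set and once after unsetting it should show whether the `version` and `local` settings come from the site-wide environment.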

carstenbauer · 2024-07-29T14:03:30Z

> @i3s93 did unsetting JULIA_LOAD_PATH cause pkg> add CUDSS to install a newer version of CUDA?

@JBlaschke I assume the question was for me, because I was the one that did the (successful) test with unset JULIA_LOAD_PATH. And to answer it, yes, afterwards I get 12.5 (instead of 12.2):

julia> CUDA.versioninfo()
CUDA runtime 12.5, artifact installation
CUDA driver 12.0
NVIDIA driver 525.105.17

CUDA libraries:
- CUBLAS: 12.5.3
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.3
- CUSPARSE: 12.5.1
- CUPTI: 2024.2.1 (API 23.0.0)
- NVML: 12.0.0+525.105.17

Julia packages:
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0

Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6

1 device:
  0: NVIDIA A100-PCIE-40GB (sm_80, 38.984 GiB / 40.000 GiB available)

For comparison, this is if I don't unset and don't unload the cudatoolkit module:

julia> CUDA.versioninfo()
CUDA runtime 12.2, local installation
CUDA driver 12.2
NVIDIA driver 525.105.17

CUDA libraries:
- CUBLAS: 12.2.1
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.0
- CUSPARSE: 12.1.1
- CUPTI: 2023.2.0 (API 20.0.0)
- NVML: 12.0.0+525.105.17

Julia packages:
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0
- CUDA_Runtime_Discovery: 0.3.4

Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6

Preferences:
- CUDA_Runtime_jll.version: 12.2
- CUDA_Runtime_jll.local: true

1 device:
  0: NVIDIA A100-PCIE-40GB (sm_80, 38.984 GiB / 40.000 GiB available)

JBlaschke · 2024-07-29T14:46:27Z

Thanks @carstenbauer for checking. So libcudss doesn't appear to be in the cudatoolkit module. I'll see if it's installed anywhere.

One more thing: does the artifact even work on a compute node? For previous versions we would get segfaults.

JBlaschke · 2024-07-29T20:17:05Z

It looks like we don't have a version on Perlmutter yet. I might go and check the artifact install of CUDA. If that doesn't work I'd need to develop a module.

amontoison · 2024-07-30T04:48:11Z

@JBlaschke Do you mean the artifact of cuDSS?
The recent version 0.3.0 works fine without segmentation faults.

JBlaschke · 2024-07-30T12:35:20Z

@amontoison no I meant running CUDA.jl using the artifact CUDA (instead of the one provided by the OS)

JBlaschke · 2024-07-30T12:35:35Z

On Perlmutter

i3s93 · 2024-07-30T17:40:52Z

@carstenbauer Thank you for taking the time to help resolve this issue! I can also confirm that unsetting JULIA_LOAD_PATH worked for me.

@JBlaschke Thank you for your help as well! My tests with cuDSS are small scale, so I am fine with unsetting the environment variable until a better solution becomes available.

@amontoison I greatly appreciate the timely feedback and for having a look at this problem. Since this does not appear to be an issue with CUDSS.jl, I'm fine with closing this issue, unless the others would like to continue the discussion!

amontoison · 2024-07-30T18:29:57Z

I am wondering how relevant it would be to detect a local installation of cuDSS:
#55

cuDSS is still in preview so every minor release breaks the API, and it requires the local installation to be always the most recent version, which is probably hard to maintain.

JBlaschke · 2024-07-30T23:59:05Z

@amontoison in the past CUDA would not work at all unless you used the local install on Perlmutter. It might be the case that this is no longer necessary.

I haven't had a chance to test this. Will do so soon. If it is the case that running CUDA_jll is unstable on Perlmutter, then we have no choice but to also use a local CUDSS install...

amontoison · 2024-08-15T04:36:07Z

@carstenbauer @JBlaschke @i3s93
May I ask one of you to test my PR #57?
It should help to detect a local install on Perlmutter.

Do you know why Tim checks whether Julia is precompiling in this __init__ function, which I based my PR on?
Is it to avoid an error when precompiling on a cluster node without GPUs?
