CUDA driver version mismatched with CUDA runtime version #343

Closed
loligans opened this issue May 14, 2024 · 1 comment · Fixed by #387 · May be fixed by #345
Labels
bug Something isn't working

loligans commented May 14, 2024

The GPU driver reports CUDA 12.2 (see nvidia-smi below), but the installed CUDA runtime (per nvcc) is 12.4.

nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000001:00:00.0 Off |                    0 |
| N/A   28C    P0              76W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

The CUDA version mismatch causes GPU_Burn to hang. I believe the GPU driver should be updated to 550.54.15.

If the intended CUDA version is 12.2, the GPU driver can remain at 535.161.08, but the CUDA runtime should be downgraded to 12.2.

If the intended CUDA version is 12.4, the GPU driver should be updated to 550.54.15.

Related issue: wilicc/gpu-burn#7
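
For anyone wanting to detect this condition automatically, here is a minimal sketch that compares the CUDA version reported by nvidia-smi against the nvcc release. The function names are hypothetical and the regular expressions assume the output formats shown above; other driver or toolkit versions may format these lines differently.

```python
import re


def parse_smi_cuda_version(smi_output: str) -> tuple[int, int]:
    """Return (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def parse_nvcc_release(nvcc_output: str) -> tuple[int, int]:
    """Return (major, minor) from the 'release X.Y' field of nvcc --version."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release' field found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def toolkit_newer_than_driver(smi_output: str, nvcc_output: str) -> bool:
    """True when the nvcc release is newer than the CUDA version the driver reports."""
    return parse_nvcc_release(nvcc_output) > parse_smi_cuda_version(smi_output)


# Lines copied from the output in this issue.
SMI_HEADER = "| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |"
NVCC_LINE = "Cuda compilation tools, release 12.4, V12.4.131"

print(toolkit_newer_than_driver(SMI_HEADER, NVCC_LINE))
```

On a live machine the two strings could instead come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` and the equivalent call for `nvcc --version`.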

@LiquidPT
Contributor

There were issues with Fabric Manager 550.54.15, so we had to revert FM and the GPU driver. As per NVIDIA, this version of CUDA should be compatible with the GPU driver:

https://docs.nvidia.com/deploy/cuda-compatibility/index.html#minor-version-comaptibility

CUDA 12.4 has some critical fixes, so using the newer version is preferable.
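
As a rough illustration of the minor-version compatibility rule referenced above, the sketch below checks whether a driver meets the minimum version for a given CUDA major release. The minimum driver versions in the table are assumptions recalled from NVIDIA's compatibility documentation for Linux and should be verified against the linked page; the function names are hypothetical.

```python
# ASSUMPTION: minimum Linux driver versions for minor-version compatibility,
# as recalled from NVIDIA's CUDA compatibility guide. Verify before relying on them.
MIN_DRIVER_FOR_MAJOR = {
    11: (450, 80, 2),
    12: (525, 60, 13),
}


def parse_driver_version(version: str) -> tuple[int, ...]:
    """Convert a driver string such as '535.161.08' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def minor_version_compatible(driver_version: str, toolkit_major: int) -> bool:
    """True when the driver meets the assumed minimum for the toolkit's major version."""
    minimum = MIN_DRIVER_FOR_MAJOR.get(toolkit_major)
    if minimum is None:
        raise ValueError(f"no minimum driver known for CUDA {toolkit_major}.x")
    return parse_driver_version(driver_version) >= minimum


# Driver from this issue (535.161.08) against the CUDA 12.x toolkit family.
print(minor_version_compatible("535.161.08", 12))
```

This only covers the version floor. Features that genuinely need a newer driver, or code paths that depend on PTX JIT compilation, may still fail under minor-version compatibility, so a runtime test remains useful.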

@LiquidPT LiquidPT closed this as completed Aug 8, 2024
@LiquidPT LiquidPT reopened this Oct 16, 2024
This was referenced Oct 17, 2024
@LiquidPT LiquidPT linked a pull request Oct 17, 2024 that will close this issue
@LiquidPT LiquidPT added the bug Something isn't working label Oct 17, 2024
LiquidPT added a commit that referenced this issue Oct 18, 2024
- #343 
    - Added a test to confirm the fix
    - Pull GDRCopy from master for bug fix to be compatible with newer NVIDIA driver versions
- OpenMPI missing hcoll lib path