Environmental Info:
Node(s) CPU architecture, OS, and Version: Linux yotsugi 6.2.16-15-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-15 (2023-09-28T13:53Z) x86_64 GNU/Linux
Cluster Configuration: 1 server
Describe the bug: Anything I run from Nvidia fails with the error:
Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
As a result, the Nvidia GPU is not detected properly.
Steps To Reproduce:
K3s does not include or link against any Nvidia libraries; it just makes the Nvidia container runtimes available to containerd if they are found on the host. It sounds like you're still missing packages and the Nvidia container runtimes are failing to run. I would refer you back to the Nvidia docs on what to install on your operating system.
Then why would it work perfectly fine on Docker and on the host if I'm missing something? I checked on the host and I do have the shared object that it says is missing.
So I got it working (under kubeadm with CRI-O, but this should work with k3s as well). Turns out I'm not the only one who has run into this. As described here and partly in the k3s advanced docs, you need to do a couple of things after installing your GPU driver and setting up the nvidia-container-runtime.
All in all, this is not a k3s problem, but it seems there's some weird stuff that needs to be adjusted before you can run Nvidia workloads in Kubernetes at the moment.
One of those things is to add
runtimeClassName: nvidia
to the pod spec of the workload (I cut out most of the YAML, but this is where it should go):
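A minimal sketch of where the field sits, assuming a RuntimeClass named nvidia already exists on the cluster (the k3s advanced docs describe creating one whose handler is nvidia). The pod name, container name, image tag, and command below are hypothetical placeholders and are not taken from the original post:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                 # hypothetical name
spec:
  runtimeClassName: nvidia       # pod-level field, a sibling of "containers"
  restartPolicy: Never
  containers:
    - name: cuda                 # hypothetical container name
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed image; use whatever your workload needs
      command: ["nvidia-smi"]    # assumed smoke test that lists the visible GPUs
```

For a Deployment or DaemonSet the same field goes under spec.template.spec rather than directly under spec. If the Nvidia device plugin is in use, a nvidia.com/gpu resource limit would normally be set on the container as well.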