k3s with nvidia gpu not working as intended #8596

unixbird · 2023-10-12T02:34:40Z

Environmental Info:
K3s Version: k3s version v1.27.6+k3s1 (bd04941)

Node(s) CPU architecture, OS, and Version: Linux yotsugi 6.2.16-15-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-15 (2023-09-28T13:53Z) x86_64 GNU/Linux

Cluster Configuration: 1 server

Describe the bug:

Anything I run from nvidia seems to get the error: Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

As a result, the NVIDIA GPU is not detected.

Steps To Reproduce:
I followed the instructions at https://docs.k3s.io/advanced and attempted to install NFD (Node Feature Discovery) and GFD (GPU Feature Discovery). Both fail with the missing shared object error above. Any pod that requests an NVIDIA GPU stays in the Pending state. To isolate the issue, I confirmed that nvidia-smi runs both in Docker and on the host itself.

Answered by unixbird

Oct 14, 2023


brandond · 2023-10-12T04:19:30Z

brandond
Oct 12, 2023
Collaborator

K3s does not include or link against any Nvidia libraries, it just makes available to containerd the Nvidia container runtimes, if found on the host.

It sounds like you're still missing packages, and the Nvidia container runtimes are failing to run. I would refer you back to the Nvidia docs on what to install on your operating system.

0 replies

unixbird · 2023-10-12T04:31:08Z

unixbird
Oct 12, 2023
Author

> K3s does not include or link against any Nvidia libraries, it just makes available to containerd the Nvidia container runtimes, if found on the host.
>
> It sounds like you're still missing packages, and the Nvidia container runtimes are failing to run. I would refer you back to the Nvidia docs on what to install on your operating system.

Then why would it work perfectly fine on docker and on the host if I'm missing something? I checked on the host and I do have the shared object that it says it's missing.

5 replies

brandond Oct 12, 2023
Collaborator

Can you run the Nvidia container runtime binaries that containerd uses? Have you exported LD_LIBRARY_PATH or anything else like that within your shell to get things working?

unixbird Oct 12, 2023
Author

As far as Docker's containerd goes, it works, but beyond that I'm not sure.
I checked the config for the nvidia-ctk and found this:

ldconfig = "@/sbin/ldconfig"

unixbird Oct 13, 2023
Author

For reference, I changed that to "/sbin/ldconfig" and nothing really changed.

brandond Oct 13, 2023
Collaborator

I don't know what distro yotsugi 6.2.16-15-pve would be from. What OS is this, and what packages did you install? Are you sure that they are appropriate for your operating system? I'm not sure how much I can help here, given that all the components you're trying to use are part of the nvidia container runtime or operator, not k3s.

unixbird Oct 13, 2023
Author

For reference, this is Proxmox, which is more or less just Debian 12, and yotsugi is just the hostname. For background, I installed the drivers for the Nvidia card itself with the .run installer and installed the nvidia-container-runtime toolkit from the nvidia repos. I'm slowly narrowing this down: I set up kubeadm with CRI-O to test and received the same errors. I also realized I haven't rebooted the machine since setting up the nvidia-container-runtime, which seemed to help some people, so I'll report back if that helps.

unixbird · 2023-10-14T04:08:09Z

unixbird
Oct 14, 2023
Author

So I got it working (under kubeadm with CRI-O, but this should work with k3s as well). It turns out I'm not the only one who has run into this. As described here and partly in the k3s advanced docs, you need to do a couple of things after installing your GPU driver and setting up the nvidia-container-runtime:

1. Separate the nvidia RuntimeClass block (the kind: RuntimeClass document) from the example in the k3s advanced options docs and deploy it on its own.
2. Edit the nvidia-device-plugin from this yaml and add runtimeClassName: nvidia to its pod spec (I cut out most of the yaml, but this is where it should go):

    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critic>
      priorityClassName: "system-node-critical"
      runtimeClassName: nvidia
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
        name: nvidia-device-plugin-ctr
        env:

After you edit this, deploy it and everything should be detected. You can now run the example yaml from the k3s advanced docs, minus the nvidia runtime portion, and it will complete and give you the GPU benchmark stats.

All in all, this is not a k3s problem, but it seems there is some weird stuff that needs to be adjusted before you can run nvidia in Kubernetes at the moment.
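
For anyone following along, here is a rough sketch of the two manifests the steps above refer to: the nvidia runtime block deployed on its own, and a test pod that requests a GPU. This is modeled on the example in the k3s advanced docs as I understand it, not copied from this thread. The handler name, pod name, image tag, and env values are assumptions, so check them against the current docs and your containerd or CRI-O runtime configuration before applying.

```yaml
# Step 1: the RuntimeClass on its own.
# Assumption: the container runtime registers its NVIDIA handler under the name "nvidia".
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
# Test pod: the benchmark example without the RuntimeClass portion.
# Assumption: pod name and image are taken from the k3s docs example and may have changed.
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  # Same field that was added to the device plugin pod spec above.
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        # Stays Pending until the device plugin advertises this resource.
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
```

If the test pod stays Pending, the device plugin most likely has not registered nvidia.com/gpu on the node yet, which is the symptom described at the top of this thread.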

1 reply

neofob Jan 13, 2024

@unixbird: Thanks for the info. That patch, adding runtimeClassName: nvidia to the plugin yaml file, works for me in K3s.
