
Toolkit DaemonSet stuck in init phase after upgrade #567

Open
heilerich opened this issue Aug 9, 2023 · 7 comments

Comments

@heilerich

Symptoms

After upgrading the operator to v23.6.0, all deployments are stuck because the nvidia-container-toolkit-daemonset DaemonSet stays in the Init:0/1 phase indefinitely. The logs of the init stage indicate that the driver-validation container tries to execute modprobe for the nvidia kernel module, which fails because the container does not ship with the kernel modules (they are loaded by the driver DaemonSet).

Issue

I believe the issue is caused by the fixes for #430 / #485 introduced in 84ef9b3. The new module for creating the char devices also has a function to load the kernel modules. With 8906259 (ping @elezar) this is explicitly activated here:

devchar.WithLoadKernelModules(true),

As mentioned above, I believe this can't work in the validation container unless I am missing something, and I also do not understand why it would be necessary, since loading the modules is handled by the driver DaemonSet.

Proposed solution

I think the line above in the validator's main.go should be changed to devchar.WithLoadKernelModules(false). We have deployed a copy of the v23.6.0 container with this patch in our environment, and everything seems to work fine.

diff --git a/validator/main.go b/validator/main.go
index 4742834d..9fee8103 100644
--- a/validator/main.go
+++ b/validator/main.go
@@ -714,7 +714,7 @@ func createDevCharSymlinks(isHostDriver bool, driverRoot string) error {
                devchar.WithDevCharPath(hostDevCharPath),
                devchar.WithCreateAll(true),
                devchar.WithCreateDeviceNodes(true),
-               devchar.WithLoadKernelModules(true),
+               devchar.WithLoadKernelModules(false),
        )

Related

Issue #552 might also be caused by this. There v23.3.2 is used, which does not set devchar.WithLoadKernelModules explicitly, but I think the default value in that version might be true. I have not verified this, since we have never used v23.3.2.

@elezar
Member

elezar commented Aug 9, 2023

@cdesiniotis @shivamerla any thoughts on this? Would exposing this as a config option (possibly changing the default back to the previous value) make sense?

@heilerich do you have logs available that show the failure to load the kernel modules?

@heilerich
Author

The log file looks like this:

time="2023-08-09T20:58:41Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2023-08-09T20:58:41Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia: exit status 1; output=modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.15.122-flatcar\n\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""

@shivamerla
Contributor

We set the driverRoot to either /host or /run/nvidia/driver, which the nvidia-ctk library should chroot into and find the nvidia modules, so I am wondering why it is not able to find them.

@heilerich this was added for cases with a pre-installed driver, where all necessary modules might not be loaded when a GPU pod is scheduled or during a node reboot. With the driver container, we can skip this.

@elezar yes, it makes sense to add an option to disable this.
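For illustration, a minimal sketch of how such an opt-out could be wired in validator/main.go, assuming a hypothetical LOAD_KERNEL_MODULES environment variable (the variable name and this wiring are illustrative assumptions, not the actual gpu-operator implementation):

    package main

    import (
        "fmt"
        "os"
        "strconv"
    )

    // loadKernelModulesEnabled defaults to true to preserve the current behaviour
    // and only disables the module load when the (hypothetical) env var is set to
    // an explicit false value.
    func loadKernelModulesEnabled() bool {
        v, ok := os.LookupEnv("LOAD_KERNEL_MODULES")
        if !ok {
            return true
        }
        enabled, err := strconv.ParseBool(v)
        if err != nil {
            return true
        }
        return enabled
    }

    func main() {
        // In the validator, this boolean would replace the hardcoded value passed
        // to devchar.WithLoadKernelModules(true).
        fmt.Println("load kernel modules:", loadKernelModulesEnabled())
    }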

@shivamerla
Contributor

@heilerich just to confirm: in your case, is the driver pre-installed on the node with Flatcar Linux?

@heilerich
Author

> @heilerich just to confirm: in your case, is the driver pre-installed on the node with Flatcar Linux?

No, the driver is installed using the driver container.

> We set the driverRoot to either /host or /run/nvidia/driver, which the nvidia-ctk library should chroot into and find the nvidia modules, so I am wondering why it is not able to find them.

Ah, I totally missed the chroot. That must be it. The flatcar driver container does not copy the modules to /lib/modules in its root filesystem, but loads them from /opt/nvidia/${DRIVER_VERSION} using modprobe -b. That's why the modprobe fails even when chrooting into the driver root. So basically this is a mismatch between the expectation of what the (maintained) driver containers should look like and how the flatcar driver container actually works (I realise the flatcar container is not officially supported). So I think we can fix this by patching the driver container, of which we already maintain a fork because of various other issues.
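For illustration only, one way such a patched driver container could expose the modules to a plain chrooted modprobe is to install them under /lib/modules and rebuild the dependency index. The source path is taken from the description above, but the exact layout is an assumption, not the actual flatcar image:

    # Hypothetical sketch, not the actual flatcar driver-container patch.
    # Assumes the built .ko files sit directly under /opt/nvidia/${DRIVER_VERSION}.
    KVER="$(uname -r)"
    install -d "/lib/modules/${KVER}/kernel/drivers/video"
    cp "/opt/nvidia/${DRIVER_VERSION}"/*.ko "/lib/modules/${KVER}/kernel/drivers/video/"
    depmod "${KVER}"   # rebuild modules.dep so a plain `modprobe nvidia` can resolve the module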

> @elezar yes, it makes sense to add an option to disable this.

I guess we can close this issue, unless you still want to add said option. I still think it would make sense; at least I would use it, since this step is clearly unnecessary in our environment and a potential source of problems.

I also want to note that there does not seem to be an option to change the log level for the validator cmd (unless I missed something again). That would also have been helpful here :-)

Anyways, I appreciate the quick reaction.

@shivamerla
Contributor

@heilerich we will consider these enhancements. Thanks

@chiragjn

Any plans to take this up? We have a similar requirement: the ability to disable some of the validation init containers.
