
Toolkit DaemonSet stuck in init phase after upgrade #567

Open
heilerich opened this issue Aug 9, 2023 · 7 comments

Comments

@heilerich

Symptoms

After upgrading the operator to v23.6.0, all deployments are stuck because the nvidia-container-toolkit-daemonset DaemonSet stays in the Init:0/1 phase indefinitely. The logs of the init stage indicate that the driver-validation container tries to execute modprobe for the nvidia kernel module, which fails because the container does not ship with the kernel modules (they are loaded by the driver DaemonSet).

Issue

I believe the issue is caused by the fixes for #430 / #485 introduced in 84ef9b3. The new module for creating the char devices also has a function to load the kernel modules. With 8906259 (ping @elezar) this is explicitly activated here:

devchar.WithLoadKernelModules(true),

As mentioned above, I believe this can't work in the validation container unless I am missing something, and I also do not understand why it would be necessary, since loading the modules is handled by the driver DaemonSet.

Proposed solution

I think the line above in the validator's main.go should be changed to devchar.WithLoadKernelModules(false). We have deployed a copy of the v23.6.0 container with this patch in our environment, and everything seems to work fine.

diff --git a/validator/main.go b/validator/main.go
index 4742834d..9fee8103 100644
--- a/validator/main.go
+++ b/validator/main.go
@@ -714,7 +714,7 @@ func createDevCharSymlinks(isHostDriver bool, driverRoot string) error {
                devchar.WithDevCharPath(hostDevCharPath),
                devchar.WithCreateAll(true),
                devchar.WithCreateDeviceNodes(true),
-               devchar.WithLoadKernelModules(true),
+               devchar.WithLoadKernelModules(false),
        )

Related

Issue #552 might also be caused by this. There v23.3.2 is used, which does not set devchar.WithLoadKernelModules explicitly, but I think the default value in that version might be true. I have not verified this, since we have never used v23.3.2.

@elezar
Member

elezar commented Aug 9, 2023

@cdesiniotis @shivamerla any thoughts on this? Would exposing this as a config option (possibly changing the default back to the previous value) make sense?

@heilerich do you have logs available that show the failure to load the kernel modules?

@heilerich
Author

The log file looks like this:

time="2023-08-09T20:58:41Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2023-08-09T20:58:41Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia: exit status 1; output=modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.15.122-flatcar\n\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""

@shivamerla
Contributor

We set the driverRoot to either /host or /run/nvidia/driver, which the nvidia-ctk library should chroot into and find the nvidia modules, so I am wondering why it is not able to find them.

@heilerich this was added for cases with a pre-installed driver, where all necessary modules might not be loaded when a GPU pod is scheduled or during a node reboot. With the driver container, we can skip this.

@elezar yes, it makes sense to add an option to disable this.
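For illustration, a minimal sketch of how such an opt-out could be wired in validator/main.go, assuming a hypothetical LOAD_KERNEL_MODULES environment variable (the variable name and this wiring are illustrative assumptions, not the actual gpu-operator implementation):

    package main

    import (
        "fmt"
        "os"
        "strconv"
    )

    // loadKernelModulesEnabled defaults to true to preserve the current behaviour
    // and only disables the module load when the (hypothetical) env var is set to
    // an explicit false value.
    func loadKernelModulesEnabled() bool {
        v, ok := os.LookupEnv("LOAD_KERNEL_MODULES")
        if !ok {
            return true
        }
        enabled, err := strconv.ParseBool(v)
        if err != nil {
            return true
        }
        return enabled
    }

    func main() {
        // In the validator, this boolean would replace the hardcoded value passed
        // to devchar.WithLoadKernelModules(true).
        fmt.Println("load kernel modules:", loadKernelModulesEnabled())
    }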

@shivamerla
Contributor

@heilerich just to confirm: in your case, is the driver pre-installed on the node with Flatcar Linux?

@heilerich
Author

> @heilerich just to confirm: in your case, is the driver pre-installed on the node with Flatcar Linux?

No, the driver is installed using the driver container.

> We set the driverRoot to either /host or /run/nvidia/driver, which the nvidia-ctk library should chroot into and find the nvidia modules, so I am wondering why it is not able to find them.

Ah, I totally missed the chroot. That must be it. The flatcar driver container does not copy the modules to /lib/modules in its root filesystem, but loads them from /opt/nvidia/${DRIVER_VERSION} using modprobe -b. That's why the modprobe fails even when chrooting into the driver root. So basically this is a mismatch between the expectation of what the (maintained) driver containers should look like and how the flatcar driver container actually works (I realise the flatcar container is not officially supported). So I think we can fix this by patching the driver container, of which we already maintain a fork because of various other issues.
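For illustration only, one way such a patched driver container could expose the modules to a plain chrooted modprobe is to install them under /lib/modules and rebuild the dependency index. The source path is taken from the description above, but the exact layout is an assumption, not the actual flatcar image:

    # Hypothetical sketch, not the actual flatcar driver-container patch.
    # Assumes the built .ko files sit directly under /opt/nvidia/${DRIVER_VERSION}.
    KVER="$(uname -r)"
    install -d "/lib/modules/${KVER}/kernel/drivers/video"
    cp "/opt/nvidia/${DRIVER_VERSION}"/*.ko "/lib/modules/${KVER}/kernel/drivers/video/"
    depmod "${KVER}"   # rebuild modules.dep so a plain `modprobe nvidia` can resolve the module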

> @elezar yes, it makes sense to add an option to disable this.

I guess we can close this issue, unless you still want to add said option. I still think it would make sense; at least I would use it, since this step is clearly unnecessary in our environment and a potential source of problems.

I also want to note that there does not seem to be an option to change the log level for the validator cmd (unless I missed something again). That would also have been helpful here :-)

Anyways, I appreciate the quick reaction.

@shivamerla
Contributor

@heilerich we will consider these enhancements. Thanks

@chiragjn

Any plans to take this up? We have a similar requirement: the ability to disable some of the validation init containers.
