Toolkit DaemonSet stuck in init phase after upgrade #567
Comments
@cdesiniotis @shivamerla any thoughts on this? Would exposing this as a config option (possibly changing the default back to the previous value) make sense? @heilerich do you have logs available that show the failure to load the kernel modules?
Log file looks like this
@heilerich this was added for cases with a pre-installed driver, where all necessary modules might not be loaded when a GPU pod is scheduled or during a node reboot. With the driver-container, we can skip this. @elezar yes, it makes sense to add an option to disable this.
@heilerich just to confirm, in your case is the driver pre-installed on the node with Flatcar Linux?
No, the driver is installed using the driver-container.
Ah, I totally missed the chroot. That must be it. The Flatcar driver-container does not copy the modules to
I guess we can close this issue, unless you still want to add said option. I still think it would make sense; at least I would use it, since this step is clearly unnecessary in our environment and a potential source of problems. I also want to note that there does not seem to be an option to change the log level for the validator cmd (unless I missed something again). That would also have been helpful here :-) Anyways, I appreciate the quick reaction.
@heilerich we will consider these enhancements. Thanks.
Any plans to take this up? We have a similar requirement: being able to disable some of the validation init containers.
Symptoms
After the upgrade to v23.6.0 of the operator all deployments are stuck because the `nvidia-container-toolkit-daemonset` DaemonSet stays in the `Init:0/1` phase indefinitely. The logs of the init stage indicate that the driver-validation container is trying to execute modprobe on the nvidia kmod, which fails because the container does not ship with the kernel modules (they are loaded by the driver daemonset).
Issue
I believe the issue is caused by the fixes to #430 / #485 introduced in 84ef9b3. The new module for creating the char devices also has a function to load the kernel modules. With 8906259 (ping @elezar) this is explicitly activated here:
gpu-operator/validator/main.go, line 717 in 25d6f8d
As mentioned above, I believe this can't work in the validation container (unless I am missing something), and I also do not understand why it would be necessary, since loading the modules is handled by the driver daemonset.
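For illustration, here is a minimal, self-contained Go sketch of the functional-options pattern described above. The real devchar package lives in the NVIDIA container toolkit; everything below except the `WithLoadKernelModules` option name (which comes from this issue) is a stand-in, not the actual API.

```go
// Stand-in sketch of the call the validator makes in v23.6.0, as described above.
// NewSymlinkCreator, the Option type, and the struct fields are illustrative
// placeholders; only the WithLoadKernelModules option name is taken from this issue.
package main

import "fmt"

type symlinkCreator struct {
	loadKernelModules bool
}

type Option func(*symlinkCreator)

// WithLoadKernelModules controls whether the creator tries to modprobe the
// NVIDIA kernel modules before creating the /dev/char symlinks.
func WithLoadKernelModules(load bool) Option {
	return func(c *symlinkCreator) { c.loadKernelModules = load }
}

func NewSymlinkCreator(opts ...Option) *symlinkCreator {
	c := &symlinkCreator{}
	for _, o := range opts {
		o(c)
	}
	return c
}

func main() {
	// Behaviour described in this issue: module loading is enabled, which fails
	// in the validation container because it ships no kernel modules.
	c := NewSymlinkCreator(WithLoadKernelModules(true))
	fmt.Println("loadKernelModules:", c.loadKernelModules)
}
```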
Proposed solution
I think the code above in the validator's main.go should be set to `devchar.WithLoadKernelModules(false)`. We have deployed a copy of the v23.6.0 container with this patch in our environment and everything seems to work fine.
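Reusing the stand-in types from the sketch above, the proposed change, and the kind of opt-out knob discussed in the comments, could look roughly like this. The environment variable name is made up for illustration and is not an existing gpu-operator option.

```go
// Fragment reusing the stand-in NewSymlinkCreator/WithLoadKernelModules from
// the sketch above (and assuming `import "os"`).

// Proposed default from this issue: never modprobe from the validator, since
// the driver daemonset already loads the modules.
creator := NewSymlinkCreator(WithLoadKernelModules(false))

// Possible opt-in knob, as discussed in the comments. The environment variable
// name LOAD_KERNEL_MODULES is hypothetical, not an existing option.
loadModules := os.Getenv("LOAD_KERNEL_MODULES") == "true"
creator = NewSymlinkCreator(WithLoadKernelModules(loadModules))

_ = creator // the real validator would go on to create the /dev/char symlinks
```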
Related
Issue #552 might also be caused by this. There v23.3.2 is used, which does not set `devchar.WithLoadKernelModules` explicitly, but I think the default value in that version might be `true`. I have not verified this, since we have never used v23.3.2.