Nvidia-driver-daemonset stuck in CrashLoopBackOff #1016

Open
CarlGJ opened this issue Sep 27, 2024 · 0 comments


Platform: OpenShift 4.12
Version of the GPU Operator: 22.9.2
GPU: Tesla T4
Problem: After the ClusterPolicy is created, the driver daemonset pod enters CrashLoopBackOff with the logs below (complete logs are attached).
This is a summary of what I saw wrong in the logs:

Library errors

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.
WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.

Entitlement error

This system is not registered with an entitlement server. You can use subscription-manager to register

Errors downloading RHEL 8 repository metadata

Errors during downloading metadata for repository 'rhel-8-for-x86_64-baseos-eus-rpms':
- Status code: 404 for https://cdn.redhat.com/content/eus/rhel8/8.10/x86_64/baseos/os/repodata/repomd.xml (IP: 88.221.44.251)
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-baseos-eus-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
+ dnf config-manager --set-disabled rhel-8-for-x86_64-baseos-eus-rpms
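
One detail that may help with triage: the 8.10 in the failing EUS URL matches the el8_10 suffix of the kernel release that appears in the build command later in this log. Below is a minimal Python sketch of that comparison, assuming the two strings are copied by hand from the log. The helper names are hypothetical and are not part of the GPU Operator or dnf.

```python
import re

def rhel_version_from_kernel(release: str):
    """Return (major, minor) from a kernel release like '...el8_10.x86_64', or None."""
    m = re.search(r"\.el(\d+)_(\d+)\.", release)
    return (int(m.group(1)), int(m.group(2))) if m else None

def rhel_version_from_eus_url(url: str):
    """Return (major, minor) from the '/eus/rhel<major>/<major>.<minor>/' part of a CDN URL, or None."""
    m = re.search(r"/eus/rhel(\d+)/(\d+)\.(\d+)/", url)
    return (int(m.group(2)), int(m.group(3))) if m else None

# Both strings are taken verbatim from the log in this issue.
kernel = "4.18.0-553.16.1.el8_10.x86_64"
url = "https://cdn.redhat.com/content/eus/rhel8/8.10/x86_64/baseos/os/repodata/repomd.xml"

print(rhel_version_from_kernel(kernel), rhel_version_from_eus_url(url))
```

Both helpers return (8, 10) for these strings, so the repository lookup and the kernel headers refer to the same RHEL minor release. Whether an EUS channel is actually published for that release is something to confirm against Red Hat's documentation.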

Compilation errors

+ make -s -j SYSSRC=/lib/modules/4.18.0-553.16.1.el8_10.x86_64/build nv-linux.o nv-modeset-linux.o
/usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_channel_test.c: In function 'test_unexpected_completed_values':
/usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_channel_test.c:156:15: warning: unused variable 'status' [-Wunused-variable]
NV_STATUS status;
^~~~~~
/usr/src/nvidia-525.60.13/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
/usr/src/nvidia-525.60.13/kernel/nvidia-drm/nvidia-drm-crtc.c:678:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
struct nv_drm_plane_state *nv_drm_plane_state =
^~~~~~
/usr/src/nvidia-525.60.13/kernel/nvidia-drm/nvidia-drm-drv.c: In function 'nv_drm_init_mode_config':
/usr/src/nvidia-525.60.13/kernel/nvidia-drm/nvidia-drm-drv.c:262:22: error: 'struct drm_mode_config' has no member named 'fb_base'; did you mean 'fb_list'?
dev->mode_config.fb_base = 0;
^~~~~~~
fb_list
make[2]: *** [scripts/Makefile.build:317: /usr/src/nvidia-525.60.13/kernel/nvidia-drm/nvidia-drm-drv.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/usr/src/nvidia-525.60.13/kernel/nvidia-drm/nvidia-drm-connector.c: In function '__nv_drm_detect_encoder':
/usr/src/nvidia-525.60.13/kernel/nvidia-drm/nvidia-drm-connector.c18: error: 'struct drm_connector' has no member named 'override_edid'
if (connector->override_edid) {

The pod then unloads the driver and restarts.
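
To separate the fatal lines from the warnings in compiler output like the excerpt above, a short Python sketch can filter on gcc's `error:` marker. The function name and the inlined two-line sample are illustrative assumptions; in practice the text would come from the saved pod log.

```python
import re

# gcc diagnostics look like "<file>:<line>:<col>: error: <message>".
# The location is matched loosely because pasted logs may be garbled.
ERROR_RE = re.compile(r"^(?P<location>\S+): error: (?P<message>.+)$")

def compile_errors(log_text: str):
    """Return (location, message) pairs for every gcc error line in log_text."""
    errors = []
    for line in log_text.splitlines():
        m = ERROR_RE.match(line.strip())
        if m:
            errors.append((m.group("location"), m.group("message")))
    return errors

# Two lines copied from the log in this issue: one warning, one error.
sample = """\
/usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_channel_test.c:156:15: warning: unused variable 'status' [-Wunused-variable]
/usr/src/nvidia-525.60.13/kernel/nvidia-drm/nvidia-drm-drv.c:262:22: error: 'struct drm_mode_config' has no member named 'fb_base'; did you mean 'fb_list'?
"""

for location, message in compile_errors(sample):
    print(location, "->", message)
```

Run against the full excerpt, this should report only the two `nvidia-drm` errors (`fb_base` and `override_edid`) and skip the warnings and the `make[2]` lines, which makes it easier to see that the build fails in the DRM module sources.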

Driver_daemonset.docx
