
gpu-operator breaks when upgrading EKS to K8s v1.30 #1220

Open
runitmisra opened this issue Jan 22, 2025 · 12 comments

@runitmisra

We are running the GPU operator on our EKS clusters and are working to upgrade them to v1.30 (and subsequently to v1.31). GPU nodes were working fine on Kubernetes v1.29.

We upgraded the cluster control plane to v1.30 and followed these steps to upgrade our GPU nodes:

  • Make sure the gpu-operator version is compatible with the Kubernetes version
  • Get a supported Ubuntu 22.04 EKS AMI from https://cloud-images.ubuntu.com/aws-eks/ as mentioned in this doc (an alternative lookup is sketched after this list)
  • Upgrade the GPU node group with the AMI obtained in the previous step. The AMI is Ubuntu 22.04 with kernel version 6.5 (we also tried an AMI with kernel version 6.8). Ideally, any Ubuntu 22.04 x86_64 AMI in the list should work with the gpu-operator just fine.
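
For reference, Canonical also publishes the latest Ubuntu EKS AMI IDs as SSM parameters, which is easier to script than browsing the image list. A rough sketch; the exact parameter path is an assumption on my part and should be verified against Canonical's documentation:

# Look up the latest Ubuntu 22.04 EKS AMI for k8s 1.30 in the current region.
# The parameter path below is assumed; adjust if Canonical's naming differs.
aws ssm get-parameters \
  --names /aws/service/canonical/ubuntu/eks/22.04/1.30/stable/current/amd64/hvm/ebs-gp2/ami-id \
  --query 'Parameters[0].Value' --output text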

We started seeing gpu-operator pods going into Error and CrashLoopBackOff states. To reduce disruption to workloads, we created a new node pool on k8s v1.29 with the older configuration and kept one node on v1.30 for testing.

Basic details:

Kubernetes version: v1.30
GPU Operator version: v24.6.2
GPU Driver version: v535.183.01
Ubuntu AMI Name: ubuntu-eks/k8s_1.30/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240526
Kernel Version: 6.5.0-1020-aws

Here is the pod status on the v1.30 node:

$ kubectl get pods -n gpu-operator --field-selector spec.nodeName=ip-10-2-87-217.ec2.internal                                                
NAME                                               READY   STATUS             RESTARTS      AGE
gpu-feature-discovery-r88m2                        0/2     Init:0/2           0             2m30s
gpu-operator-node-feature-discovery-worker-qh45g   1/1     Running            0             4h8m
nvidia-container-toolkit-daemonset-wnmsk           0/1     Init:0/1           0             2m30s
nvidia-dcgm-exporter-4q62c                         0/1     Init:0/1           0             2m30s
nvidia-device-plugin-daemonset-ngvbq               0/2     Init:0/2           0             2m30s
nvidia-driver-daemonset-n8m56                      0/1     CrashLoopBackOff   3 (36s ago)   2m33s
nvidia-operator-validator-mwp76                    0/1     Init:0/4           0             2m30s

These pods keep terminating, crashing and getting recreated over and over.

Here are some more logs and info that might help.
Describing the nvidia-operator-validator pod shows this error in Events:

Events:
  Type     Reason                  Age                From               Message
  ----     ------                  ----               ----               -------
  Normal   Scheduled               107s               default-scheduler  Successfully assigned gpu-operator/nvidia-operator-validator-fqt2z to ip-10-2-87-217.ec2.internal
  Warning  FailedCreatePodSandBox  1s (x9 over 107s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
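
For context, this sandbox error just means containerd on the node has no "nvidia" runtime registered yet. The container toolkit daemonset is what normally writes that entry, and it is still stuck in Init because the driver pod is crashing, so everything downstream cascades. Roughly the stanza the toolkit adds to /etc/containerd/config.toml once it runs (the paths shown are the operator defaults as I understand them, not something verified on this node):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"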

Logs of driver daemonset pod:

$ kubectl logs -n gpu-operator nvidia-driver-daemonset-bn5hn -f                                                                              
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-535.183.01
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.183.01........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 535.183.01 for Linux kernel version 6.5.0-1020-aws

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

Pod terminates and crashes after this.
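
The "Could not resolve Linux kernel version" step appears to be where the installer looks up the kernel headers package for the running kernel via apt. A minimal check, run directly on the affected node (or in a debug pod with an Ubuntu 22.04 userspace), to see whether that package resolves; the package name is simply derived from the kernel version above:

# On the node running kernel 6.5.0-1020-aws, check whether the headers
# package the driver container needs is resolvable from the configured mirrors.
sudo apt-get update
apt-cache policy "linux-headers-$(uname -r)"   # expect a non-empty Candidate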

I would appreciate any help in figuring out why this is happening and which AMI/kernel versions we can use to mitigate it.

  • Finding out the kernel version of any given AMI is difficult (a quick check against running nodes is sketched after this list). If there is a strict kernel version requirement, the documentation should mention where the user can find a proper AMI for it.
  • The documentation mentions kernel version requirements only once, listing 5.15 and 6.5. 5.15 no longer ships in AMIs for k8s v1.30, and 6.5, which we tested, does not work. The Amazon EKS-specific docs make no mention of kernel version requirements and state that as long as you have an Ubuntu 22.04 x86_64 image, you're good.
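
For what it's worth, once a node from a given AMI has joined the cluster, its kernel version is reported in the node status, so confirming what kernel an AMI actually ships is a one-liner:

# Kernel version as reported by the kubelet for every node
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion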
@runitmisra
Author

We also plan to use pre-compiled driver images going forward, but again, no kernel newer than 5.15 is supported there. We are having trouble finding an AMI that is Ubuntu 22.04, has the 5.15 kernel, and is compatible with k8s v1.30 and v1.31.

@mukulgit123

The docs are a bit confusing here: for the version you mentioned, the supported operating systems and Kubernetes platforms page, under the Cloud Service Providers tab, says EKS is supported only from v1.25 to v1.28. I doubt that is really the case, since it worked fine on v1.29 for you and only started failing with v1.30. But if the document is correct, I will have to think twice before upgrading to any version beyond 1.28.

[Screenshot: the docs' supported Kubernetes platforms table, Cloud Service Providers tab, listing EKS for v1.25-v1.28]

@runitmisra
Author

The compatibility issue, in my opinion, is the kernel version. NVIDIA does not provide driver support (either standard or pre-compiled) for any kernel version > 5.15, and Ubuntu does not provide an AMI that is both compatible with k8s v1.30+ AND ships kernel v5.15!

So my question here is: Is there NO WAY to run gpu-operator managed clusters reliably on k8s v1.30 and above??

@francisguillier
Contributor

Hi Runit,

The standard driver container (which uses the CUDA runfile as the underlying installer) should normally work.
Since GPU Operator 24.6, we have added logic to the driver container to support any kernel compiled with gcc 11 or 12.
That means kernels 5.15, 6.5, and 6.8 should all work with the standard driver container.
Could you give it a try?
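
A minimal values sketch for trying the standard (runfile-based) driver container; driver.usePrecompiled and driver.version are the chart keys as I recall them, so double-check against the chart values for your operator version:

driver:
  enabled: true
  usePrecompiled: false   # standard runfile-based driver container
  version: "535.183.01"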

For the precompiled driver container, today we only publish container images for Ubuntu 22.04 with the 5.15 kernel.
I guess you already know this: if you go to https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags and search for 'aws', you will see the precompiled container images we produce for the 5.15 AWS-variant kernel.

The main reasons we produce the precompiled driver images are:
  • secure boot
  • fully air-gapped environments
Would you be in one of those cases?

For the doc issue, it's just a typo. We currently support 1.29 -> 1.31.
Older versions (down to 1.24) should also work, since we don't rely on resources specific to a particular K8s release.

Hope this helps.

@runitmisra
Author

Hello @francisguillier, thank you for the clarification; it helps a lot.

  • I did try the standard driver container, and it fails as described in this issue's description. Please let me know if you need more information from my side.

  • We noticed that pre-compiled driver images are only available for Ubuntu 22.04 with the 5.15 kernel. The problem is that we cannot find an EKS AMI that ticks all the boxes for compatibility, i.e.:

    • Compatible with the k8s versions we want to upgrade to (v1.30 and v1.31)
    • Is Ubuntu 22.04
    • Has the 5.15 kernel.
      Images that have this kernel are either not compatible with those k8s versions or are Ubuntu 20.04.
  • Is there any reason NVIDIA does not produce pre-compiled driver images for kernels other than 5.15? AMIs that are compatible with newer k8s versions do not ship this kernel, and I am curious about the strict requirement.

  • I have also tried building a driver image myself by supplying the build arg for the proper kernel version, but it fails (a rough sketch of the kind of invocation I mean is below). I guess I'll have to raise that issue on the relevant repo: https://github.com/NVIDIA/gpu-driver-container
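
For context, the kind of build invocation I mean looks roughly like this; the build-arg names and build context directory are placeholders on my side and need to be checked against the Dockerfiles in that repo:

# Hypothetical precompiled-driver build against the gpu-driver-container repo;
# DRIVER_VERSION / KERNEL_VERSION arg names and the build directory are assumed.
git clone https://github.com/NVIDIA/gpu-driver-container && cd gpu-driver-container
docker build \
  --build-arg DRIVER_VERSION=535.183.01 \
  --build-arg KERNEL_VERSION=6.5.0-1020-aws \
  -t my-registry/nvidia-driver:535.183.01-6.5.0-1020-aws-ubuntu22.04 \
  ubuntu22.04/precompiled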

@francisguillier
Contributor

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 535.183.01 for Linux kernel version 6.5.0-1020-aws

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version

The runfile will try to fetch the linux-headers-6.5.0-1020-aws package from the Canonical repo in order to proceed with the runtime compilation of the kernel modules.
I can see the package exists on the Canonical repo:

$ apt search linux-headers | grep 6.5.0-1020-aws
linux-headers-6.5.0-1020-aws/jammy-updates,jammy-security 6.5.0-1020.20~22.04.1 amd64
$ apt-cache policy linux-headers-6.5.0-1020-aws
linux-headers-6.5.0-1020-aws:
  Installed: (none)
  Candidate: 6.5.0-1020.20~22.04.1
  Version table:
     6.5.0-1020.20~22.04.1 500
        500 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
        500 http://us.archive.ubuntu.com/ubuntu jammy-security/main amd64 Packages

Can you check if you have a Security Group that would prevent the EKS node from reaching out to us.archive.ubuntu.com?
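
A quick way to test that from the node side (plain shell, nothing operator-specific assumed):

# From the affected node (or a pod scheduled on it), confirm the Ubuntu
# archive is reachable and the package index can be refreshed.
getent hosts us.archive.ubuntu.com
curl -sI http://us.archive.ubuntu.com/ubuntu/ | head -n1   # expect an HTTP 200/301 status line
sudo apt-get update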

As for producing precompiled driver container images for kernels other than 5.15: we follow the kernel release lifecycle from Canonical (https://ubuntu.com/kernel/lifecycle, non-HWE only), and 5.15 was the main kernel for Ubuntu 22.04.
We will support Ubuntu 24.04 with the next release of the GPU Operator and will produce precompiled driver images for 6.8 kernels at that time.

@runitmisra
Author

runitmisra commented Feb 3, 2025

Hello @francisguillier, thank you for dedicating your time to this. You are correct: the standard driver container works with other kernel versions as well. When we added nodes with k8s v1.31-compatible Ubuntu 22.04 AMIs running kernel v6.8, the standard GPU driver container started working just fine.

We were also able to build a custom pre-compiled driver image for kernel 6.8 and Ubuntu 22.04.

I have just one more question: does NVIDIA not support pre-compiled images for GCP nodes? I could not find any GKE images on nvcr.io. Is it possible to build custom images for GCP kernels as well?

@francisguillier
Contributor

Hi Runit,

Glad to see the driver container now works in your environment.

As for a precompiled driver container for the GCP/GKE kernel variant: we cannot build it simply because the associated packages are not available on Canonical's repo servers. Only the generic, AWS, Azure, and Oracle variants are available, which is why we were able to produce precompiled driver images for those CSP-managed k8s services.

@v1nc3nt27

v1nc3nt27 commented Feb 6, 2025

I am seeing almost the same logs as @runitmisra.

DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-550.144.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.144.03........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.

WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.

========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 550.144.03 for Linux kernel version 5.15.0-53-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

I have a CCE cluster (k8s 1.29) on Open Telekom Cloud. The image I am using is Ubuntu 22.04 with kernel 5.15.0-53-generic. So, as far as I understand, this should actually work.

I don't have many other options, as the cloud provider only offers HCE OS 2.0 and Euler OS 2.9 (CentOS-based), which I think are not compatible with the GPU operator.

Do you have any ideas @francisguillier ?

@tariq1890
Contributor

tariq1890 commented Feb 6, 2025

@v1nc3nt27 Can you update the kernel version and try again?

The linux-headers package for that kernel doesn't seem to be available anymore:

$ apt-cache show linux-headers-5.15.0-53-generic
N: Unable to locate package linux-headers-5.15.0-53-generic
N: Couldn't find any package by glob 'linux-headers-5.15.0-53-generic'
N: Couldn't find any package by regex 'linux-headers-5.15.0-53-generic'
E: No packages found

The lowest generic kernel version still available seems to be 5.15.0-100-generic:

$ apt-cache show linux-headers-5.15.0-100-generic
Package: linux-headers-5.15.0-100-generic
Architecture: amd64
Version: 5.15.0-100.110
Priority: optional
Section: devel
Source: linux
Origin: Ubuntu
Maintainer: Ubuntu Kernel Team <[email protected]>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 24358
Provides: linux-headers, linux-headers-3.0
Depends: linux-headers-5.15.0-100, libc6 (>= 2.34), libelf1 (>= 0.142), libssl3 (>= 3.0.0~~alpha1), zlib1g (>= 1:1.2.3.3)
Filename: pool/main/l/linux/linux-headers-5.15.0-100-generic_5.15.0-100.110_amd64.deb
Size: 2858450
MD5sum: 5ecdb13f38c0106f0a65588af6702c49
SHA1: d79fcd44bf91cca0e0f7ff0efc6d9c49156f12af
SHA256: fee43b80b93f46d00757534a5f2954471db3d97d2223524a5dc8017a4c3ac2fd
SHA512: 4872bb8f1d20b0006f9e3c36d58fc526604963ae230c3489ec026e56a35d611a3f97f924850831f04b77e48b33e80467a2697f0d74a0e3d9b814ad12c0aff374
Description-en: Linux kernel headers for version 5.15.0 on 64 bit x86 SMP
 This package provides kernel header files for version 5.15.0 on
 64 bit x86 SMP.
 .
 This is for sites that want the latest kernel headers.  Please read
 /usr/share/doc/linux-headers-5.15.0-100/debian.README.gz for details.
Description-md5: 87a356f8838d1ead10dec511f0dda686

@tariq1890 reopened this Feb 6, 2025
@sanketnadkarni

When @runitmisra and I were trying to fix this, we found that the ubuntu-eks k8s 1.30 AMI fails to run the driver container properly, but the 1.31 AMI runs it fine.
We were testing this in an EKS cluster; when using Ubuntu 22.04, make sure add-ons like kube-proxy, coredns, and amazon-vpc-cni are at compatible versions (a quick check is sketched below).
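
As a quick check, the AWS CLI can list which add-on versions are compatible with a given Kubernetes version; a small sketch (the add-on names below are the standard EKS ones, adjust for your cluster):

# List the latest compatible version of each core EKS add-on for Kubernetes 1.30
for addon in kube-proxy coredns vpc-cni; do
  aws eks describe-addon-versions \
    --kubernetes-version 1.30 \
    --addon-name "$addon" \
    --query 'addons[0].addonVersions[0].addonVersion' --output text
done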

@v1nc3nt27

v1nc3nt27 commented Feb 7, 2025

@tariq1890 I am limited to using the images as they come. There is just this one Ubuntu image with the old kernel, and I have no dedicated machines, just ephemeral node pools. Hence, I tried this workaround https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-outdated-kernels.html by creating a ConfigMap

apiVersion: v1
data:
  custom-ubuntu-repo.list: |
    deb http://archive.ubuntu.com/ubuntu/ jammy main restricted universe multiverse
    deb http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted universe multiverse
    deb http://archive.ubuntu.com/ubuntu/ jammy-security main restricted universe multiverse
kind: ConfigMap
metadata:
  name: repo-config
  namespace: gpu-operator

and adding

driver:
  repoConfig:
    configMapName: repo-config
    destinationDir: /etc/apt/sources.list.d

However, this gets me the following:

========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 550.144.03 for Linux kernel version 5.15.0-53-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
W: Target Packages (main/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:1 and /etc/apt/sources.list.d/sources.list:1
W: Target Packages (main/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:1 and /etc/apt/sources.list.d/sources.list:1
W: Target Packages (universe/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:1 and /etc/apt/sources.list.d/sources.list:1
W: Target Packages (universe/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:1 and /etc/apt/sources.list.d/sources.list:1
W: Target Packages (main/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:2 and /etc/apt/sources.list.d/sources.list:2
W: Target Packages (main/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:2 and /etc/apt/sources.list.d/sources.list:2
W: Target Packages (universe/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:2 and /etc/apt/sources.list.d/sources.list:2
W: Target Packages (universe/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:2 and /etc/apt/sources.list.d/sources.list:2
W: Target Packages (main/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:3 and /etc/apt/sources.list.d/sources.list:3
W: Target Packages (main/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:3 and /etc/apt/sources.list.d/sources.list:3
W: Target Packages (universe/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:3 and /etc/apt/sources.list.d/sources.list:3
W: Target Packages (universe/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:3 and /etc/apt/sources.list.d/sources.list:3
W: Target Packages (main/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:1 and /etc/apt/sources.list.d/sources.list:1
W: Target Packages (main/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:1 and /etc/apt/sources.list.d/sources.list:1
W: Target Packages (universe/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:1 and /etc/apt/sources.list.d/sources.list:1
W: Target Packages (universe/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:1 and /etc/apt/sources.list.d/sources.list:1
W: Target Packages (main/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:2 and /etc/apt/sources.list.d/sources.list:2
W: Target Packages (main/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:2 and /etc/apt/sources.list.d/sources.list:2
W: Target Packages (universe/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:2 and /etc/apt/sources.list.d/sources.list:2
W: Target Packages (universe/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:2 and /etc/apt/sources.list.d/sources.list:2
W: Target Packages (main/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:3 and /etc/apt/sources.list.d/sources.list:3
W: Target Packages (main/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:3 and /etc/apt/sources.list.d/sources.list:3
W: Target Packages (universe/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:3 and /etc/apt/sources.list.d/sources.list:3
W: Target Packages (universe/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:3 and /etc/apt/sources.list.d/sources.list:3
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

Do you know why the workaround is failing, and/or do you have any other ideas that could help? Would it perhaps make sense to downgrade the GPU Operator/driver to some version? Thanks a lot!

@sanketnadkarni The add-ons are all running fine. Unfortunately k8s 1.29 is the latest I have in my cloud.
