gpu-operator breaks when upgrading EKS to K8s v1.30 #1220
We also plan to use pre-compiled driver images going forward, but again, no kernel version newer than 5.15 is available for those.
It's a bit confusing that the docs for the version you mentioned, under the supported operating systems and Kubernetes platforms here, say under the Cloud Service Providers tab that EKS is supported from v1.25 to v1.28. I really doubt that is the case, since it worked fine on v1.29 for you and only started failing with v1.30. If the document is correct, I will have to think multiple times before upgrading to any version beyond 1.28.
The compatibility issue, in my opinion, is the kernel version. Nvidia does not provide driver support (either normal or pre-compiled) for any kernel version > 5.15, and Ubuntu does not provide an AMI that is both compatible with k8s v1.30+ AND has kernel v5.15. The same is true for the pre-compiled drivers. So my question here is: is there NO WAY to run gpu-operator managed clusters reliably on k8s v1.30 and above?
Hi Runit, the standard driver container (using the CUDA runfile as the underlying executable) should normally work. For the precompiled driver container, today we only publish container images for Ubuntu 22.04 with the 5.15 kernel; we produce the precompiled driver images for a few main reasons. As for the doc issue, it's just a typo: we currently support 1.29 -> 1.31. Hope this helps.
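For anyone reading along, opting into the precompiled driver container is a chart-level switch. A minimal sketch, assuming the chart is installed from NVIDIA's Helm repo; the driver branch value below is illustrative, and a precompiled image must exist for the node's exact kernel (today, Ubuntu 22.04 with 5.15):

```sh
# Sketch: opt into the precompiled driver container via the Helm chart.
# "535" is an example driver branch; a matching precompiled image must exist
# for the node's exact kernel and OS.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.usePrecompiled=true \
  --set driver.version=535
```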
Hello @francisguillier, thank you for the clarification, it helps a lot.
The runfile will try to fetch the linux-headers-6.5.0-1020-aws package from the Canonical repo in order to proceed with the runtime compilation of the kernel modules. Can you check whether you have a Security Group that would prevent the EKS node from reaching out to us.archive.ubuntu.com?

For the question about producing precompiled driver container images for kernels other than 5.15: we follow the kernel release lifecycle from Canonical (https://ubuntu.com/kernel/lifecycle, non-HWE only), and 5.15 was the main kernel for Ubuntu 22.04.
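One way to test that reachability from the node itself is an ephemeral node debug pod, which shares the node's network namespace and therefore exercises the same egress path (Security Groups, NAT, proxies). A minimal sketch with a placeholder node name:

```sh
# Sketch: check outbound reachability to the Ubuntu archive from a GPU node.
# The debug pod runs in the node's host network namespace.
kubectl debug node/<gpu-node-name> -it --image=curlimages/curl -- \
  curl -sI http://us.archive.ubuntu.com/ubuntu/
```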
Hello @francisguillier, thank you for dedicating your time to this. You are correct, the standard version works with other kernel versions as well. When we added nodes with k8s v1.31-compatible Ubuntu 22.04 AMIs with kernel v6.8, the standard GPU driver container started working just fine. We were also able to build a custom pre-compiled driver image for kernel 6.8 and Ubuntu 22.04. I have just one more question: does Nvidia not support pre-compiled images for GCP nodes? I could not find any for the GCP kernel variant.
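For anyone going the same route, a custom-built precompiled image is typically consumed by overriding the driver image settings in the chart; a rough sketch with placeholder registry and version values (not an official NVIDIA image, and the tag layout must match what your build produced):

```sh
# Sketch: use a self-built precompiled driver image instead of NVIDIA's.
# Repository and driver version are placeholders for wherever the image was pushed.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.usePrecompiled=true \
  --set driver.repository=registry.example.com/nvidia \
  --set driver.version=550
```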
Hi Runit, glad to see the driver container works now in your environment. For the question about a precompiled driver container for the GCP/GKE kernel variant: we cannot build it, simply because the associated packages are not available on the Canonical repo servers. Only the generic, AWS, Azure, and Oracle variants are available, which is why we were able to produce precompiled driver images for those CSP-managed k8s services.
I am seeing almost the same logs as @runitmisra. I have a CCE cluster (k8s 1.29) on Open Telekom Cloud. The image I am using is Ubuntu 22 with kernel 5.15.0-53-generic, so, as far as I understand, this should actually work. I don't have many other options, as the cloud provider only offers HCE OS 2.0 and Euler OS 2.9 (CentOS based), which I think are not compatible with the GPU operator. Do you have any ideas @francisguillier?
@v1nc3nt27 Can you update the kernel version and try again? The linux-headers package for that kernel doesn't seem to be available anymore; the minimum generic kernel version that is still published appears to be newer than the one you are running.
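One quick way to confirm whether headers for a running kernel are still published is to ask apt directly from the affected node (or any Ubuntu 22.04 environment with the same sources); a minimal sketch:

```sh
# Sketch: check whether the headers package for the running kernel still exists
# in the configured apt repositories (run on the affected Ubuntu node).
apt-get update
apt-cache policy "linux-headers-$(uname -r)"
# If apt cannot locate the package, runtime compilation of the NVIDIA kernel
# modules by the driver container will fail.
```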
When @runitmisra and I were trying to fix this, we found that the ubuntu-eks k8s 1.30 AMI fails to run the driver container properly, but the 1.31 AMI runs it fine.
@tariq1890 I am limited to using the images as they come. There is just this one Ubuntu image with the old kernel, and I have no dedicated machines, just ephemeral node pools. Hence, I tried the workaround from https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-outdated-kernels.html by creating a ConfigMap and adding it to the driver configuration (roughly along the lines of the sketch below). However, this gets me the following error.

Do you know why the workaround is failing, and/or do you have any other idea that could help? Would it make sense to downgrade the GPU Operator/driver to some earlier version, perhaps? Thanks a lot!

@sanketnadkarni The add-ons are all running fine. Unfortunately, k8s 1.29 is the latest I have in my cloud.
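For reference, the workaround on that page generally amounts to handing the driver container a custom apt source list through a ConfigMap referenced by the chart's `driver.repoConfig` value. A minimal sketch with placeholder mirror URLs (they should point wherever the old kernel's packages are still hosted), not necessarily the exact setup used above:

```sh
# Sketch of the outdated-kernels workaround: provide a custom apt source list
# to the driver container via a ConfigMap referenced by driver.repoConfig.
cat > custom-repo.list <<'EOF'
deb http://archive.ubuntu.com/ubuntu jammy main universe
deb http://archive.ubuntu.com/ubuntu jammy-updates main universe
EOF

kubectl create configmap repo-config \
  --namespace gpu-operator \
  --from-file=custom-repo.list

helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.repoConfig.configMapName=repo-config
```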
We are running GPU operator on our EKS clusters and are working to upgrade them to v1.30 (and subsequently to v1.31). GPU nodes were working fine on Kubernetes v1.29.

We upgraded the cluster control plane to v1.30 and followed these steps to upgrade our GPU nodes:

We started seeing gpu-operator pods going into `Error` and `CrashLoopBackOff` states. We created a new nodepool on k8s v1.29 with older configs to reduce disruption of workloads, and kept one node on v1.30 for testing.

Basic details:
Here is the pod status on the v1.30 node:
These pods keep terminating, crashing and getting recreated over and over.
Here are some more logs and info that might help:
Describe of the `nvidia-operator-validator` pod - Events show this error:

Logs of driver daemonset pod:
Pod terminates and crashes after this.
I would appreciate any help in figuring out why this is happening and which AMI/kernel versions we can use to mitigate it.
The Ubuntu EKS AMIs come with kernel `5.15` and `6.5`. `5.15` no longer ships in AMIs for k8s v1.30, and we tested with `6.5` and it does not work. In the Amazon EKS specific docs, there is no mention of kernel version requirements; they state that as long as you have an Ubuntu 22.04 x86_64 image, you're good.
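For anyone triaging a similar mismatch, the kernel each node actually runs is reported by the kubelet and visible from the Kubernetes API, which makes it easy to compare AMI variants before and after an upgrade. A minimal sketch:

```sh
# Sketch: list each node's kernel version as reported by the kubelet,
# useful for spotting which AMIs ship 5.15 vs 6.x kernels.
kubectl get nodes -o wide            # see the KERNEL-VERSION column
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}'
```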