BREAKS ON 1.25: Does not work on k8s 1.25 due to node API deprecation #458
Comments
According to this #401 (comment) |
@sfxworks what version of GPU Operator are you using? We migrated to |
devel-ubi8 according to https://github.com/NVIDIA/gpu-operator/blob/master/deployments/gpu-operator/values.yaml#L50 |
The tag you linked worked. Though now other images are having issues with their defaults
Is there a publicly viewable way to see your registry's tags to resolve this quicker? They just time out. |
..
containers:
- args:
  - init
  command:
  - nvidia-driver
  image: nvcr.io/nvidia/driver:latest-
  imagePullPolicy: IfNotPresent
  name: nvidia-driver-ctr
  resources: {}
  securityContext:
    privileged: true
|
It doesn't like my kernel anyway I guess :/
|
Switching the machine over to the linux kernel instead of linux-hardened, with the above adjustments, seems to have worked. Between then and now I did not have to adjust the daemonset either.
|
@sfxworks for installing the latest helm charts, please refer to: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-nvidia-gpu-operator. We append a |
I have |
I just installed GPU Operator with helm version
When I change GPU Operator to |
I'm also running into this issue ... on both |
Same error for us with EKS 1.27 and Ubuntu 22 |
Same error with release 23.3.1. Any solution? |
Also running into this on Amazon Linux 2. Is there any known solution or workaround, or is something missing in the docs? I'm going to try overriding the API version or look at the daemonset values next.
Release v24.6.1, image nvcr.io/nvidia/gpu-operator:devel-ubi8
ERROR controller.clusterpolicy-controller Reconciler error {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}
kubectl describe node GPU-NODE | grep system node: |
A few node toleration changes got me past that, and I switched to the helm chart from the official helm repo. Now the operator seems to be looking for an image that does not exist and fails to pull it: ImagePullBackOff (Back-off pulling image "nvcr.io/nvidia/driver:550.90.07-amzn2"). Which image do you recommend for an Amazon Linux 2 node, and where do I specify it instead of letting the operator derive it dynamically from the node?
Edit: after digging through the documentation, I'm now looking into https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html |
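To see where a tag like 550.90.07-amzn2 comes from, here is a minimal sketch. It assumes the tag is the driver version followed by the node's os-release ID and VERSION_ID, which is consistent with the pull error above but is not confirmed in this thread. The helper name driver_image_ref is hypothetical and only illustrates the composition.

```shell
# Hypothetical helper: compose a driver image reference as
#   <repository>:<driver version>-<os ID><os VERSION_ID>
# This layout is an assumption inferred from the ImagePullBackOff message above.
driver_image_ref() {
  # $1: image repository   (e.g. nvcr.io/nvidia/driver)
  # $2: driver version     (e.g. 550.90.07)
  # $3: os-release ID      (e.g. amzn)
  # $4: os-release VERSION_ID (e.g. 2)
  printf '%s:%s-%s%s\n' "$1" "$2" "$3" "$4"
}

# The OS values for a node can be looked up from its labels, for example:
#   kubectl describe node GPU-NODE | grep system-os_release
driver_image_ref nvcr.io/nvidia/driver 550.90.07 amzn 2
```

If the composed reference does not exist in the registry, the pull fails exactly as reported, so comparing this string against the tags that are actually published is a quick first check.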
1. Quick Debug Checklist
i2c_core and ipmi_msghandler loaded on the nodes?
kubectl describe clusterpolicies --all-namespaces
1. Issue or feature description
Deployed with helm, the operator attempts to reference a deprecated API object which prevents deployment.
As noted in https://kubernetes.io/docs/reference/using-api/deprecation-guide/#runtimeclass-v125
RuntimeClass in node.k8s.io is now v1; the v1beta1 version is no longer served as of v1.25.
The operator cannot reconcile, and deployment of a pod requesting a GPU fails as a result.
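A quick way to confirm this on a given cluster is to list which node.k8s.io versions the API server actually serves. The sketch below filters the output of kubectl api-versions; the function names are hypothetical helpers, not part of kubectl or the operator.

```shell
# Print the node.k8s.io group/versions found in `kubectl api-versions` output.
# Reads one group/version per line on stdin.
served_node_api_versions() {
  grep -E '^node\.k8s\.io/' | sort -u
}

# Succeed only if node.k8s.io/<version> is among the served versions.
# $1: version to look for (e.g. v1beta1); stdin: `kubectl api-versions` output.
serves_node_api_version() {
  served_node_api_versions | grep -xF "node.k8s.io/$1" >/dev/null
}

# Usage against a live cluster:
#   kubectl api-versions | served_node_api_versions
#   kubectl api-versions | serves_node_api_version v1beta1 || echo "v1beta1 not served"
```

On a 1.25+ cluster only node.k8s.io/v1 should be listed, which matches the "no matches for kind RuntimeClass in version node.k8s.io/v1beta1" reconcile error.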
2. Steps to reproduce the issue