container-toolkit fails to start after upgrading to v24.9.0 on k3s cluster #1109
Comments
It makes sense that the command it is trying to run fails, there is no …
Looks like the fix is in progress here: NVIDIA/nvidia-container-toolkit#777
Hi @logan2211, thanks for reporting this issue. It is on our radar and we are working on getting a fix out for this. We recently switched to fetching the currently applied container runtime configuration via CLI (e.g. …)
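Presumably the kind of CLI lookup being referred to is something along these lines; the exact command is an assumption for illustration, not taken from the comment:

```bash
# Assumed illustration only: reading the active containerd configuration
# via the runtime's own CLI instead of parsing the config file directly.
# On a stock install this works because `containerd` is on the PATH; on
# k3s/RKE2 the binary lives under /var/lib/rancher/..., so such a lookup
# can fail inside the toolkit container.
containerd config dump
```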
Thanks, we are trying to upgrade the cluster urgently due to the CVEs. I suppose one possible workaround may be to downgrade gpu-operator to v24.6.2 and override the driver version to 550.127.05?

Edit: after testing, the proposed workaround (downgrading gpu-operator and pinning the driver version in the values) seems to work fine; just noting this for anyone else experiencing the issue.
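For reference, a minimal sketch of that downgrade workaround as a Helm command; the release name, namespace, and repo alias are placeholders, not taken from the thread:

```bash
# Sketch only: pin the chart to v24.6.2 and the driver to 550.127.05.
# Assumes the chart is installed as release "gpu-operator" in the
# "gpu-operator" namespace from the "nvidia" Helm repo alias.
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --version v24.6.2 \
  --reuse-values \
  --set driver.version=550.127.05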
This should work. Alternatively, you can stick with GPU Operator v24.9.0 and downgrade the NVIDIA Container Toolkit to 1.16.2, which does not contain this change.
NVIDIA Container Toolkit 1.17.1 is now available and contains a fix for this issue: https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.17.1

I would recommend overriding the NVIDIA Container Toolkit version to 1.17.1 by configuring …
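A minimal sketch of that override, assuming the chart's `toolkit.version` value; the image tag suffix (e.g. -ubuntu20.04) is an assumption, so check the published container-toolkit image tags before applying. The same field can be used to pin 1.16.2 as suggested earlier:

```bash
# Sketch only: keep GPU Operator v24.9.0 but override the toolkit image.
# Release name, namespace, repo alias, and tag suffix are assumptions.
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --version v24.9.0 \
  --reuse-values \
  --set toolkit.version=v1.17.1-ubuntu20.04
```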
This approach can solve this problem :)

Create /usr/local/bin/containerd (e.g. with `vi /usr/local/bin/containerd`) containing:

```bash
#!/bin/bash
/var/lib/rancher/rke2/bin/containerd --config /var/lib/rancher/rke2/agent/etc/containerd/config.toml "$@"
```

and make it executable:

```bash
sudo chmod +x /usr/local/bin/containerd
```
This is effectively a continuation of #1099, but I cannot re-open that issue, so I am opening a new one.
I am experiencing the same problem while attempting to upgrade from v24.6.0 to v24.9.0 on a k3s cluster. Perhaps it is a bad interaction between this recent commit and the non-standard CONTAINERD_* paths required for gpu-operator + k3s, specified in my cluster's values as:
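(The values block itself is not reproduced here. For illustration only, and not necessarily this cluster's exact settings, the k3s-specific toolkit environment that NVIDIA documents looks roughly like the following.)

```bash
# Illustration only: typical k3s overrides for the container toolkit,
# written to a values file. Paths are the standard k3s locations.
cat <<'EOF' > k3s-toolkit-values.yaml
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
EOF
```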
The pod log:
I confirmed that gpu-operator is setting the correct CONTAINERD_* paths according to my values:
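For anyone repeating this check, a minimal sketch, assuming the default `gpu-operator` namespace and the `nvidia-container-toolkit-daemonset` name:

```bash
# Inspect the CONTAINERD_* environment injected into the toolkit daemonset.
# Namespace and daemonset name assume the operator defaults; adjust if needed.
kubectl -n gpu-operator describe daemonset nvidia-container-toolkit-daemonset \
  | grep CONTAINERD
```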