
container-toolkit fails to start after upgrading to v24.9.0 on k3s cluster #1109

Open
logan2211 opened this issue Nov 7, 2024 · 7 comments
Labels
bug Issue/PR to expose/discuss/fix a bug

Comments

@logan2211

This is effectively a continuation of #1099, but I cannot re-open that issue, so opening a new one.

I am experiencing the same problem while attempting to upgrade from v24.6.0 to v24.9.0 on a k3s cluster. This may be a bad interaction between this recent commit and the non-standard CONTAINERD_* paths required for gpu-operator+k3s, which are specified in my cluster's values as:

    toolkit:
      env:
      - name: CONTAINERD_CONFIG
        value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
      - name: CONTAINERD_SOCKET
        value: /run/k3s/containerd/containerd.sock
      - name: CONTAINERD_SET_AS_DEFAULT
        value: "false"

The pod log:

nvidia-container-toolkit-ctr IS_HOST_DRIVER=false
nvidia-container-toolkit-ctr NVIDIA_DRIVER_ROOT=/run/nvidia/driver
nvidia-container-toolkit-ctr DRIVER_ROOT_CTR_PATH=/driver-root
nvidia-container-toolkit-ctr NVIDIA_DEV_ROOT=/run/nvidia/driver
nvidia-container-toolkit-ctr DEV_ROOT_CTR_PATH=/driver-root
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Parsing arguments"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Starting nvidia-toolkit"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="disabling device node creation since --cdi-enabled=false"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Verifying Flags"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg=Initializing
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Shutting Down"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=error msg="error running nvidia-toolkit: unable to determine runtime options: unable to load containerd config: failed to load config: failed to run command chroot [/host containerd config dump]: exit status 127"

I confirmed that gpu-operator is setting the correct CONTAINERD_* paths according to my values:

  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rancher/k3s/agent/etc/containerd
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/k3s/containerd
    HostPathType:
@logan2211
Author

It makes sense that the command it is trying to run fails: there is no containerd binary on the host, so chroot /host containerd config dump is expected to fail on a k3s cluster.
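
A quick way to confirm this on a k3s node (the data-directory path below reflects a typical default k3s install and is an assumption; adjust as needed):

    # containerd is not on the host PATH, so the toolkit's chroot'd call fails
    command -v containerd || echo "containerd not found on PATH"

    # k3s bundles its own containerd binary under its data directory
    ls -l /var/lib/rancher/k3s/data/current/bin/containerd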

@logan2211
Author

Looks like the fix is in progress here: NVIDIA/nvidia-container-toolkit#777

@cdesiniotis
Contributor

Hi @logan2211, thanks for reporting this issue. It is on our radar and we are working on getting a fix out. We recently switched to fetching the currently applied container runtime configuration via the CLI (e.g. containerd config dump) rather than from a file (see NVIDIA/nvidia-container-toolkit@f477dc0). This appears to have broken systems where the CLI binary is not in the PATH, like k3s. We are working on using the TOML file as a fallback option when the CLI binary cannot be found: NVIDIA/nvidia-container-toolkit#777
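
Roughly, the two approaches differ as follows (a sketch based on the paths and the error in the log above; the exact code paths are in the linked commit and PR):

    # Previous behaviour: read the containerd config from the file on disk
    cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml

    # Current behaviour: ask the containerd CLI for the effective (merged) config;
    # this is the command that exits with status 127 when the binary cannot be found
    chroot /host containerd config dump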

@logan2211
Author

logan2211 commented Nov 7, 2024

Thanks, we are trying to upgrade the cluster urgently due to the CVEs. I suppose one possible workaround may be to downgrade gpu-operator to v24.6.2 and override the driver version to 550.127.05?

edit: after testing, the proposed workaround of downgrading gpu-operator and pinning the driver version in the values works fine; just noting it here for anyone else experiencing this issue.
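
For anyone replicating that workaround, a rough sketch of the downgrade, assuming a Helm-managed install (the release name, namespace, chart version string, and the availability of that driver tag for your distro are assumptions):

    helm upgrade gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator \
      --version v24.6.2 \
      --set driver.version=550.127.05 \
      -f values.yaml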

@cdesiniotis
Contributor

downgrade gpu-operator to v24.6.2 and override the driver version to 550.127.05?

This should work. Alternatively, you can stay on GPU Operator v24.9.0 and downgrade the NVIDIA Container Toolkit to 1.16.2, which does not contain this change.

@cdesiniotis added the bug label Nov 8, 2024
@cdesiniotis
Contributor

NVIDIA Container Toolkit 1.17.1 is now available and contains a fix for this issue: https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.17.1

I would recommend overriding the NVIDIA Container Toolkit version to 1.17.1 by configuring toolkit.version in ClusterPolicy.
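
If the ClusterPolicy is managed directly rather than through Helm values, a patch along these lines should work; the ClusterPolicy name (cluster-policy) and the exact image tag suffix are assumptions, so check kubectl get clusterpolicy and the published toolkit image tags first:

    # Assumed ClusterPolicy name and image tag suffix; verify both before applying
    kubectl patch clusterpolicy/cluster-policy --type merge \
      -p '{"spec": {"toolkit": {"version": "v1.17.1-ubuntu20.04"}}}'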

@elpsyr

elpsyr commented Nov 25, 2024

This approach can also solve the problem :)

Create a wrapper script at /usr/local/bin/containerd that forwards to the bundled containerd binary (the paths below are for RKE2):

    #!/bin/bash
    /var/lib/rancher/rke2/bin/containerd --config /var/lib/rancher/rke2/agent/etc/containerd/config.toml "$@"

Then make it executable:

    sudo chmod +x /usr/local/bin/containerd
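
Once the wrapper is in place, the command the toolkit runs (chroot /host containerd config dump) can find containerd on the host. A quick sanity check on the host, assuming the wrapper script above:

    # The wrapper should resolve like a regular binary and print the effective config
    command -v containerd
    containerd config dump | head -n 20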
