container-toolkit fails to start after upgrading to v24.9.0 on k3s cluster #1109
Comments
It makes sense that the command it is trying to run fails, there is no …
Looks like the fix is in progress here: NVIDIA/nvidia-container-toolkit#777
Hi @logan2211, thanks for reporting this issue. It is on our radar and we are working on getting a fix out for this. We recently switched to fetching the currently applied container runtime configuration via CLI (e.g. …)
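Presumably the kind of CLI lookup being referred to is something along these lines; the exact command is an assumption for illustration, not taken from the comment:

```bash
# Assumed illustration only: reading the active containerd configuration
# via the runtime's own CLI instead of parsing the config file directly.
# On a stock install this works because `containerd` is on the PATH; on
# k3s/RKE2 the binary lives under /var/lib/rancher/..., so such a lookup
# can fail inside the toolkit container.
containerd config dump
```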
Thanks, we are trying to upgrade the cluster urgently due to the CVEs. I suppose one possible workaround may be to downgrade gpu-operator to v24.6.2 and override the driver version to 550.127.05?

Edit: after testing, the proposed workaround (downgrading gpu-operator and pinning the driver version in the values) seems to work fine; just noting this for anyone else experiencing the issue.
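For reference, a minimal sketch of that downgrade workaround as a Helm command; the release name, namespace, and repo alias are placeholders, not taken from the thread:

```bash
# Sketch only: pin the chart to v24.6.2 and the driver to 550.127.05.
# Assumes the chart is installed as release "gpu-operator" in the
# "gpu-operator" namespace from the "nvidia" Helm repo alias.
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --version v24.6.2 \
  --reuse-values \
  --set driver.version=550.127.05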
This should work. Alternatively, you can stick with GPU Operator v24.9.0 and downgrade the NVIDIA Container Toolkit to 1.16.2, which does not contain this change.
NVIDIA Container Toolkit 1.17.1 is now available and contains a fix for this issue: https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.17.1

I would recommend overriding the NVIDIA Container Toolkit version to 1.17.1 by configuring …
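A minimal sketch of that override, assuming the chart's `toolkit.version` value; the image tag suffix (e.g. -ubuntu20.04) is an assumption, so check the published container-toolkit image tags before applying. The same field can be used to pin 1.16.2 as suggested earlier:

```bash
# Sketch only: keep GPU Operator v24.9.0 but override the toolkit image.
# Release name, namespace, repo alias, and tag suffix are assumptions.
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --version v24.9.0 \
  --reuse-values \
  --set toolkit.version=v1.17.1-ubuntu20.04
```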
This approach can solve this problem :)

Create /usr/local/bin/containerd (e.g. with `vi /usr/local/bin/containerd`) containing:

```bash
#!/bin/bash
/var/lib/rancher/rke2/bin/containerd --config /var/lib/rancher/rke2/agent/etc/containerd/config.toml "$@"
```

and make it executable:

```bash
sudo chmod +x /usr/local/bin/containerd
```
This is effectively a continuation of #1099, but I cannot re-open that issue, so I am opening a new one.
I am experiencing the same problem while attempting to upgrade from v24.6.0 to v24.9.0 on a k3s cluster. Perhaps it is a bad interaction between this recent commit and the non-standard CONTAINERD_* paths required for gpu-operator + k3s, specified in my cluster's values as:
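(The values block itself is not reproduced here. For illustration only, and not necessarily this cluster's exact settings, the k3s-specific toolkit environment that NVIDIA documents looks roughly like the following.)

```bash
# Illustration only: typical k3s overrides for the container toolkit,
# written to a values file. Paths are the standard k3s locations.
cat <<'EOF' > k3s-toolkit-values.yaml
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
EOF
```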
The pod log:
I confirmed that gpu-operator is setting the correct CONTAINERD_* paths according to my values:
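For anyone repeating this check, a minimal sketch, assuming the default `gpu-operator` namespace and the `nvidia-container-toolkit-daemonset` name:

```bash
# Inspect the CONTAINERD_* environment injected into the toolkit daemonset.
# Namespace and daemonset name assume the operator defaults; adjust if needed.
kubectl -n gpu-operator describe daemonset nvidia-container-toolkit-daemonset \
  | grep CONTAINERD
```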