bpf_prog_query(BPF_CGROUP_DEVICE) failed #154

Open

tkaufmann24 opened this issue Jan 11, 2022 · 13 comments

@tkaufmann24

1. Issue or feature description

When trying to run a docker container with:

  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  --gpus all \

This is the output:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown.

2. Steps to reproduce the issue

docker run -d \
  --name=jellyfin \
  -e PUID=1000 \
  -e PGID=1000 \
  -e TZ=Europe/Berlin \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  --gpus all \
  -p 8096:8096 \
  -v /home/${USER}/server/configs/jellyfin:/config \
  -v /home/${USER}/server/media:/data/media \
  --restart unless-stopped \
  lscr.io/linuxserver/jellyfin

3. Information to attach (optional if deemed irrelevant)

Kernel version
Linux 5.15.7-1-pve
Running in an LXC container with Debian 10, with the following cgroup2 arguments in the container config:

lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 235:* rwm
lxc.cgroup2.devices.allow: c 511:* rwm
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.cgroup2.devices.allow: c 239:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
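
For reference, the major numbers in the allow rules above can be cross-checked against the device nodes inside the container: 195 is the NVIDIA character device major (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-modeset) and 226 is /dev/dri, while the nvidia-uvm major is assigned dynamically, which is where the remaining entries come from. A quick check from inside the LXC container:

# the first number after the owner/group columns is the device major;
# it should appear in a matching lxc.cgroup2.devices.allow rule above
ls -l /dev/nvidia* /dev/dri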

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 495.46       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 33%   33C    P8     1W /  38W |      1MiB /  1999MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
@elezar
Member

elezar commented Jan 11, 2022

Could this be related to #151?

@klueska
Contributor

klueska commented Jan 18, 2022

This is strange indeed, as docker certainly should have applied a set of device filters before libnvidia-container is invoked (which assumes at least one device filter program already exists and attempts to update it).
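
A minimal way to check whether any device filter program is actually attached to a container's cgroup is bpftool, assuming it is installed and run as root on the host; the exact cgroup path depends on the cgroup driver and container ID:

# list every BPF program attached in the cgroup v2 hierarchy; a working
# setup shows at least one program with attach type "device" on the
# container's cgroup
bpftool cgroup tree /sys/fs/cgroup

# or inspect a single container cgroup directly (example path for the
# systemd cgroup driver)
bpftool cgroup show /sys/fs/cgroup/system.slice/docker-<container-id>.scope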

@klueska
Contributor

klueska commented Jan 18, 2022

That said, I'm not that familiar with LXC (and how one can create LXC containers using the docker command line as you show). Could you give a more complete example of how to reproduce this?

@Scronkfinkle

I am also running into this error. It appeared after I enabled cgroups v2. I set the kernel GRUB_CMDLINE_LINUX parameter as described here: https://docs.docker.com/config/containers/runmetrics/#changing-cgroup-version

Originally, I was getting an error like this:

nvidia-container-cli: mount error: open failed: /sys/fs/cgroup/devices/user.slice/devices.allow: permission denied": unknown.

After adding the kernel param, updating and rebooting, I now get

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown.
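
For reference, the cgroup v2 switch described on that Docker page boils down to a kernel parameter; a minimal sketch of the change on a GRUB-based, systemd-managed distro (other distros regenerate the GRUB config differently, e.g. with grub2-mkconfig):

# /etc/default/grub: append the systemd switch to the existing parameters
# ("..." stands for whatever is already there)
GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=1"

# regenerate the GRUB config and reboot
sudo update-grub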

@Scronkfinkle

To follow up, the issue disappears if I toggle the no-cgroups configuration option. When set to true, the error goes away, but it reappears when I set it to false. This is occurring for accounts using rootless Docker.
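
For reference, the option being toggled lives in /etc/nvidia-container-runtime/config.toml under the [nvidia-container-cli] section; the packaged default ships it commented out as false:

[nvidia-container-cli]
#no-cgroups = false
no-cgroups = true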

@elezar
Member

elezar commented Feb 23, 2022

@Scronkfinkle we released an updated version of the NVIDIA Container Toolkit (including libnvidia-container1) that should address this issue. Could you update to v1.8.1 and see if the problem persists?

@klueska
Contributor

klueska commented Feb 23, 2022

I don't think updating to v1.8.1 will help in the case of running rootless. No matter what we do, running rootless containers will never give libnvidia-container permission to modify cgroups, and the only way to make this work will be to set no-cgroups: true.

A similar bug was addressed by v1.8.1 (#151 (comment)), but this one is different (and I wouldn't actually call it a bug, just a lack of documentation on how to run with rootless Docker).
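
A quick way to tell whether the daemon a given client is talking to is rootless (and therefore needs no-cgroups = true) is to check the security options it reports; a rootless daemon includes name=rootless in the list:

docker info --format '{{.SecurityOptions}}'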

@Scronkfinkle

@elezar for what it's worth, I am running 1.8.1

# apt policy libnvidia-container1
libnvidia-container1:
  Installed: 1.8.1-1

# apt policy nvidia-container-toolkit
nvidia-container-toolkit:
  Installed: 1.8.1-1

@klueska when you say "just a lack of documentation on how to run with rootless", do you mean that an undocumented solution exists, or instead that it's undocumented that one cannot run rootless docker with libnvidia-container?

@eckozen84

eckozen84 commented Aug 27, 2022

Just wanted to add that I have an almost identical configuration/setup to the OP and am running into the same issue. I am running NVIDIA Container Toolkit 1.10.0. The LXC container, which I am running on Debian/Proxmox, is an unprivileged one.

The only solution that seems to work for now is to toggle no-cgroups=true.

I have attached the nvidia-container-toolkit debug logs if it helps. First is with no-cgroups=true and second is no-cgroups=false.

nvidia-container-toolkit_no-cgroup-true.log
nvidia-container-toolkit_no-cgroup-false.log

@hholst80

no-cgroups=true is a no-go for me as I need to run GPU container workloads using both root and rootless config on the same host.

@gitwittidbit

So... is there a solution (other than toggling off cgroups)?

@hholst80

So... is there a solution (other than toggling off cgroups)?

This works fine for me now. I am using cgroups v2 and both root and rootless docker on the same host.

Don't take this as proof that it will work in general.

Fedora 38 + docker-ce repo (v24.0.5)

[user@goblin ~]$ docker version
Client: Docker Engine - Community
 Version:           24.0.5
 API version:       1.43
 Go version:        go1.20.6
 Git commit:        ced0996
 Built:             Fri Jul 21 20:37:15 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.5
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.6
  Git commit:       a61e2b4
  Built:            Fri Jul 21 20:35:40 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.22
  GitCommit:        8165feabfdfe38c65b599c4993d227328c231fca
 nvidia:
  Version:          1.1.8
  GitCommit:        v1.1.8-0-g82f18fe
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
 rootlesskit:
  Version:          1.1.1
  ApiVersion:       1.1.1
  NetworkDriver:    slirp4netns
  PortDriver:       builtin
  StateDir:         /tmp/rootlesskit3951089129
 slirp4netns:
  Version:          1.2.1
  GitCommit:        09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
[user@goblin ~]$
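
As a sanity check of both daemons, something like the following can be run once against the rootful daemon and once against the rootless one (the CUDA image tag and the rootless socket path below are just common defaults and may differ on other setups):

# rootful daemon
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# rootless daemon of the current user (default rootless socket path)
DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock \
  docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi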

@hholst80

hholst80 commented Jan 4, 2024

Addendum: After upgrading to Fedora 39 it stopped working.

nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown
