bpf_prog_query(BPF_CGROUP_DEVICE) failed #154

Open

tkaufmann24 opened this issue Jan 11, 2022 · 13 comments

@tkaufmann24

1. Issue or feature description

When trying to run a docker container with:

  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  --gpus all \

This is the output:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown.

2. Steps to reproduce the issue

docker run -d \
  --name=jellyfin \
  -e PUID=1000 \
  -e PGID=1000 \
  -e TZ=Europe/Berlin \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  --gpus all \
  -p 8096:8096 \
  -v /home/${USER}/server/configs/jellyfin:/config \
  -v /home/${USER}/server/media:/data/media \
  --restart unless-stopped \
  lscr.io/linuxserver/jellyfin

3. Information to attach (optional if deemed irrelevant)

Kernel version
Linux 5.15.7-1-pve
Running in an LXC container with Debian 10, with the following cgroup2 arguments in the container config:

lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 235:* rwm
lxc.cgroup2.devices.allow: c 511:* rwm
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.cgroup2.devices.allow: c 239:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
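
For reference, the major numbers in the allow rules above can be cross-checked against the device nodes inside the container: 195 is the NVIDIA character device major (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-modeset) and 226 is /dev/dri, while the nvidia-uvm major is assigned dynamically, which is where the remaining entries come from. A quick check from inside the LXC container:

# the first number after the owner/group columns is the device major;
# it should appear in a matching lxc.cgroup2.devices.allow rule above
ls -l /dev/nvidia* /dev/dri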

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 495.46       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 33%   33C    P8     1W /  38W |      1MiB /  1999MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
@elezar
Member

elezar commented Jan 11, 2022

Could this be related to #151?

@klueska
Contributor

klueska commented Jan 18, 2022

This is strange indeed, as docker certainly should have applied a set of device filters before libnvidia-container is invoked (which assumes at least one device filter program already exists and attempts to update it).
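
A minimal way to check whether any device filter program is actually attached to a container's cgroup is bpftool, assuming it is installed and run as root on the host; the exact cgroup path depends on the cgroup driver and container ID:

# list every BPF program attached in the cgroup v2 hierarchy; a working
# setup shows at least one program with attach type "device" on the
# container's cgroup
bpftool cgroup tree /sys/fs/cgroup

# or inspect a single container cgroup directly (example path for the
# systemd cgroup driver)
bpftool cgroup show /sys/fs/cgroup/system.slice/docker-<container-id>.scope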

@klueska
Contributor

klueska commented Jan 18, 2022

That said, I'm not that familiar with LXC (and how one can create LXC containers using the docker command line as you show). Could you give a more complete example of how to reproduce this?

@Scronkfinkle

I am also running into this error. It appeared after I enabled cgroups v2. I set the kernel GRUB_CMDLINE_LINUX parameter as described here: https://docs.docker.com/config/containers/runmetrics/#changing-cgroup-version

Originally, I was getting an error like this:

nvidia-container-cli: mount error: open failed: /sys/fs/cgroup/devices/user.slice/devices.allow: permission denied": unknown.

After adding the kernel param, updating and rebooting, I now get

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown.
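
For reference, the cgroup v2 switch described on that Docker page boils down to a kernel parameter; a minimal sketch of the change on a GRUB-based, systemd-managed distro (other distros regenerate the GRUB config differently, e.g. with grub2-mkconfig):

# /etc/default/grub: append the systemd switch to the existing parameters
# ("..." stands for whatever is already there)
GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=1"

# regenerate the GRUB config and reboot
sudo update-grub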

@Scronkfinkle

To follow up, the issue disappears if I toggle the no-cgroups configuration option. When set to true, the error goes away, but it reappears when I set it to false. This is occurring for accounts using rootless Docker.
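
For reference, the option being toggled lives in /etc/nvidia-container-runtime/config.toml under the [nvidia-container-cli] section; the packaged default ships it commented out as false:

[nvidia-container-cli]
#no-cgroups = false
no-cgroups = true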

@elezar
Member

elezar commented Feb 23, 2022

@Scronkfinkle we released an updated version of the NVIDIA Container Toolkit (including libnvidia-container1) that should address this issue. Could you update to v1.8.1 and see if the problem persists?

@klueska
Contributor

klueska commented Feb 23, 2022

I don't think updating to v1.8.1 will help in the case of running rootless. No matter what we do, running rootless containers will never give libnvidia-container permission to modify cgroups, and the only way to make this work will be to set no-cgroups: true.

A similar bug was addressed by v1.8.1 (#151 (comment)), but this one is different (and I wouldn't actually call it a bug, just a lack of documentation on how to run with rootless Docker).
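
A quick way to tell whether the daemon a given client is talking to is rootless (and therefore needs no-cgroups = true) is to check the security options it reports; a rootless daemon includes name=rootless in the list:

docker info --format '{{.SecurityOptions}}'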

@Scronkfinkle

@elezar for what it's worth, I am running 1.8.1

# apt policy libnvidia-container1
libnvidia-container1:
  Installed: 1.8.1-1

# apt policy nvidia-container-toolkit
nvidia-container-toolkit:
  Installed: 1.8.1-1

@klueska when you say "just a lack of documentation on how to run with rootless", do you mean that an undocumented solution exists, or instead that it's undocumented that one cannot run rootless docker with libnvidia-container?

@eckozen84

eckozen84 commented Aug 27, 2022

Just wanted to add that I have an almost identical configuration/setup to the OP and am running into the same issue. I am running NVIDIA Container Toolkit 1.10.0. The LXC container, which I am running on Debian/Proxmox, is an unprivileged one.

The only solution that seems to work for now is to toggle no-cgroups=true.

I have attached the nvidia-container-toolkit debug logs if it helps. First is with no-cgroups=true and second is no-cgroups=false.

nvidia-container-toolkit_no-cgroup-true.log
nvidia-container-toolkit_no-cgroup-false.log

@hholst80

no-cgroups=true is a no-go for me as I need to run GPU container workloads using both root and rootless config on the same host.

@gitwittidbit

So... is there a solution (other than toggling off cgroups)?

@hholst80

So... is there a solution (other than toggling off cgroups)?

This works fine for me now. I am using cgroups v2 and both root and rootless docker on the same host.

Don't take this as proof that it will work in general.

Fedora 38 + docker-ce repo (v24.0.5)

[user@goblin ~]$ docker version
Client: Docker Engine - Community
 Version:           24.0.5
 API version:       1.43
 Go version:        go1.20.6
 Git commit:        ced0996
 Built:             Fri Jul 21 20:37:15 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.5
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.6
  Git commit:       a61e2b4
  Built:            Fri Jul 21 20:35:40 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.22
  GitCommit:        8165feabfdfe38c65b599c4993d227328c231fca
 nvidia:
  Version:          1.1.8
  GitCommit:        v1.1.8-0-g82f18fe
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
 rootlesskit:
  Version:          1.1.1
  ApiVersion:       1.1.1
  NetworkDriver:    slirp4netns
  PortDriver:       builtin
  StateDir:         /tmp/rootlesskit3951089129
 slirp4netns:
  Version:          1.2.1
  GitCommit:        09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
[user@goblin ~]$
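
As a sanity check of both daemons, something like the following can be run once against the rootful daemon and once against the rootless one (the CUDA image tag and the rootless socket path below are just common defaults and may differ on other setups):

# rootful daemon
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# rootless daemon of the current user (default rootless socket path)
DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock \
  docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi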

@hholst80

hholst80 commented Jan 4, 2024

Addendum: After upgrading to Fedora 39 it stopped working.

nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown
