1. Executive summary
Under specific conditions, containers may be abruptly detached from the GPUs they were initially connected to. We have determined the root cause of this issue and identified the environments in which it can occur. Workarounds for the affected environments are provided at the end of this document until a proper fix is released.
2. Summary of the issue
Containerized GPU workloads may suddenly lose access to their GPUs. This situation occurs when systemd is used to manage the cgroups of the container and it is triggered to reload any Unit files that have references to NVIDIA GPUs (e.g. with something as simple as a systemctl daemon-reload).
When the container loses access to the GPU, you will see the following error message from the console output:
Failed to initialize NVML: Unknown Error
The container needs to be deleted once the issue occurs.
When it is restarted (manually or automatically depending on the use of a container orchestration platform), it will regain access to the GPU.
The issue originates from the fact that recent versions of runc require that symlinks be present under /dev/char to any device nodes being injected into a container. Unfortunately, these symlinks are not present for NVIDIA devices, and the NVIDIA GPU driver does not (currently) provide a means for them to be created automatically.
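For example, for the device node /dev/nvidia0 (major number 195, minor number 0), runc expects a symlink of the form /dev/char/195:0 -> ../nvidia0 to exist; this is exactly the kind of link that the nvidia-ctk workaround described below creates ahead of time.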
A fix will be included in the next patch release of all supported NVIDIA GPU drivers.
3. Affected environments
Affected environments are those using runc and enabling systemd cgroup management at the high-level container runtime.
If the system is NOT using systemd to manage cgroups, then it is NOT subject to this issue.
An exhaustive list of the affected environments is provided below:
Docker environment using containerd / runc:
Specific condition:
cgroup driver enabled with systemd (e.g. parameter "exec-opts": ["native.cgroupdriver=systemd"] set in /etc/docker/daemon.json).
A newer Docker version is used where systemd cgroup management is the default (e.g. on Ubuntu 22.04).
Note: To check if Docker uses systemd cgroup management, run the following command (the output below indicates that the systemd cgroup driver is enabled):
$ docker info
...
Cgroup Driver: systemd
Cgroup Version: 1
K8s environment using containerd / runc:
Specific condition:
SystemdCgroup = true in the containerd configuration file (usually located here: /etc/containerd/config.toml) as shown below:
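A typical fragment of a version 2 config.toml looks like the following (the exact plugin section names can vary between containerd releases); a quick check is grep SystemdCgroup /etc/containerd/config.toml:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true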
K8s environment (including OpenShift) using cri-o / runc:
Specific condition:
cgroup_manager enabled with systemd in the cri-o configuration file (usually located here: /etc/crio/crio.conf or /etc/crio/crio.conf.d/00-default) as shown below (sample with OpenShift):
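A typical fragment looks like the following (the exact file layout can vary between cri-o and OpenShift releases):
[crio.runtime]
cgroup_manager = "systemd"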
Note: Podman environments use crun by default and are not subject to this issue unless runc is configured as the low-level container runtime to be used.
4. How to check if you are affected
You can use the following steps to confirm that your system is affected. After you implement one of the workarounds (mentioned in the next section), you can repeat the steps to confirm that the error is no longer reproducible.
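For Docker environments
Run a test container that queries the GPUs in a loop. The command below is illustrative (the CUDA image tag is a placeholder; any CUDA base image that ships nvidia-smi will do). Make sure to mount the different device nodes as shown; they are needed to narrow the problem down to this specific issue:
$ sudo docker run -d --rm --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04 \
    bash -c "while true; do nvidia-smi -L; sleep 5; done"
If your system has more than 1 GPU, append the command with an additional --device mount per GPU (e.g. add --device=/dev/nvidia1 on a system that has 2 GPUs).
Check the logs from the container; they should list the GPU(s):
$ sudo docker logs <CONTAINER_ID>
Then initiate a daemon-reload:
$ sudo systemctl daemon-reload
Check the logs from the container again. On an affected system, the output switches to:
Failed to initialize NVML: Unknown Error
For K8s environments
Run a test pod that queries the GPUs in a loop. The manifest below is illustrative (the pod name and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: nvml-check
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04
    command: ["bash", "-c", "while true; do nvidia-smi -L; sleep 5; done"]
    resources:
      limits:
        nvidia.com/gpu: 1
Check the logs from the pod:
$ kubectl logs nvml-check
Then initiate a daemon-reload on the node where the pod is running:
$ sudo systemctl daemon-reload
Check the logs from the pod again. On an affected system, the output switches to:
Failed to initialize NVML: Unknown Error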
5. Workarounds
The following workarounds are available for both standalone Docker environments and K8s environments (multiple options are presented in order of preference; the one at the top is the most recommended):
For Docker environments
Using the nvidia-ctk utility:
The NVIDIA Container Toolkit v1.12.0 includes a utility for creating symlinks in /dev/char for all possible NVIDIA device nodes required for using GPUs in containers. This can be run as follows:
sudo nvidia-ctk system create-dev-char-symlinks \
--create-all
This command should be configured to run at boot on each node where GPUs will be used in containers. It requires that the NVIDIA driver kernel modules have been loaded at the point where it is run.
A simple udev rule to enforce this can be seen below:
# This will create /dev/char symlinks to all device nodes
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
A good place to install this rule would be: /lib/udev/rules.d/71-nvidia-dev-char.rules
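After installing the rule, reload the udev rules (e.g. with sudo udevadm control --reload-rules) or reboot the node for it to take effect.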
In cases where the NVIDIA GPU Driver Container is used, the path to the driver installation must be specified. In this case the command should be modified to:
sudo nvidia-ctk system create-dev-char-symlinks \
--create-all \
--driver-root={{NVIDIA_DRIVER_ROOT}}
Where {{NVIDIA_DRIVER_ROOT}} is the path to which the NVIDIA GPU Driver container installs the NVIDIA GPU driver and creates the NVIDIA Device Nodes.
Explicitly disabling systemd cgroup management in Docker
Set the parameter "exec-opts": ["native.cgroupdriver=cgroupfs"] in the /etc/docker/daemon.json file and restart docker.
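For example, a minimal /etc/docker/daemon.json containing only this setting would look like the following (keep any other keys you already have, such as the nvidia runtime registration):
{
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
$ sudo systemctl restart docker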
Downgrading to docker.io packages where systemd is not the default cgroup manager (and not overriding that of course).
For K8s environments
Deploying GPU Operator 22.9.2 will automatically fix the issue on all K8s nodes of the cluster (the fix is integrated inside the validator pod which will run when a new node is deployed or at every reboot of the node).
For deployments using the standalone k8s-device-plugin (i.e. not through the GPU Operator), the following steps are required:
When installing via the k8s-device-plugin Helm chart, pass the --set compatWithCPUManager=true parameter. This ensures that the k8s-device-plugin pod runs with the environment variable PASS_DEVICE_SPECS=true set (refer to the chart's values file). Please note that this runs the k8s-device-plugin in privileged mode.
When installing from a static YAML spec, pass the environment variable PASS_DEVICE_SPECS=true explicitly to the k8s-device-plugin DaemonSet. The pod also needs to run with a privileged SecurityContext, for example as sketched below.
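An illustrative fragment of the DaemonSet's container spec (other fields omitted):
        env:
          - name: PASS_DEVICE_SPECS
            value: "true"
        securityContext:
          privileged: true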
Installing a udev rule as described in the previous section can also be used to work around this issue. Be sure to pass the correct {{NVIDIA_DRIVER_ROOT}} in cases where the driver container is also in use.
Explicitly disabling systemd cgroup management in containerd or cri-o:
For cri-o, remove the parameter cgroup_manager = "systemd" from the cri-o configuration file (usually located here: /etc/crio/crio.conf or /etc/crio/crio.conf.d/00-default) and restart cri-o. For containerd, set SystemdCgroup = false in /etc/containerd/config.toml and restart containerd.
Downgrading to a version of the containerd.io package where systemd is not the default cgroup manager (and not overriding that, of course).
Upgrading runc to at least version 1.1.7, which includes a fix that avoids the issue discussed here. Note that the systemd version must also be >= 240.
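The versions in use can be checked with:
$ runc --version
$ systemctl --version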
When the NVIDIA driver is installed directly on the host (i.e. without the driver container from the GPU Operator), make sure the following conditions are met before the device plugin or any other containers run. This ensures that all required devices are injected into containers with GPU requests.
The modules nvidia, nvidia-uvm, and nvidia-modeset are loaded, e.g. using modprobe nvidia; modprobe nvidia-uvm; modprobe nvidia-modeset
All necessary control devices are created using nvidia-modprobe -u -m -c0 and nvidia-smi.