no runtime for "nvidia" is configured #662
Comments
Of note, I have also tried without KinD and instead using k0s, with the exact same result.
Could you confirm that you're able to run
I can confirm that it does not run inside kind. On the bare metal:
from a container inside of k0s:
and from inside kind:
with this as my deployment:
What are you doing to inject GPU support into the docker container that kind starts to represent the k8s node? Something like this is necessary:
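A rough sketch of the approach from kubernetes-sigs/kind#3257, assuming accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml and nvidia set as docker's default runtime; the cluster name and file name below are placeholders:

```sh
# Request all GPUs in the kind "node" container via a volume mount that the
# NVIDIA runtime interprets as a device request.
cat <<'EOF' > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  extraMounts:
  - hostPath: /dev/null
    containerPath: /var/run/nvidia-container-devices/all
EOF
kind create cluster --name gpu-test --config kind-config.yaml
```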
Using the example config you supplied I get the same results:
I forgot to include that config file: /etc/nvidia-container-runtime/config.toml
I even gave that create-cluster.sh script a try:
Same results though.
This appears to be the same issue as NVIDIA/k8s-device-plugin#478.
Backing up … what about running with GPUs under docker in general (i.e. without kind)?

docker run -e NVIDIA_VISIBLE_DEVICES=all ubuntu:22.04 nvidia-smi

If things are not configured properly to have that work, then kind will not work either.
To be clear, that will work so long as accept-nvidia-visible-devices-as-volume-mounts = false. Once that is configured to true, you would need to run:

docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:22.04 nvidia-smi
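A quick way to check which of the two modes a host is in, assuming the toolkit config lives at its default path:

```sh
# Prints the current setting; "true" means GPUs are requested via volume
# mounts rather than the NVIDIA_VISIBLE_DEVICES environment variable.
grep accept-nvidia-visible-devices-as-volume-mounts /etc/nvidia-container-runtime/config.toml
```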
Both seem to work:
OK. That's encouraging. So you're saying that even with that configured properly, if you run the cluster-create.sh script from the k8s-dra-driver repo, docker exec into the worker node created by kind, and run nvidia-smi, it doesn't work?
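For reference, that check could look like the following; the node name is a placeholder (docker ps lists the actual kind node containers):

```sh
# Exec into the kind worker "node" container and confirm the driver is visible.
docker exec -it <cluster-name>-worker nvidia-smi
```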
Well, at the moment ./create-cluster.sh ends with this error:
and
The build is successful from:

so I'm a bit confused about what is wrong. I tried doing an equivalent ctr run with:
but it is just hanging here with no output.
I figured out the equivalent ctr command (I had nvidiacontainer missing above):
in comparison to the docker:
I'm kind of uncertain why that file exists here but not in the ctr form; probably some magic I'm unaware of.
ctr does not use the nvidia-container-runtime even if you have configured the CRI plugin in the containerd config to use it. The ctr command does not use CRI, so it would need to be configured elsewhere to use the nvidia runtime (but that wouldn't help with your current problem of trying to get k8s to work anyway, since k8s does communicate with containerd over CRI).
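For completeness, if one did want to exercise the nvidia runtime from ctr directly (outside CRI), something along these lines should work; the image tag and flags here are illustrative rather than taken from this thread:

```sh
# Pull a CUDA image and run it with the runc shim pointed at the
# nvidia-container-runtime binary, requesting all GPUs via the environment.
sudo ctr image pull docker.io/nvidia/cuda:12.3.1-base-ubuntu22.04
sudo ctr run --rm \
  --runc-binary=/usr/bin/nvidia-container-runtime \
  --env NVIDIA_VISIBLE_DEVICES=all \
  docker.io/nvidia/cuda:12.3.1-base-ubuntu22.04 gpu-smoke-test nvidia-smi
```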
Since I don't have k0s experience, let's start out assuming that your goal is to install the GPU Operator in a Kind cluster with GPU support. This involves two stages:
I've tried to provide more details for each of the stages below. In order to get to the bottom of this issue we would need to identify which of these is not working as expected. Once we've run through the steps for kind it may be possible to map the steps to something like k0s. Note that as prerequisites:
Starting a kind cluster with GPUs and drivers injected

This needs to be set up as described in kubernetes-sigs/kind#3257 (comment). This means that we need to do the following:
In order to verify that the nodes have the GPU devices and driver installed correctly, one can exec into the Kind worker node and run

This should give the same output as on the host. I noted in your example that you are starting a single-node Kind cluster. This should not affect the behaviour, but it is a difference between our cluster definitions and the ones that you use.

Installing the GPU Operator on the Kind cluster

At this point, the Kind cluster represents a k8s cluster with only the GPU Driver installed. Even though the NVIDIA Container Toolkit is installed on the host, it has not been injected into the nodes. This means that we should do one of the following:
For the Kind demo included in this repo, we don't use the GPU operator, and as such we install the container toolkit when creating the cluster: https://github.com/NVIDIA/k8s-device-plugin/blob/2bef25804caf5924f35a164158f097f954fe4c74/demo/clusters/kind/scripts/create-kind-cluster.sh#L38-L47

Note that the Kind nodes themselves are effectively Debian nodes and are not officially supported. Most of this might be due to driver container limitations and may not be applicable in this case, since we are dealing with a preinstalled driver.
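A hand-run sketch of installing the toolkit into the kind node; the node name and the standard libnvidia-container apt repository steps are assumptions here, and the referenced script remains the authoritative version:

```sh
node=k8s-dra-driver-cluster-worker   # placeholder node name
docker exec -it "${node}" bash -c '
  apt-get update && apt-get install -y curl gnupg
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
    | sed "s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g" \
    > /etc/apt/sources.list.d/nvidia-container-toolkit.list
  apt-get update && apt-get install -y nvidia-container-toolkit
'
```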
on the host:
However, I seem to be stuck on the install inside the worker:
Of note, I am using the kind cluster config from this repo: so I am no longer running single-node.
For "reasons" we were injecting the
Remove the lines here in the kind cluster config. (Or unmount the injected binary in the running node.)

I have an open action item to improve the installation of the toolkit in the DRA driver repo, but have not gotten around to it.
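If the cluster is already up, an alternative is to undo the injected mount in place inside the node; the node name is a placeholder:

```sh
# Remove the bind-mounted nvidia-ctk so apt can install/overwrite the toolkit.
docker exec -it <cluster-name>-worker umount /usr/bin/nvidia-ctk
```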
So unmounting /usr/bin/nvidia-ctk fixed the apt issues, and I can install nvidia-container-toolkit just fine, but that doesn't solve the problem: the nvidia-device-plugin-daemonset still seems unable to see the GPU
@joshuacox is containerd in the Kind node configured to use the nvidia runtime? See https://github.com/NVIDIA/k8s-device-plugin/blob/2bef25804caf5924f35a164158f097f954fe4c74/demo/clusters/kind/scripts/create-kind-cluster.sh#L50-L55 where we do this for the device plugin. If you're installing the GPU Operator with
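One way to do that configuration by hand inside the node, assuming nvidia-ctk is already installed there (node name is a placeholder):

```sh
# Add an "nvidia" runtime handler to containerd's CRI config and make it the
# default, then restart containerd so the kubelet can use it.
docker exec -it <cluster-name>-worker bash -c '
  nvidia-ctk runtime configure --runtime=containerd --set-as-default
  systemctl restart containerd
'
```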
I am just fine with setting toolkit.enabled=true or any other flags; I just want it to work. It seems to be getting closer; do I need to umount another symlink here?
that was from a ./create-cluster.sh (in /k8s-dra-driver/demo/clusters/kind) with this afterwards:
This issue is probably due to the symlink creation not working under kind. Please update the environment for the validator in the ClusterPolicy to disable the creation of symlinks, as described in the error message. See also #567
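For concreteness, the kind of override being described would look roughly like this; the environment variable name is taken from the error message discussed in #567 and should be treated as an assumption, and the other flags are illustrative:

```sh
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set "validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION" \
  --set-string "validator.driver.env[0].value=true"
```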
Environment for the validator in ClusterPolicy? I have a tiny section of the daemonset that has a clusterpolicy
All of this seems way beyond the documentation. @elezar is this because, as you said, the Kind nodes are "Debian nodes and are not officially supported"? If so, what nodes are supported? On this page: https://nvidia.github.io/libnvidia-container/stable/deb/ it says: ubuntu18.04, ubuntu20.04, ubuntu22.04, debian10, debian11. So is this all because my host OS is debian 12?
It just means when you start the operator, additionally pass:
I also tried removing the quotes around true to match my other set lines, and got the exact same results.
I am also not seeing a validator section in the values.yaml: am I looking in the wrong place?
use

Not all possible values are shown in the top-level values.yaml.
omg @klueska that one works!
And to be clear, for any of you stumbling in from the internet, here are my complete additional steps, beyond
Now then, why did I have to do all this extra work over and above the documentation? Is it just because I'm on debian 12? (I started on Arch Linux; before opening this issue I decided debian might be more stable.) If this is the expected behavior I'll gladly make a PR documenting all this, but somehow I feel this is not the case? I am installing jammy 22.04 to a partition to test some more.
You're probably the first to run the operator under
Hmmm, now I am going to have to give this another shot using another method. As I said, I've tried k0s above and will give that a second try now that I have a working sanity check. I am familiar with bootstrapping a cluster using both kubeadm and kubespray; I even scripted it all out with another project, kubash. Are there any other setups that anyone has tried? What is 'supported'?
I've transferred this issue to the
@joshuacox just for reference: the compatibility with Debian that is an issue here is not that of the NVIDIA Container Toolkit (or even the device plugin), but that of the GPU Operator. For the official support matrix see: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#supported-operating-systems-and-kubernetes-platforms

Note that it is my understanding that this is largely due to the driver container, but there may be some subtle issues that arise from not having qualified the stack on the target operating system. For what it's worth, we are starting to look at using kind for basic internal tests, and as we address some rough edges these should make it into the released versions, although the question of official platform support is not something that I can speak to at present.
@elezar and @klueska thank you guys for helping so much! And thanks for the transfer to this repo; this is probably where I should've submitted the issue in the first place. @elezar how can I help with building these internal tests? I am looking around this repo and I don't see a demo directory like the one we were dealing with above; is that the sort of thing we might want to build here? I'd certainly be interested in facilitating any part of this process that I can.
Although @shivamerla and @cdesiniotis should also chime in here, I think creating a PR adding a
This is fine by me. @joshuacox contributions are welcome! @joshuacox there is one minor detail I would like to point out. In your helm install command, you explicitly set
To clarify: since
@elezar @cdesiniotis I have set it to false for now; I have a WIP branch here. I'm not seeing any nvidia-driver pods, but I definitely have a lot more pods and, more importantly, an allocatable GPU with the release chart. At the moment, if I install the release chart
yet I only get these pods when I use the local chart:
with the only difference between the two scripts being:
I am running the full delete-cluster, create-cluster, and install-operator flow with the demo.sh, e.g. for the local chart:
for the release chart:
Something eludes me as to what the difference is at the moment; I'll do some diffing around to investigate. I'll go ahead and prep a PR soon, but it's still a WIP for now.
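One low-effort way to pin down the difference is to render both charts to plain manifests and diff them; the chart path, repo URL, and release names below are assumptions:

```sh
# Render the published chart and the local working copy with the same values,
# then compare the generated manifests.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm template gpu-operator nvidia/gpu-operator -n gpu-operator > /tmp/release.yaml
helm template gpu-operator ./deployments/gpu-operator -n gpu-operator > /tmp/local.yaml
diff -u /tmp/release.yaml /tmp/local.yaml | less
```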
@elezar and @klueska, the only real difference I can see is the gdrcopy section on the local driver; am I missing something else?
I have a draft PR open here
@klueska @elezar @cdesiniotis the PR is open and ready if only the release chart is considered. I am still having issues with the local chart in the

In short, the release chart works great, e.g.
However, the local install falls a bit short with gdrcopy both enabled and disabled, e.g.
I'm failing to see the real difference in the actual chart, though.
1. Issue or feature description
When following the quickstart, I end up with this error in
k describe po -n gpu-operator gpu-feature-discovery-6tk4h
Warning FailedCreatePodSandBox 0s (x5 over 49s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
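For context, this error means containerd's CRI plugin has no runtime handler named nvidia. Inside the node, a configured handler would show up in /etc/containerd/config.toml roughly as sketched below; the node name is a placeholder and the stanza is the shape that nvidia-ctk runtime configure normally generates:

```sh
# Check whether the node's containerd knows about an "nvidia" runtime handler.
docker exec -it <cluster-name>-worker \
  grep -A4 'runtimes.nvidia' /etc/containerd/config.toml
# Expected shape of the stanza if it is configured:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#     runtime_type = "io.containerd.runc.v2"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#       BinaryName = "/usr/bin/nvidia-container-runtime"
```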
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
with my kind-config.yaml
Common error checking:
- nvidia-smi -a on your host
- docker run --rm nvidia/cuda:12.3.1-devel-centos7 nvidia-smi
- /etc/docker/daemon.json and /etc/containerd/config.toml
- sudo journalctl -r -u kubelet

Additional information that might help better understand your environment and reproduce the bug:
- docker version

and the helm below fails as well:
- uname -a
Linux saruman 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
- dmesg
none that I see?
- dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
- nvidia-container-cli -V
cli-version: 1.14.3
lib-version: 1.14.3
build date: 2023-10-19T11:32+00:00
build revision: 1eb5a30a6ad0415550a9df632ac8832bf7e2bbba
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
the above page no longer exists.
sudo journalctl -u nvidia-container-toolkit
-- No entries --