
Local installation of Charmed Kubernetes with GPU never completes #830

Open
iiot-architect opened this issue Feb 15, 2024 · 9 comments


@iiot-architect

I'm trying to install Charmed Kubernetes with an NVIDIA GPU on an Amazon EC2 instance (g5.xlarge), using the local (LXD) cloud:

sudo snap install juju --classic
juju add-credential localhost
juju clouds
juju bootstrap
juju add-model k8s
juju deploy charmed-kubernetes
juju config calico ignore-loose-rpf=true
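
For reference, while a deployment like this settles, its progress can be watched with standard Juju tooling; a minimal sketch, not specific to this issue:

watch -n 30 juju status    # re-run juju status every 30 seconds
juju debug-log --replay    # dump the charm/agent log buffer from the start, then keep following it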

However, the process doesn't seem to have finished after more than 3 hours:

ubuntu@ip-10-10-1-38:~$ juju status
Model  Controller           Cloud/Region         Version  SLA          Timestamp
k8s    localhost-localhost  localhost/localhost  3.3.1    unsupported  09:31:30Z

App                       Version  Status   Scale  Charm                     Channel      Rev  Exposed  Message
calico                    3.21.4   active       5  calico                    1.27/stable   87  no       Calico is active
containerd                         blocked      5  containerd                1.27/stable   65  no       containerd resource binary containerd-stress failed a version check
easyrsa                   3.0.1    active       1  easyrsa                   1.27/stable   42  no       Certificate Authority connected.
etcd                      3.4.22   active       3  etcd                      1.27/stable  742  no       Healthy with 3 known peers
kubeapi-load-balancer     1.18.0   active       1  kubeapi-load-balancer     1.27/stable   79  yes      Loadbalancer ready.
kubernetes-control-plane  1.27.10  waiting      2  kubernetes-control-plane  1.27/stable  274  no       Waiting for 4 kube-system pods to start
kubernetes-worker         1.27.10  waiting      3  kubernetes-worker         1.27/stable  112  yes      Waiting for kubelet to start.

Unit                         Workload  Agent  Machine  Public address  Ports         Message
easyrsa/0*                   active    idle   0        10.132.163.17                 Certificate Authority connected.
etcd/0*                      active    idle   1        10.132.163.184  2379/tcp      Healthy with 3 known peers
etcd/1                       active    idle   2        10.132.163.135  2379/tcp      Healthy with 3 known peers
etcd/2                       active    idle   3        10.132.163.233  2379/tcp      Healthy with 3 known peers
kubeapi-load-balancer/0*     active    idle   4        10.132.163.33   443,6443/tcp  Loadbalancer ready.
kubernetes-control-plane/0   waiting   idle   5        10.132.163.119  6443/tcp      Waiting for 4 kube-system pods to start
  calico/3                   active    idle            10.132.163.119                Calico is active
  containerd/3               blocked   idle            10.132.163.119                containerd resource binary containerd-stress failed a version check
kubernetes-control-plane/1*  waiting   idle   6        10.132.163.146  6443/tcp      Waiting for 4 kube-system pods to start
  calico/4                   active    idle            10.132.163.146                Calico is active
  containerd/4               blocked   idle            10.132.163.146                containerd resource binary containerd-stress failed a version check
kubernetes-worker/0*         waiting   idle   7        10.132.163.121  80,443/tcp    Waiting for kubelet to start.
  calico/2                   active    idle            10.132.163.121                Calico is active
  containerd/2               blocked   idle            10.132.163.121                containerd resource binary containerd-stress failed a version check
kubernetes-worker/1          waiting   idle   8        10.132.163.243  80,443/tcp    Waiting for kubelet to start.
  calico/0*                  active    idle            10.132.163.243                Calico is active
  containerd/0*              blocked   idle            10.132.163.243                containerd resource binary containerd-stress failed a version check
kubernetes-worker/2          waiting   idle   9        10.132.163.140  80,443/tcp    Waiting for kubelet to start.
  calico/1                   active    idle            10.132.163.140                Calico is active
  containerd/1               blocked   idle            10.132.163.140                containerd resource binary containerd-stress failed a version check

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.132.163.17   juju-84dc78-0  [email protected]      Running
1        started  10.132.163.184  juju-84dc78-1  [email protected]      Running
2        started  10.132.163.135  juju-84dc78-2  [email protected]      Running
3        started  10.132.163.233  juju-84dc78-3  [email protected]      Running
4        started  10.132.163.33   juju-84dc78-4  [email protected]      Running
5        started  10.132.163.119  juju-84dc78-5  [email protected]      Running
6        started  10.132.163.146  juju-84dc78-6  [email protected]      Running
7        started  10.132.163.121  juju-84dc78-7  [email protected]      Running
8        started  10.132.163.243  juju-84dc78-8  [email protected]      Running
9        started  10.132.163.140  juju-84dc78-9  [email protected]      Running

kubernetes-control-plane keeps cycling between the messages 'Restarting snap.kubelet.daemon service' and 'Waiting for 4 kube-system pods to start'.
Likewise, containerd keeps cycling between 'Unpacking containerd resource' and 'containerd resource binary containerd-stress failed a version check'.

The following software was installed on the instance before the deployment:

NVIDIA GPU Driver:
https://us.download.nvidia.com/tesla/535.154.05/nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb
NVIDIA CUDA:
https://us.download.nvidia.com/tesla/535.154.05/nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb

I also tried both the 1.28/stable and 1.27/stable channels, but the symptoms were almost the same.
How can I resolve this problem?

@evilnick
Collaborator

Hi, sorry you are having an issue. It does look like containerd is getting stuck in a loop, preventing the nodes from coming up, which I'd guess could be caused by the GPU driver. The kubernetes-worker charm automatically downloads the required drivers, which may be causing the issue if they have been pre-installed.

Perhaps @kwmonroe may have some insights here

In the meantime, it may be worth setting containerd to ignore the GPU to confirm that this is the issue:

juju config containerd gpu_driver="none"

or trying again without pre-installing the drivers.
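
For reference, the current setting and the charm's recent activity can be inspected with standard Juju commands (the unit number below is just an example; adjust to your deployment):

juju config containerd gpu_driver                      # print the current value of the option
juju debug-log --replay --include unit-containerd-0    # charm log for a single containerd unit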

@iiot-architect
Author

iiot-architect commented Feb 15, 2024

Well, if I install without the GPU driver and CUDA, the process finishes normally.
However, the message 'without gpu support' is shown.
Having confirmed that, I'm now retrying the installation with the drivers.
Note: as a next step I plan to run an LLM on this Kubernetes cluster, so GPU support is essential.

ubuntu@ip-10-10-11-82:~$ juju status
Model  Controller           Cloud/Region         Version  SLA          Timestamp
k8s    localhost-localhost  localhost/localhost  3.4.0    unsupported  09:54:24Z

App                       Version  Status  Scale  Charm                     Channel      Rev  Exposed  Message
calico                    3.25.1   active      5  calico                    1.28/stable  101  no       Ready
containerd                1.6.8    active      5  containerd                1.28/stable   73  no       Container runtime available
easyrsa                   3.0.1    active      1  easyrsa                   1.28/stable   48  no       Certificate Authority connected.
etcd                      3.4.22   active      3  etcd                      1.28/stable  748  no       Healthy with 3 known peers
kubeapi-load-balancer     1.18.0   active      1  kubeapi-load-balancer     1.28/stable   84  yes      Loadbalancer ready.
kubernetes-control-plane  1.28.7   active      2  kubernetes-control-plane  1.28/stable  321  no       Kubernetes control-plane running.
kubernetes-worker         1.28.7   active      3  kubernetes-worker         1.28/stable  134  yes      Kubernetes worker running (without gpu support).

Unit                         Workload  Agent  Machine  Public address  Ports         Message
easyrsa/0*                   active    idle   0        10.32.96.191                  Certificate Authority connected.
etcd/0*                      active    idle   1        10.32.96.166    2379/tcp      Healthy with 3 known peers
etcd/1                       active    idle   2        10.32.96.78     2379/tcp      Healthy with 3 known peers
etcd/2                       active    idle   3        10.32.96.5      2379/tcp      Healthy with 3 known peers
kubeapi-load-balancer/0*     active    idle   4        10.32.96.141    443,6443/tcp  Loadbalancer ready.
kubernetes-control-plane/0*  active    idle   5        10.32.96.126    6443/tcp      Kubernetes control-plane running.
  calico/4                   active    idle            10.32.96.126                  Ready
  containerd/4               active    idle            10.32.96.126                  Container runtime available
kubernetes-control-plane/1   active    idle   6        10.32.96.24     6443/tcp      Kubernetes control-plane running.
  calico/3                   active    idle            10.32.96.24                   Ready
  containerd/3               active    idle            10.32.96.24                   Container runtime available
kubernetes-worker/0*         active    idle   7        10.32.96.187    80,443/tcp    Kubernetes worker running (without gpu support).
  calico/2                   active    idle            10.32.96.187                  Ready
  containerd/2               active    idle            10.32.96.187                  Container runtime available
kubernetes-worker/1          active    idle   8        10.32.96.97     80,443/tcp    Kubernetes worker running (without gpu support).
  calico/0*                  active    idle            10.32.96.97                   Ready
  containerd/0*              active    idle            10.32.96.97                   Container runtime available
kubernetes-worker/2          active    idle   9        10.32.96.169    80,443/tcp    Kubernetes worker running (without gpu support).
  calico/1                   active    idle            10.32.96.169                  Ready
  containerd/1                active    idle            10.32.96.169                  Container runtime available

Machine  State    Address       Inst id        Base          AZ  Message
0        started  10.32.96.191  juju-5e37ba-0  [email protected]      Running
1        started  10.32.96.166  juju-5e37ba-1  [email protected]      Running
2        started  10.32.96.78   juju-5e37ba-2  [email protected]      Running
3        started  10.32.96.5    juju-5e37ba-3  [email protected]      Running
4        started  10.32.96.141  juju-5e37ba-4  [email protected]      Running
5        started  10.32.96.126  juju-5e37ba-5  [email protected]      Running
6        started  10.32.96.24   juju-5e37ba-6  [email protected]      Running
7        started  10.32.96.187  juju-5e37ba-7  [email protected]      Running
8        started  10.32.96.97   juju-5e37ba-8  [email protected]      Running
9        started  10.32.96.169  juju-5e37ba-9  [email protected]      Running
ubuntu@ip-10-10-11-82:~$

@kwmonroe
Contributor

@iiot-architect can you provide some details on your instance? I just deployed a g5.xlarge and got:

ubuntu@ip-172-31-20-96:~$ nproc
4

ubuntu@ip-172-31-20-96:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       224Mi        14Gi       0.0Ki       834Mi        14Gi
Swap:             0B          0B          0B

ubuntu@ip-172-31-20-96:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       7.6G  2.0G  5.7G  26% /

The Charmed Kubernetes bundle is pretty heavyweight -- especially deployed to LXD. I doubt 4 cores and 16 GB of RAM will be enough, but I'm positive an 8 GB root filesystem won't be :)

Is it possible you've run out of disk space on your instance?
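
For anyone checking the same thing on a localhost/LXD deployment, the LXD storage pool can fill up even when the host root disk looks fine. A quick sketch (the pool name 'default' and the container name are assumptions based on the default setup and the status output above; adjust as needed):

lxc storage list                     # list configured LXD storage pools
lxc storage info default             # space used/total for the 'default' pool
lxc exec juju-84dc78-7 -- df -h /    # free space inside one of the worker containers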

@iiot-architect
Author

Dear kwmonroe,

No, disk space is not a problem: I allocated a 200 GB gp3 volume to the instance as its root storage.

@iiot-architect
Author

According to this official blog post, I think the NVIDIA driver and CUDA should be installed on the host in advance:

https://ubuntu.com/blog/nvidia-cuda-inside-a-lxd-container

@iiot-architect
Author

iiot-architect commented Feb 16, 2024

It seems that the containerd configuration isn't having any effect:

ubuntu@ip-10-10-8-132:~$ juju status
Model  Controller           Cloud/Region         Version  SLA          Timestamp
k8s    localhost-localhost  localhost/localhost  3.4.0    unsupported  02:13:35Z

App                       Version  Status   Scale  Charm                     Channel      Rev  Exposed  Message
calico                             waiting      5  calico                    1.28/stable  101  no       Configuring Calico
containerd                         blocked      5  containerd                1.28/stable   73  no       containerd resource binary containerd-stress failed a version check
easyrsa                   3.0.1    active       1  easyrsa                   1.28/stable   48  no       Certificate Authority connected.
etcd                      3.4.22   active       3  etcd                      1.28/stable  748  no       Healthy with 3 known peers
kubeapi-load-balancer     1.18.0   active       1  kubeapi-load-balancer     1.28/stable   84  yes      Loadbalancer ready.
kubernetes-control-plane  1.28.6   waiting      2  kubernetes-control-plane  1.28/stable  321  no       Waiting for 4 kube-system pods to start
kubernetes-worker         1.28.6   waiting      3  kubernetes-worker         1.28/stable  134  yes      Waiting for kubelet to start.

Unit                         Workload  Agent  Machine  Public address  Ports         Message
easyrsa/0*                   active    idle   0        10.215.33.158                 Certificate Authority connected.
etcd/0*                      active    idle   1        10.215.33.60    2379/tcp      Healthy with 3 known peers
etcd/1                       active    idle   2        10.215.33.190   2379/tcp      Healthy with 3 known peers
etcd/2                       active    idle   3        10.215.33.33    2379/tcp      Healthy with 3 known peers
kubeapi-load-balancer/0*     active    idle   4        10.215.33.103   443,6443/tcp  Loadbalancer ready.
kubernetes-control-plane/0*  waiting   idle   5        10.215.33.109   6443/tcp      Waiting for 4 kube-system pods to start
  calico/4                   waiting   idle            10.215.33.109                 Configuring Calico
  containerd/4               blocked   idle            10.215.33.109                 containerd resource binary containerd-stress failed a version check
kubernetes-control-plane/1   waiting   idle   6        10.215.33.156   6443/tcp      Waiting for 4 kube-system pods to start
  calico/3                   waiting   idle            10.215.33.156                 Configuring Calico
  containerd/3               blocked   idle            10.215.33.156                 containerd resource binary containerd-stress failed a version check
kubernetes-worker/0*         waiting   idle   7        10.215.33.97    80,443/tcp    Waiting for kubelet to start.
  calico/2                   waiting   idle            10.215.33.97                  Configuring Calico
  containerd/2               blocked   idle            10.215.33.97                  containerd resource binary containerd-stress failed a version check
kubernetes-worker/1          waiting   idle   8        10.215.33.20    80,443/tcp    Waiting for kubelet to start.
  calico/0*                  waiting   idle            10.215.33.20                  Configuring Calico
  containerd/0*              blocked   idle            10.215.33.20                  containerd resource binary containerd-stress failed a version check
kubernetes-worker/2          waiting   idle   9        10.215.33.96    80,443/tcp    Waiting for kubelet to start.
  calico/1                   waiting   idle            10.215.33.96                  Configuring Calico
  containerd/1               blocked   idle            10.215.33.96                  containerd resource binary containerd-stress failed a version check

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.215.33.158  juju-7d866d-0  [email protected]      Running
1        started  10.215.33.60   juju-7d866d-1  [email protected]      Running
2        started  10.215.33.190  juju-7d866d-2  [email protected]      Running
3        started  10.215.33.33   juju-7d866d-3  [email protected]      Running
4        started  10.215.33.103  juju-7d866d-4  [email protected]      Running
5        started  10.215.33.109  juju-7d866d-5  [email protected]      Running
6        started  10.215.33.156  juju-7d866d-6  [email protected]      Running
7        started  10.215.33.97   juju-7d866d-7  [email protected]      Running
8        started  10.215.33.20   juju-7d866d-8  [email protected]      Running
9        started  10.215.33.96   juju-7d866d-9  [email protected]      Running
ubuntu@ip-10-10-8-132:~$ juju config containerd gpu_driver="none"
WARNING the configuration setting "gpu_driver" already has the value "none"

@evilnick
Collaborator

evilnick commented Feb 16, 2024

According to this official blog post, I think the NVIDIA driver and CUDA should be installed on the host in advance:

https://ubuntu.com/blog/nvidia-cuda-inside-a-lxd-container

That blog post is 6 years old, so I'm not sure how much of it is still reliable.
If containerd is already set to "none", try setting it to "nvidia" instead. Though if it is set to "none" and still failing, then maybe the issue isn't the driver after all.

@iiot-architect
Author

Dear evilnick,

If containerd is already set to "none", try setting it to "nvidia" instead.

Well, it seems that this is irrelevant: I tried it, but the result was almost the same.
I also added a GPU device to each LXD container, but the result was the same as when the driver was installed in advance.

lxc config device add [Name of Lxd] gpu gpu

In addition, I changed the instance type from g5.xlarge to g4ad.2xlarge (still with the driver installed in advance), but the result was largely unchanged.
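
As an aside, when passing an NVIDIA GPU into an LXD container, the gpu device only exposes the device nodes; LXD also has an nvidia.runtime option that maps the host's NVIDIA userspace libraries into the container (it needs the NVIDIA container toolkit on the host). A sketch with a placeholder container name:

lxc config device add <container> gpu gpu       # pass the GPU device nodes through
lxc config set <container> nvidia.runtime=true  # expose the host NVIDIA userspace libraries
lxc restart <container>
lxc exec <container> -- nvidia-smi              # should list the GPU if pass-through worked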

@iiot-architect
Author

Dear kwmonroe,

Thanks for your help.
I tried again based on the new document you shared.

sudo apt update
sudo apt -y full-upgrade && sudo reboot -f
wget https://us.download.nvidia.com/tesla/535.154.05/nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb
sudo dpkg -i nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb
sudo cp /var/nvidia-driver-local-repo-ubuntu2204-535.154.05/nvidia-driver-local-91B8C5A2-keyring.gpg /usr/share/keyrings/
sudo apt install -y build-essential
wget https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda_12.3.2_545.23.08_linux.run
sudo sh cuda_12.3.2_545.23.08_linux.run --silent
echo 'export PATH=$PATH:/usr/local/cuda' >> ~/.bashrc
source ~/.bashrc
nvidia-smi
Document followed: https://deploy-preview-832--cdk-next.netlify.app/kubernetes/docs/install-local

Sure enough, the deployment process finished normally this time.
However, the workers still come up without GPU support:

ubuntu@ip-10-10-4-228:~$ juju status
Model  Controller           Cloud/Region         Version  SLA          Timestamp
ck8s   localhost-localhost  localhost/localhost  3.4.0    unsupported  09:23:26Z

App                       Version  Status  Scale  Charm                     Channel  Rev  Exposed  Message
calico                    3.25.1   active      5  calico                    stable   101  no       Ready
containerd                1.7.2    active      5  containerd                stable    73  no       Container runtime available
easyrsa                   3.0.1    active      1  easyrsa                   stable    48  no       Certificate Authority connected.
etcd                      3.4.22   active      3  etcd                      stable   748  no       Healthy with 3 known peers
kubeapi-load-balancer     1.18.0   active      1  kubeapi-load-balancer     stable    84  yes      Loadbalancer ready.
kubernetes-control-plane  1.28.7   active      2  kubernetes-control-plane  stable   321  no       Kubernetes control-plane running.
kubernetes-worker         1.28.7   active      3  kubernetes-worker         stable   134  yes      Kubernetes worker running (without gpu support).

I then added a GPU device to each worker's LXD container, but nothing changed:

lxc config device add juju-4e969a-7 gpu gpu
lxc config device add juju-4e969a-8 gpu gpu
lxc config device add juju-4e969a-9 gpu gpu
lxc restart juju-4e969a-7
lxc restart juju-4e969a-8
lxc restart juju-4e969a-9
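
A few checks that may help narrow down where GPU support is being lost (a sketch; assumes kubectl is configured against this cluster, and reuses the worker container name from above):

lxc exec juju-4e969a-7 -- ls -l /dev/nvidia*       # are the GPU device nodes visible inside the worker?
lxc exec juju-4e969a-7 -- nvidia-smi               # is the driver userspace usable inside the worker?
kubectl describe nodes | grep -i 'nvidia.com/gpu'  # does any node advertise GPU capacity?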
