Environmental Info:
RKE2 Version:
rke2 version v1.31.1+rke2r1 (909d20d)
go version go1.22.6 X:boringcrypto
Node(s) CPU architecture, OS, and Version:
Rocky 9
Linux 5.14.0-427.24.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jul 8 17:47:19 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
Running in AWS with 3 servers and 7 agents. Using Cilium. Servers are 2-core, 4 GB RAM.
Describe the bug:
Our cluster is shut down overnight for cost savings and started back up each morning. Twice in the last month a server node has failed to start properly, and this has prevented some agent nodes from joining, despite the other 2 servers being marked as healthy. We were previously on v1.29.3+rke2r1 and never had this issue after running it for several months with the same shutdown behaviour.
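(For reference, a quick way to see which nodes have joined and whether the servers are marked Ready; this is only a sketch, assuming the default RKE2 kubeconfig and binary locations.)
$ export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
$ export PATH=$PATH:/var/lib/rancher/rke2/bin
$ kubectl get nodes -o wide                                 # Ready status of servers and agents
$ kubectl -n kube-system get pods -o wide | grep apiserver  # where the kube-apiserver static pods are reported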
Looking through the broken server (server0), kube-apiserver gets terminated during rke2-server startup, which causes numerous other issues. ps aux | grep kube-apiserver returns nothing on server0.
The kube-apiserver logs on server1 & server2 are filled with:
authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
Cilium is failing to start on server0, but with kube-apiserver down, I assume nothing will.
I have found this comment about load balancer health checks in #5557 (comment). We are not doing that (but will implement it), but I don't think it is the issue, as etcd appears healthy.
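(Roughly how etcd health can be checked directly on a server node; a sketch that assumes the default RKE2 etcd certificate paths and that etcdctl is installed, since RKE2 does not ship it by default.)
$ ETCD_TLS=/var/lib/rancher/rke2/server/tls/etcd
$ etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=$ETCD_TLS/server-ca.crt \
    --cert=$ETCD_TLS/server-client.crt \
    --key=$ETCD_TLS/server-client.key \
    endpoint health    # reports the local member as healthy/unhealthy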
Looking at kube-apiserver on server0:
/var/log/containers/kube-apiserver-k8s-server0.DOMAIN_kube-system_kube-apiserver-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.log -> /var/log/pods/kube-system_kube-apiserver-k8s-server0.DOMAIN_80bcfddfbe1c3d0b4d90fe8d28b7a51c/kube-apiserver/22.log (attached as k8s-server0-not-ready-kube-apiserver-pod.log)
The kube-apiserver on server0 is container a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171. In /var/log/messages I can see:
Nov 7 06:02:02 k8s-server0 systemd[1]: Started libcontainer container a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.
Nov 7 06:02:26 k8s-server0 systemd[1]: run-containerd-runc-k8s.io-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171-runc.choiNb.mount: Deactivated successfully.
Nov 7 06:02:31 k8s-server0 systemd[1]: run-containerd-runc-k8s.io-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171-runc.BKojHd.mount: Deactivated successfully.
Nov 7 06:02:42 k8s-server0 systemd[1]: cri-containerd-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.scope: Deactivated successfully.
Nov 7 06:02:42 k8s-server0 systemd[1]: cri-containerd-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.scope: Consumed 9.180s CPU time.
Weirdly, crictl reports it as running on server0:
$ crictl ps --name kube-apiserver
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
a4d23bea00f67 1b44d478ec444 9 hours ago Running kube-apiserver 22 093e8959f7b4d kube-apiserver-k8s-server0.DOMAIN
$ crictl inspect a4d23bea00f67 | jq .info.pid
2372
$ ps aux | grep 2372
root 200950 0.0 0.0 6408 2176 pts/0 S+ 15:22 0:00 grep --color=auto 2372
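So the PID that CRI recorded for the container no longer exists. A small sketch to make that comparison explicit (same container ID and jq path as above):
$ pid=$(crictl inspect a4d23bea00f67 | jq -r .info.pid)
$ test -d "/proc/$pid" && echo "PID $pid is alive" || echo "PID $pid is gone; CRI state is stale"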
The containerd logs on server0 say it has been killed, but then something tries to exec into it:
time="2024-11-07T06:02:41.609408926Z" level=info msg="Kill container \"a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171\""
time="2024-11-07T06:02:41.987115383Z" level=error msg="ExecSync for \"a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171\" failed" error="failed to exec in container: failed to start exec \"36daae2d0ca0aa2b27f83e18036018231cccee0bdfb824c5d11e9d9ed4020f3c\": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown"
Steps To Reproduce:
Shut the cluster down overnight and start all nodes back up in the morning. The failure is intermittent; it has happened twice in the last month.
Expected behavior:
The cluster starts up successfully.
The cluster is not severely degraded by 1 out of 3 server nodes having issues.
Actual behavior:
The cluster does not always start up successfully.
The cluster is severely degraded by 1 out of 3 server nodes having issues.
Additional context / logs:
etcd appears healthy. On server0, etcd is running as PID 2456.
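(A sketch of the kind of check this refers to, using the PID above; ss comes from iproute2 and 2379/2380 are etcd's standard client/peer ports.)
$ ps -p 2456 -o pid,etime,cmd          # confirm the etcd process is up and how long it has been running
$ ss -tlnp | grep -E '2379|2380'       # confirm etcd is listening on its client/peer ports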
I cannot stop the zombie kube-apiserver container (the one shown in the crictl output above).
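For reference, the standard cleanup one would normally try against a stale container looks like the following (a sketch with stock crictl subcommands; as noted, we have not actually been able to stop this container, so these may simply fail in this state):
$ crictl stop a4d23bea00f67        # ask the runtime to stop the container
$ crictl rm --force a4d23bea00f67  # forcibly remove it from CRI's view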
Running systemctl restart rke2-server on server0 fixes it and all nodes immediately join, but we would like to understand why this is happening.
Log attachments:
k8s-agent0-not-ready-rke2-agent.log
k8s-server0-not-ready-containerd.log
k8s-server0-not-ready-rke2-server.log
k8s-server1-rke2-server.log
k8s-server2-rke2-server.log
k8s-agent6-rke2-agent.log
k8s-server0-not-ready-kube-apiserver-pod.log