Nodes stuck at NotReady after Upgrading from 1.27.4 to 1.28.2 #4833
Comments
You've provided a lot of info on Longhorn, but this is the RKE2 repo and you've not provided any information on the status of RKE2 itself. You've noted that some of the nodes are stuck at NotReady. Can you provide RKE2 logs from the nodes where the kubelet is not coming up?
rke2-log.tar.gz
The RKE2 log appears not to have uploaded. Can you make sure you get the full log?
@brandond edited and added the full logs
This looks like a duplicate of #4775. A workaround should be to either use the killall script to terminate the pods before upgrading, or disable the service, reboot, and then upgrade before enabling and starting the service.
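For reference, a rough sketch of those two workarounds, assuming the default install location of the bundled rke2-killall.sh script and a systemd-managed rke2-server unit:

```bash
# Workaround A: terminate all pods/containers before upgrading
# (rke2-killall.sh path assumed to be the standard install location)
/usr/local/bin/rke2-killall.sh

# Workaround B: disable the service, reboot, upgrade, then re-enable and start
systemctl disable rke2-server
reboot
# ...after the reboot, swap in the new rke2 binary and airgap images, then:
systemctl enable rke2-server
systemctl start rke2-server
```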
Environmental Info:
RKE2 Version:
Upgrading from:
rke2 version v1.27.4+rke2r1
go version go1.20.5 X:boringcrypto
to:
rke2 version v1.28.2+rke2r1
go version go1.20.8 X:boringcrypto
Node(s) CPU architecture, OS, and Version:
Linux manager-1 5.15.0-84-generic #93~20.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
Linux manager-2 5.15.0-83-generic #92~20.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
Linux manager-3 5.15.0-83-generic #92~20.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
3 HA server nodes without agents/worker nodes
Describe the bug:
On a 3-node RKE2 cluster running version 1.27.4, with workloads running on Longhorn storage version 1.5.1:
On each node separately, stop the rke2-server service, copy in the rke2-images.linux-amd64.tar.zst and rke2 binary for 1.28.2, and start the rke2-server service so the node runs 1.28.2.
Repeat for node 2, then node 3.
Most of the time, when an RWX share manager is scheduled on the node where rke2 is currently being restarted (node1), the share manager gets rescheduled onto a different healthy node (node2), but it either gets stuck in a 0/1 Running state or loops through Completed, Terminating, and Running repeatedly.
node1 is stuck at NotReady and rke2 doesn't come up.
Sometimes the coredns pod on node2 goes into a CrashLoopBackOff state with errors.
A similar error appears in the longhorn-manager pod on node2:
"E1004 12:45:07.452332 1 engine_controller.go:826] failed to update status for engine pvc-a9b091f5-99c6-489f-9b5c-6542cad5dfa1-e-a9a5e808: Timeout: request did not complete within requested timeout - context deadline exceeded
"
Not sure why there is a timeout; that is probably what is preventing rke2 from coming up.
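For context, a minimal sketch of the commands used to observe these symptoms; the namespaces, label selectors, and log paths below are assumptions based on a default RKE2 + Longhorn install:

```bash
# Node status after restarting rke2 on node1
kubectl get nodes -o wide

# coredns pods on node2 (CrashLoopBackOff); substitute the actual pod name
kubectl -n kube-system get pods -o wide | grep coredns
kubectl -n kube-system logs <coredns-pod-name>

# longhorn-manager and share-manager pods on node2
kubectl -n longhorn-system get pods -o wide
kubectl -n longhorn-system logs -l app=longhorn-manager --tail=100

# RKE2 and kubelet logs on the NotReady node
journalctl -u rke2-server --no-pager | tail -n 200
tail -n 200 /var/lib/rancher/rke2/agent/logs/kubelet.log
```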
Steps To Reproduce:
1- get rke2-images.linux-amd64.tar.zst & rke2.linux-amd64.tar.gz of version 1.27.4
2- extract and put the rke2 binary in /usr/local/bin/rke2 & the images in /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar.zst
3- start rke2 using systemctl start rke2-server
4- join the other 2 nodes using the token
5- deploy Longhorn using the standard manifests
6- deploy workloads on the 3 nodes that use RWX and RWO volumes
7- get rke2-images.linux-amd64.tar.zst & rke2.linux-amd64.tar.gz of version 1.28.2
8- stop the rke2 service
9- extract and put the rke2 binary in /usr/local/bin/rke2 & the images in /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar.zst (a command sketch for steps 8-10 follows this list)
10- start the rke2 service
11- repeat for nodes 2 then 3
12- observe one of the nodes stuck at NotReady
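Roughly the commands run on each node for steps 8-10, assuming the v1.28.2 release artifacts were already downloaded into the working directory (the bin/rke2 layout of the extracted tarball is an assumption):

```bash
# Stop the running service on the node being upgraded
systemctl stop rke2-server

# Drop in the v1.28.2 airgap images and binary
cp rke2-images.linux-amd64.tar.zst /var/lib/rancher/rke2/agent/images/
tar xzf rke2.linux-amd64.tar.gz
cp bin/rke2 /usr/local/bin/rke2

# Start the upgraded service and watch the node come back
systemctl start rke2-server
kubectl get nodes -w
```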
Expected behavior:
Share managers should be scheduled normally, etcd should be synced, and the node should be Ready and upgraded to 1.28.2.
Actual behavior:
The node became stuck in the NotReady state, and share managers were either in the Completed state or stuck at 0/1 Running, with timeout errors swamping longhorn-manager and/or coredns.
Additional context / logs:
manager2.log
share-manager.log