
Nodes stuck at NotReady after Upgrading from 1.27.4 to 1.28.2 #4833

Closed
michaeldmitry opened this issue Oct 4, 2023 · 5 comments

@michaeldmitry

Environmental Info:
RKE2 Version:

rke2 version v1.27.4+rke2r1
go version go1.20.5 X:boringcrypto

rke2 version v1.28.2+rke2r1
go version go1.20.8 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Linux manager-1 5.15.0-84-generic #93~20.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
Linux manager-2 5.15.0-83-generic #92~20.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
Linux manager-3 5.15.0-83-generic #92~20.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

3 HA server nodes without agents/worker nodes

Describe the bug:

On a 3-node RKE2 cluster running v1.27.4, with workloads running on Longhorn storage v1.5.1, we upgrade each node in turn: stop the rke2 server, copy in the rke2-images.linux-amd64.tar.zst and the rke2 binary for 1.28.2, and start the rke2 server again so the node runs 1.28.2.
This is then repeated for nodes 2 and 3.
Most of the time, when an RWX share-manager pod is scheduled on the node whose rke2 is currently being restarted (node1), the share manager gets rescheduled onto a different healthy node (node2), but it either gets stuck in 0/1 Running or goes into a loop of Completed, then Terminating, then Running, repeating.
node1 is stuck at NotReady and rke2 does not come up.
Sometimes the coredns pod on node2 goes into a CrashLoopBackOff state with errors (screenshot attached).
A similar error appears in the longhorn-manager pod on node2:
"E1004 12:45:07.452332 1 engine_controller.go:826] failed to update status for engine pvc-a9b091f5-99c6-489f-9b5c-6542cad5dfa1-e-a9a5e808: Timeout: request did not complete within requested timeout - context deadline exceeded"

Not sure why there is a timeout; that is probably what is preventing rke2 from coming up.
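
For reference, a minimal way to observe these symptoms from another node (assuming kubectl is pointed at the cluster and Longhorn is installed in the default longhorn-system namespace; pod names are placeholders and may differ per Longhorn version):

```sh
# Watch node readiness while a node's rke2-server is being restarted
kubectl get nodes -o wide

# Check where the RWX share-manager pods land and what state they are in
kubectl -n longhorn-system get pods | grep share-manager

# Tail coredns and longhorn-manager for the timeout errors quoted above
kubectl -n kube-system get pods | grep coredns
kubectl -n kube-system logs <coredns-pod> --tail=50
kubectl -n longhorn-system logs <longhorn-manager-pod> --tail=50
```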

Steps To Reproduce:

  • Installed RKE2:
    1- Get rke2-images.linux-amd64.tar.zst & rke2.linux-amd64.tar.gz of version 1.27.4
    2- Extract and put the rke2 binary in /usr/local/bin/rke2 & the images in /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar.zst
    3- Start rke2 using systemctl start rke2-server
    4- Join the other 2 nodes using the token
    5- Deploy Longhorn using the standard manifests
    6- Deploy workloads on the 3 nodes that use RWX and RWO volumes
    7- Get rke2-images.linux-amd64.tar.zst & rke2.linux-amd64.tar.gz of version 1.28.2
    8- Stop the rke2 service
    9- Extract and put the rke2 binary in /usr/local/bin/rke2 & the images in /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar.zst (see the sketch after this list)
    10- Start the rke2 service
    11- Repeat for nodes 2 and 3
    12- Observe one of the nodes stuck at NotReady
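
A rough sketch of the per-node in-place upgrade (steps 7-10 above), assuming a tarball install with the default paths; the exact layout of the release tarball is an assumption here:

```sh
# Run on each server node in turn
systemctl stop rke2-server

# Stage the 1.28.2 artifacts (assumes the tarball ships the binary under bin/)
mkdir -p /tmp/rke2-1.28.2
tar xzf rke2.linux-amd64.tar.gz -C /tmp/rke2-1.28.2
cp /tmp/rke2-1.28.2/bin/rke2 /usr/local/bin/rke2
cp rke2-images.linux-amd64.tar.zst /var/lib/rancher/rke2/agent/images/

systemctl start rke2-server

# Wait for the node to report Ready before moving on to the next node
kubectl get nodes -w
```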
Expected behavior:

Share managers should be rescheduled normally, etcd should stay in sync, and each node should become Ready after the upgrade to 1.28.2.
Actual behavior:

The node became stuck in the NotReady state, and the share managers were either in a Completed state or stuck at 0/1 Running, with timeout errors swamping longhorn-manager and/or coredns.
Additional context / logs:

manager2.log
share-manager.log


brandond commented Oct 4, 2023

You've provided a lot of info on longhorn, but this is the RKE2 repo and you've not provided any information on the status of RKE2 itself. You've noted that some of the nodes are stuck at NotReady. Can you provide RKE2 logs from the nodes where the kubelet is not coming up?


michaeldmitry commented Oct 4, 2023

rke2-log.tar.gz
rke2-status.log
Missed those, sorry.


brandond commented Oct 4, 2023

The RKE2 log appears not to have uploaded?

Can you make sure you get the full log? journalctl -u rke2-server --no-pager >rke2.log
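
The kubelet also writes its own log under the RKE2 data dir, which can help when a node is stuck NotReady (default path, assuming no custom data-dir is configured):

```sh
# Capture the rke2-server journal plus the kubelet log from the stuck node
journalctl -u rke2-server --no-pager > rke2.log
cp /var/lib/rancher/rke2/agent/logs/kubelet.log kubelet.log
```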

@michaeldmitry

@brandond Edited the comment above and added the full logs.
Let me know if other logs are needed as well, thanks!
By the way, the logs show a trial upgrade from 1.27.5 to 1.28.2, which results in exactly the same issue as upgrading from 1.27.4 to 1.28.2.


brandond commented Oct 4, 2023

This looks like a duplicate of #4775.

A workaround should be to either use the killall script to terminate the pods before upgrading; or disable the service, reboot, and then upgrade before enabling and starting the service.
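
A rough sketch of those two options, assuming a tarball install where rke2-killall.sh was installed alongside the rke2 binary in /usr/local/bin (the release tarball ships it under bin/):

```sh
# Option 1: terminate all pods/containers before upgrading in place
rke2-killall.sh
# ...replace the rke2 binary and images as in the steps above, then:
systemctl start rke2-server

# Option 2: disable the service, reboot, upgrade, then re-enable
systemctl disable rke2-server
reboot
# ...after the reboot, replace the rke2 binary and images, then:
systemctl enable --now rke2-server
```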

brandond closed this as completed Oct 4, 2023