
Nodes stuck at NotReady after Upgrading from 1.27.4 to 1.28.2 #4833

Closed
michaeldmitry opened this issue Oct 4, 2023 · 5 comments

@michaeldmitry

Environmental Info:
RKE2 Version:

rke2 version v1.27.4+rke2r1
go version go1.20.5 X:boringcrypto

rke2 version v1.28.2+rke2r1
go version go1.20.8 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Linux manager-1 5.15.0-84-generic #93~20.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
Linux manager-2 5.15.0-83-generic #92~20.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
Linux manager-3 5.15.0-83-generic #92~20.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

3 HA server nodes without agents/worker nodes

Describe the bug:

On a 3-node RKE2 cluster running v1.27.4, with workloads running on Longhorn storage v1.5.1, we upgrade each node in turn: stop the rke2 server, copy in the rke2-images.linux-amd64.tar.zst and the rke2 binary for 1.28.2, and start the rke2 server again so the node runs 1.28.2.
This is then repeated for nodes 2 and 3.
Most of the time, when an RWX share-manager pod is scheduled on the node whose rke2 is currently being restarted (node1), the share manager gets rescheduled onto a different healthy node (node2), but it either gets stuck in 0/1 Running or goes into a loop of Completed, then Terminating, then Running, repeating.
node1 is stuck at NotReady and rke2 does not come up.
Sometimes the coredns pod on node2 goes into a CrashLoopBackOff state with errors (screenshot attached).
A similar error appears in the longhorn-manager pod on node2:
"E1004 12:45:07.452332 1 engine_controller.go:826] failed to update status for engine pvc-a9b091f5-99c6-489f-9b5c-6542cad5dfa1-e-a9a5e808: Timeout: request did not complete within requested timeout - context deadline exceeded"

Not sure why there is a timeout; that is probably what is preventing rke2 from coming up.
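
For reference, a minimal way to observe these symptoms from another node (assuming kubectl is pointed at the cluster and Longhorn is installed in the default longhorn-system namespace; pod names are placeholders and may differ per Longhorn version):

```sh
# Watch node readiness while a node's rke2-server is being restarted
kubectl get nodes -o wide

# Check where the RWX share-manager pods land and what state they are in
kubectl -n longhorn-system get pods | grep share-manager

# Tail coredns and longhorn-manager for the timeout errors quoted above
kubectl -n kube-system get pods | grep coredns
kubectl -n kube-system logs <coredns-pod> --tail=50
kubectl -n longhorn-system logs <longhorn-manager-pod> --tail=50
```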

Steps To Reproduce:

  • Installed RKE2:
    1- Get rke2-images.linux-amd64.tar.zst & rke2.linux-amd64.tar.gz of version 1.27.4
    2- Extract and put the rke2 binary in /usr/local/bin/rke2 & the images in /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar.zst
    3- Start rke2 using systemctl start rke2-server
    4- Join the other 2 nodes using the token
    5- Deploy Longhorn using the standard manifests
    6- Deploy workloads on the 3 nodes that use RWX and RWO volumes
    7- Get rke2-images.linux-amd64.tar.zst & rke2.linux-amd64.tar.gz of version 1.28.2
    8- Stop the rke2 service
    9- Extract and put the rke2 binary in /usr/local/bin/rke2 & the images in /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar.zst (see the sketch after this list)
    10- Start the rke2 service
    11- Repeat for nodes 2 and 3
    12- Observe one of the nodes stuck at NotReady
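
A rough sketch of the per-node in-place upgrade (steps 7-10 above), assuming a tarball install with the default paths; the exact layout of the release tarball is an assumption here:

```sh
# Run on each server node in turn
systemctl stop rke2-server

# Stage the 1.28.2 artifacts (assumes the tarball ships the binary under bin/)
mkdir -p /tmp/rke2-1.28.2
tar xzf rke2.linux-amd64.tar.gz -C /tmp/rke2-1.28.2
cp /tmp/rke2-1.28.2/bin/rke2 /usr/local/bin/rke2
cp rke2-images.linux-amd64.tar.zst /var/lib/rancher/rke2/agent/images/

systemctl start rke2-server

# Wait for the node to report Ready before moving on to the next node
kubectl get nodes -w
```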
Expected behavior:

Share managers should be rescheduled normally, etcd should stay in sync, and each node should become Ready after the upgrade to 1.28.2.
Actual behavior:

The node became stuck in the NotReady state, and the share managers were either in a Completed state or stuck at 0/1 Running, with timeout errors swamping longhorn-manager and/or coredns.
Additional context / logs:

manager2.log
share-manager.log


brandond commented Oct 4, 2023

You've provided a lot of info on longhorn, but this is the RKE2 repo and you've not provided any information on the status of RKE2 itself. You've noted that some of the nodes are stuck at NotReady. Can you provide RKE2 logs from the nodes where the kubelet is not coming up?


michaeldmitry commented Oct 4, 2023

rke2-log.tar.gz
rke2-status.log
Missed those, sorry.


brandond commented Oct 4, 2023

The RKE2 log appears not to have uploaded?

Can you make sure you get the full log? journalctl -u rke2-server --no-pager >rke2.log
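
The kubelet also writes its own log under the RKE2 data dir, which can help when a node is stuck NotReady (default path, assuming no custom data-dir is configured):

```sh
# Capture the rke2-server journal plus the kubelet log from the stuck node
journalctl -u rke2-server --no-pager > rke2.log
cp /var/lib/rancher/rke2/agent/logs/kubelet.log kubelet.log
```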

@michaeldmitry

@brandond Edited the comment above and added the full logs.
Let me know if other logs are needed as well, thanks!
By the way, the logs show a trial upgrade from 1.27.5 to 1.28.2, which results in exactly the same issue as upgrading from 1.27.4 to 1.28.2.


brandond commented Oct 4, 2023

This looks like a duplicate of #4775.

A workaround should be to either use the killall script to terminate the pods before upgrading; or disable the service, reboot, and then upgrade before enabling and starting the service.
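
A rough sketch of those two options, assuming a tarball install where rke2-killall.sh was installed alongside the rke2 binary in /usr/local/bin (the release tarball ships it under bin/):

```sh
# Option 1: terminate all pods/containers before upgrading in place
rke2-killall.sh
# ...replace the rke2 binary and images as in the steps above, then:
systemctl start rke2-server

# Option 2: disable the service, reboot, upgrade, then re-enable
systemctl disable rke2-server
reboot
# ...after the reboot, replace the rke2 binary and images, then:
systemctl enable --now rke2-server
```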

brandond closed this as completed Oct 4, 2023