Stopping rke2-server service on ControlPlane node causes other nodes to go NotReady #5614
On control-plane-only nodes, the apiserver connects to etcd through its own client loadbalancer, the same way agents connect to the apiserver. That loadbalancer runs in the rke2 supervisor process, and when you terminate the supervisor, the apiserver loses its connection to etcd but doesn’t stop running. So the agents are all still connected to it, and their client-side loadbalancer doesn’t fail over, because nothing has gone down. They’re just stuck talking to an apiserver that can’t serve anything because it has no functioning datastore. Meanwhile, the other apiservers all see that the kubelets and other clients that were talking to that apiserver just went silent.

We'll probably need to make some improvements to the apiserver and etcd loadbalancers so that they properly handle this and proactively fail over traffic away from the node in question. We can also look at trying to improve the etcd loadbalancer so that the control-plane-only node's connection isn't disrupted when the supervisor is stopped, but that would probably be a more invasive change.
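To make the failure mode above concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not RKE2's actual loadbalancer code: `ApiServer`, `ConnectionOnlyBalancer`, `HealthCheckingBalancer`, and `stop_supervisor` are hypothetical names invented for illustration. It contrasts a client-side loadbalancer that fails over only when a connection breaks with one that also consults an application-level health signal (something along the lines of the apiserver's readiness endpoint, which is an assumption about how a fix could look, not a description of what was implemented).

```python
from dataclasses import dataclass


@dataclass
class ApiServer:
    """Hypothetical stand-in for an apiserver on a control-plane-only node."""
    name: str
    listening: bool = True       # the process still accepts connections
    etcd_reachable: bool = True  # its path to etcd (via the supervisor) works

    def can_connect(self) -> bool:
        return self.listening

    def healthy(self) -> bool:
        # Stand-in for an application-level health check.
        return self.listening and self.etcd_reachable


def stop_supervisor(server: ApiServer) -> None:
    """Model stopping rke2-server: the apiserver keeps running but loses etcd."""
    server.etcd_reachable = False


class ConnectionOnlyBalancer:
    """Sticks to its current backend until the connection itself fails."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.current = self.servers[0]

    def usable(self, server: ApiServer) -> bool:
        return server.can_connect()

    def pick(self) -> ApiServer:
        if not self.usable(self.current):
            # Fail over to the first usable backend; keep the current one if none.
            self.current = next(
                (s for s in self.servers if self.usable(s)), self.current
            )
        return self.current


class HealthCheckingBalancer(ConnectionOnlyBalancer):
    """Also fails over when the backend is reachable but cannot serve."""

    def usable(self, server: ApiServer) -> bool:
        return server.healthy()


servers = [ApiServer("cp-1"), ApiServer("cp-2"), ApiServer("cp-3")]
sticky = ConnectionOnlyBalancer(servers)
probing = HealthCheckingBalancer(servers)

stop_supervisor(servers[0])

# The connection to cp-1 never broke, so the sticky balancer stays on it.
print(sticky.pick().name, sticky.pick().healthy())  # cp-1 False
# The health-aware balancer notices cp-1 cannot serve and moves on.
print(probing.pick().name)  # cp-2
```

In this model the agents behind `ConnectionOnlyBalancer` behave like the kubelets described above: they stay pinned to an apiserver that is up but has no datastore, which is why they appear silent to the rest of the cluster.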
FTR, I observed what I think is a variant of this issue on RKE2 v1.28.8+rke2r1: the local proxy on 9345 dispatches to all RKE2 servers, including ones that are stopped, or that are still being installed and not fully ready yet.
Validated on Version:
rke2 version v1.29.3+dev.9a8df95e (e9ac287a8efdd606f41849d96744d4679449b26f)
Environment Details
Infrastructure
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
Steps to validate the fix
Reproduction Issue:
Validation Results:
Environmental Info:
RKE2 Version: v1.27.11+rke2r1
Node(s) CPU architecture, OS, and Version:
6.2.0-36-generic #37-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct 4 10:14:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
Split Role Cluster
3 etcd, 3 control plane, 3 worker
Describe the bug:
Stopping rke2-server on a control plane node can cause other nodes to go into a NotReady state
Steps To Reproduce:
Expected behavior:
Only the node where the service was stopped should be affected.
Actual behavior:
Other nodes go down because one node's rke2-server process was shut down.
Additional context / logs:
Logs from when rke2-server was shut down