
Stopping rke2-server service on ControlPlane node causes other nodes to go NotReady #5614

Closed
inichols opened this issue Mar 14, 2024 · 3 comments
Labels: kind/bug, priority/high


inichols commented Mar 14, 2024

Environmental Info:
RKE2 Version: v1.27.11+rke2r1

Node(s) CPU architecture, OS, and Version:

6.2.0-36-generic #37-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct 4 10:14:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

Split Role Cluster
3 etcd, 3 control plane, 3 worker

Describe the bug:

Stopping rke2-server on a control plane node can cause other nodes to go into a NotReady state

Steps To Reproduce:

  • Install RKE2
  • Configure a split-role cluster
  • Stop rke2-server on one of the control plane nodes
  • Check the status of the nodes after a few seconds, to allow connections to fail
  • If the other nodes did not change, start the service back up and try another node (example commands below)
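For reference, a minimal sketch of the commands behind these steps, assuming a systemd-managed RKE2 install and the default RKE2 kubeconfig path:

# On one of the control plane nodes, stop the supervisor:
$ sudo systemctl stop rke2-server

# From a different server node, watch node status while connections fail
# (the kubeconfig path below is the RKE2 default):
$ export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
$ kubectl get nodes -o wide

# If no other node changed status, bring the service back up and repeat on another node:
$ sudo systemctl start rke2-server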

Expected behavior:

Expected that the only node affected would be the one where the service was stopped

Actual behavior:

Other nodes go NotReady because a single node's rke2-server process was shut down

Additional context / logs:

Logs from when rke2-server was shut down

> Mar 14 01:12:46 cilium-01 rke2[6995]: time="2024-03-14T01:12:46Z" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
> Mar 14 01:12:48 cilium-01 rke2[6995]: time="2024-03-14T01:12:48Z" level=debug msg="Wrote ping"
> Mar 14 01:12:50 cilium-01 rke2[6995]: time="2024-03-14T01:12:50Z" level=debug msg="Wrote ping"
> Mar 14 01:12:51 cilium-01 rke2[6995]: time="2024-03-14T01:12:51Z" level=info msg="Connecting to proxy" url="wss://192.168.86.56:9345/v1-rke2/connect"
> Mar 14 01:12:51 cilium-01 rke2[6995]: time="2024-03-14T01:12:51Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 192.168.86.56:9345: connect: connection re>
> Mar 14 01:12:51 cilium-01 rke2[6995]: time="2024-03-14T01:12:51Z" level=error msg="Remotedialer proxy error" error="dial tcp 192.168.86.56:9345: connect: connection refused"
> Mar 14 01:12:53 cilium-01 rke2[6995]: time="2024-03-14T01:12:53Z" level=debug msg="Wrote ping"
> Mar 14 01:12:55 cilium-01 rke2[6995]: time="2024-03-14T01:12:55Z" level=debug msg="Wrote ping"
> Mar 14 01:12:56 cilium-01 rke2[6995]: time="2024-03-14T01:12:56Z" level=info msg="Connecting to proxy" url="wss://192.168.86.56:9345/v1-rke2/connect"
> Mar 14 01:12:56 cilium-01 rke2[6995]: time="2024-03-14T01:12:56Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 192.168.86.56:9345: connect: connection re>
> Mar 14 01:12:56 cilium-01 rke2[6995]: time="2024-03-14T01:12:56Z" level=error msg="Remotedialer proxy error" error="dial tcp 192.168.86.56:9345: connect: connection refused"
> Mar 14 01:12:58 cilium-01 rke2[6995]: time="2024-03-14T01:12:58Z" level=debug msg="Wrote ping"
> Mar 14 01:13:00 cilium-01 rke2[6995]: time="2024-03-14T01:13:00Z" level=debug msg="Wrote ping"
> Mar 14 01:13:01 cilium-01 rke2[6995]: time="2024-03-14T01:13:01Z" level=info msg="Connecting to proxy" url="wss://192.168.86.56:9345/v1-rke2/connect"
> Mar 14 01:13:01 cilium-01 rke2[6995]: time="2024-03-14T01:13:01Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 192.168.86.56:9345: connect: connection re>
> Mar 14 01:13:01 cilium-01 rke2[6995]: time="2024-03-14T01:13:01Z" level=error msg="Remotedialer proxy error" error="dial tcp 192.168.86.56:9345: connect: connection refused"

brandond commented Mar 14, 2024

On control-plane-only nodes, the apiserver connects to etcd through its own client loadbalancer, the same way agents connect to the apiserver. That loadbalancer runs in the rke2 supervisor process, so when you stop the service the apiserver loses its connection to etcd - but the apiserver itself doesn't stop running. The agents are all still connected to it, and their client-side loadbalancer doesn't fail over, because nothing has actually gone down; they're just stuck talking to an apiserver that can't serve anything because it no longer has a functioning datastore. The other apiservers then see that the kubelets and other clients that were talking to that apiserver have gone silent, which is why those nodes get reported as NotReady.
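To make the "doesn't stop running" part concrete, here is one way to check it; the paths are the usual RKE2 defaults, so adjust them if your install puts crictl or the containerd socket elsewhere:

# On the control-plane-only node where the service was stopped:
$ sudo systemctl stop rke2-server
$ sudo /var/lib/rancher/rke2/bin/crictl \
    --runtime-endpoint unix:///run/k3s/containerd/containerd.sock \
    ps --name kube-apiserver
# The kube-apiserver container is still listed even though the supervisor
# (and with it the etcd client loadbalancer) is gone.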

We'll probably need to make some improvements to the apiserver and etcd loadbalancers so that they properly handle this and proactively fail over traffic away from the node in question. We can also look at trying to improve the etcd loadbalancer so that the control-plane-only node's connection isn't disrupted when the supervisor is stopped, but that would probably be a more invasive change.

@brandond brandond self-assigned this Mar 14, 2024
@brandond brandond added priority/high kind/bug Something isn't working labels Mar 14, 2024
@brandond brandond added this to the v1.29.4+rke2r1 milestone Mar 14, 2024
@brandond brandond changed the title Stopping rke-server on ControlPlane node causes other nodes to go NotReady Stopping rke2-server service on ControlPlane node causes other nodes to go NotReady Mar 19, 2024

tmmorin commented Apr 8, 2024

FTR, I observed what I think is a variant of this issue on RKE2 v1.28.8+rke2r1: the local proxy on 9345 was dispatching to all RKE2 servers, including one that had been stopped, or was being installed and not fully ready yet.


fmoral2 commented Apr 15, 2024

Validated on Version:

rke2 version v1.29.3+dev.9a8df95e (e9ac287a8efdd606f41849d96744d4679449b26f)



Environment Details

Infrastructure: Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
SUSE Linux Enterprise Server 15 SP4

Cluster Configuration:
Split roles:

  • 2 cp only
  • 2 etcd
  • 1 worker

Steps to validate the fix

  1. Create a split-role cluster
  2. Stop rke2-server on one of the control-plane-only nodes
  3. Check the other nodes
  4. Validate that no other node is NotReady or inactive
  5. Validate pods (commands sketched below)
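A minimal sketch of the commands behind steps 2-5, assuming kubectl is pointed at the cluster from one of the surviving server nodes:

# Step 2: on one of the control-plane-only nodes
$ sudo systemctl stop rke2-server

# Steps 3-5: from another server node, every other node should stay Ready
# and workloads should keep running
$ kubectl get nodes
$ kubectl get pods -A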

Reproduction Issue:

 
$ rke2 -v
rke2 version v1.27.11+rke2r1 (6665618680112568f79b1f5992aecf4655e3cf8b)
go version go1.21.7 X:boringcrypto

 
On a CP-only node:
$ sudo systemctl stop rke2-server


On another node:

$ k get nodes -o wide
NAME                                          STATUS     ROLES                  AGE   VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                              KERNEL-VERSION              CONTAINER-RUNTIME
ip-172-31-1-96.us-east-2.compute.internal     Ready      <none>                 50m   v1.27.11+rke2r1   172.31.1.96     3.128.30.47     SUSE Linux Enterprise Server 15 SP4   5.14.21-150400.22-default   containerd://1.7.11-k3s2
ip-172-31-10-186.us-east-2.compute.internal   NotReady   control-plane,master   59m   v1.27.11+rke2r1   172.31.10.186   3.144.224.153   SUSE Linux Enterprise Server 15 SP4   5.14.21-150400.22-default   containerd://1.7.11-k3s2
ip-172-31-10-201.us-east-2.compute.internal   Ready      control-plane,master   59m   v1.27.11+rke2r1   172.31.10.201   18.224.62.92    SUSE Linux Enterprise Server 15 SP4   5.14.21-150400.22-default   containerd://1.7.11-k3s2
ip-172-31-10-59.us-east-2.compute.internal    NotReady   etcd   
 


Validation Results:

       
       
- Tried from 2 different control plane nodes; in each case only the node where rke2-server was stopped went NotReady.

CP-2:
$ k get nodes
NAME                                          STATUS     ROLES                       AGE   VERSION
ip-172-31-0-110.us-east-2.compute.internal    Ready      control-plane,master        18m   v1.29.3+rke2r1
ip-172-31-0-156.us-east-2.compute.internal    NotReady   control-plane,master        18m   v1.29.3+rke2r1
ip-172-31-13-128.us-east-2.compute.internal   Ready      etcd                        18m   v1.29.3+rke2r1
ip-172-31-13-65.us-east-2.compute.internal    Ready      control-plane,etcd,master   23m   v1.29.3+rke2r1
ip-172-31-2-129.us-east-2.compute.internal    Ready      <none>                      15m   v1.29.3+rke2r1
ip-172-31-7-162.us-east-2.compute.internal    Ready      etcd                        18m   v1.29.3+rke2r1
ip-172-31-7-255.us-east-2.compute.internal    Ready      <none>                      14m   v1.29.3+rke2r1
ip-172-31-8-35.us-east-2.compute.internal     Ready      <none>                      14m   v1.29.3+rke2r1


CP-1:
$ k get nodes
NAME                                          STATUS     ROLES                       AGE   VERSION
ip-172-31-0-110.us-east-2.compute.internal    NotReady   control-plane,master        20m   v1.29.3+rke2r1
ip-172-31-0-156.us-east-2.compute.internal    Ready      control-plane,master        20m   v1.29.3+rke2r1
ip-172-31-13-128.us-east-2.compute.internal   Ready      etcd                        20m   v1.29.3+rke2r1
ip-172-31-13-65.us-east-2.compute.internal    Ready      control-plane,etcd,master   25m   v1.29.3+rke2r1
ip-172-31-2-129.us-east-2.compute.internal    Ready      <none>                      17m   v1.29.3+rke2r1
ip-172-31-7-162.us-east-2.compute.internal    Ready      etcd                        21m   v1.29.3+rke2r1
ip-172-31-7-255.us-east-2.compute.internal    Ready      <none>                      17m   v1.29.3+rke2r1
ip-172-31-8-35.us-east-2.compute.internal     Ready      <none>                      17m   v1.29.3+rke2r1


