
Stopping rke2-server service on ControlPlane node causes other nodes to go NotReady #5614

Closed
inichols opened this issue Mar 14, 2024 · 3 comments
Labels: kind/bug, priority/high


inichols commented Mar 14, 2024

Environmental Info:
RKE2 Version: v1.27.11+rke2r1

Node(s) CPU architecture, OS, and Version:

6.2.0-36-generic #37-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct 4 10:14:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

Split Role Cluster
3 etcd, 3 control plane, 3 worker

Describe the bug:

Stopping rke2-server on a control plane node can cause other nodes to go into a NotReady state

Steps To Reproduce:

  • Install RKE2
  • Configure a split-role cluster
  • Stop rke2-server on one of the control plane nodes
  • Check the status of the nodes after a few seconds, to allow connections to fail
  • If the other nodes did not change, start the service back up and try another node (example commands below)
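For reference, a minimal sketch of the commands behind these steps, assuming a systemd-managed RKE2 install and the default RKE2 kubeconfig path:

# On one of the control plane nodes, stop the supervisor:
$ sudo systemctl stop rke2-server

# From a different server node, watch node status while connections fail
# (the kubeconfig path below is the RKE2 default):
$ export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
$ kubectl get nodes -o wide

# If no other node changed status, bring the service back up and repeat on another node:
$ sudo systemctl start rke2-server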

Expected behavior:

Expected that the only node affected would be the one where the service was stopped

Actual behavior:

Other nodes go NotReady because a single node's rke2-server process was shut down

Additional context / logs:

Logs from when rke2-server was shut down

> Mar 14 01:12:46 cilium-01 rke2[6995]: time="2024-03-14T01:12:46Z" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
> Mar 14 01:12:48 cilium-01 rke2[6995]: time="2024-03-14T01:12:48Z" level=debug msg="Wrote ping"
> Mar 14 01:12:50 cilium-01 rke2[6995]: time="2024-03-14T01:12:50Z" level=debug msg="Wrote ping"
> Mar 14 01:12:51 cilium-01 rke2[6995]: time="2024-03-14T01:12:51Z" level=info msg="Connecting to proxy" url="wss://192.168.86.56:9345/v1-rke2/connect"
> Mar 14 01:12:51 cilium-01 rke2[6995]: time="2024-03-14T01:12:51Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 192.168.86.56:9345: connect: connection re>
> Mar 14 01:12:51 cilium-01 rke2[6995]: time="2024-03-14T01:12:51Z" level=error msg="Remotedialer proxy error" error="dial tcp 192.168.86.56:9345: connect: connection refused"
> Mar 14 01:12:53 cilium-01 rke2[6995]: time="2024-03-14T01:12:53Z" level=debug msg="Wrote ping"
> Mar 14 01:12:55 cilium-01 rke2[6995]: time="2024-03-14T01:12:55Z" level=debug msg="Wrote ping"
> Mar 14 01:12:56 cilium-01 rke2[6995]: time="2024-03-14T01:12:56Z" level=info msg="Connecting to proxy" url="wss://192.168.86.56:9345/v1-rke2/connect"
> Mar 14 01:12:56 cilium-01 rke2[6995]: time="2024-03-14T01:12:56Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 192.168.86.56:9345: connect: connection re>
> Mar 14 01:12:56 cilium-01 rke2[6995]: time="2024-03-14T01:12:56Z" level=error msg="Remotedialer proxy error" error="dial tcp 192.168.86.56:9345: connect: connection refused"
> Mar 14 01:12:58 cilium-01 rke2[6995]: time="2024-03-14T01:12:58Z" level=debug msg="Wrote ping"
> Mar 14 01:13:00 cilium-01 rke2[6995]: time="2024-03-14T01:13:00Z" level=debug msg="Wrote ping"
> Mar 14 01:13:01 cilium-01 rke2[6995]: time="2024-03-14T01:13:01Z" level=info msg="Connecting to proxy" url="wss://192.168.86.56:9345/v1-rke2/connect"
> Mar 14 01:13:01 cilium-01 rke2[6995]: time="2024-03-14T01:13:01Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 192.168.86.56:9345: connect: connection re>
> Mar 14 01:13:01 cilium-01 rke2[6995]: time="2024-03-14T01:13:01Z" level=error msg="Remotedialer proxy error" error="dial tcp 192.168.86.56:9345: connect: connection refused"

brandond commented Mar 14, 2024

On control-plane-only nodes, the apiserver connects to etcd through its own client loadbalancer, the same way agents connect to the apiserver. That loadbalancer runs in the rke2 supervisor process, so when you stop the service the apiserver loses its connection to etcd - but the apiserver itself doesn't stop running. The agents are all still connected to it, and their client-side loadbalancer doesn't fail over, because nothing has actually gone down; they're just stuck talking to an apiserver that can't serve anything because it no longer has a functioning datastore. The other apiservers then see that the kubelets and other clients that were talking to that apiserver have gone silent, which is why those nodes get reported as NotReady.
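To make the "doesn't stop running" part concrete, here is one way to check it; the paths are the usual RKE2 defaults, so adjust them if your install puts crictl or the containerd socket elsewhere:

# On the control-plane-only node where the service was stopped:
$ sudo systemctl stop rke2-server
$ sudo /var/lib/rancher/rke2/bin/crictl \
    --runtime-endpoint unix:///run/k3s/containerd/containerd.sock \
    ps --name kube-apiserver
# The kube-apiserver container is still listed even though the supervisor
# (and with it the etcd client loadbalancer) is gone.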

We'll probably need to make some improvements to the apiserver and etcd loadbalancers so that they properly handle this and proactively fail over traffic away from the node in question. We can also look at trying to improve the etcd loadbalancer so that the control-plane-only node's connection isn't disrupted when the supervisor is stopped, but that would probably be a more invasive change.

@brandond brandond self-assigned this Mar 14, 2024
@brandond brandond added priority/high kind/bug Something isn't working labels Mar 14, 2024
@brandond brandond added this to the v1.29.4+rke2r1 milestone Mar 14, 2024
@brandond brandond changed the title Stopping rke-server on ControlPlane node causes other nodes to go NotReady Stopping rke2-server service on ControlPlane node causes other nodes to go NotReady Mar 19, 2024

tmmorin commented Apr 8, 2024

FTR, I observed what I think is a variant of this issue on RKE2 v1.28.8+rke2r1: the local proxy on 9345 was dispatching to all RKE2 servers, including one that had been stopped, or was being installed and not fully ready yet.


fmoral2 commented Apr 15, 2024

Validated on Version:

rke2 version v1.29.3+dev.9a8df95e (e9ac287a8efdd606f41849d96744d4679449b26f)



Environment Details

Infrastructure: Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
SUSE Linux Enterprise Server 15 SP4

Cluster Configuration:
Split roles:

  • 2 cp only
  • 2 etcd
  • 1 worker

Steps to validate the fix

  1. Create a split-role cluster
  2. Stop rke2-server on one of the control-plane-only nodes
  3. Check the other nodes
  4. Validate that no other node is NotReady or inactive
  5. Validate pods (commands sketched below)
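A minimal sketch of the commands behind steps 2-5, assuming kubectl is pointed at the cluster from one of the surviving server nodes:

# Step 2: on one of the control-plane-only nodes
$ sudo systemctl stop rke2-server

# Steps 3-5: from another server node, every other node should stay Ready
# and workloads should keep running
$ kubectl get nodes
$ kubectl get pods -A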

Reproduction Issue:

 
$ rke2 -v
rke2 version v1.27.11+rke2r1 (6665618680112568f79b1f5992aecf4655e3cf8b)
go version go1.21.7 X:boringcrypto

 
On a CP-only node:
$ sudo systemctl stop rke2-server


On another node:

$ k get nodes -o wide
NAME                                          STATUS     ROLES                  AGE   VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                              KERNEL-VERSION              CONTAINER-RUNTIME
ip-172-31-1-96.us-east-2.compute.internal     Ready      <none>                 50m   v1.27.11+rke2r1   172.31.1.96     3.128.30.47     SUSE Linux Enterprise Server 15 SP4   5.14.21-150400.22-default   containerd://1.7.11-k3s2
ip-172-31-10-186.us-east-2.compute.internal   NotReady   control-plane,master   59m   v1.27.11+rke2r1   172.31.10.186   3.144.224.153   SUSE Linux Enterprise Server 15 SP4   5.14.21-150400.22-default   containerd://1.7.11-k3s2
ip-172-31-10-201.us-east-2.compute.internal   Ready      control-plane,master   59m   v1.27.11+rke2r1   172.31.10.201   18.224.62.92    SUSE Linux Enterprise Server 15 SP4   5.14.21-150400.22-default   containerd://1.7.11-k3s2
ip-172-31-10-59.us-east-2.compute.internal    NotReady   etcd   
 


Validation Results:

       
       
- Tried from 2 different control plane nodes; in each case only the node where rke2-server was stopped went NotReady.

CP-2:
$ k get nodes
NAME                                          STATUS     ROLES                       AGE   VERSION
ip-172-31-0-110.us-east-2.compute.internal    Ready      control-plane,master        18m   v1.29.3+rke2r1
ip-172-31-0-156.us-east-2.compute.internal    NotReady   control-plane,master        18m   v1.29.3+rke2r1
ip-172-31-13-128.us-east-2.compute.internal   Ready      etcd                        18m   v1.29.3+rke2r1
ip-172-31-13-65.us-east-2.compute.internal    Ready      control-plane,etcd,master   23m   v1.29.3+rke2r1
ip-172-31-2-129.us-east-2.compute.internal    Ready      <none>                      15m   v1.29.3+rke2r1
ip-172-31-7-162.us-east-2.compute.internal    Ready      etcd                        18m   v1.29.3+rke2r1
ip-172-31-7-255.us-east-2.compute.internal    Ready      <none>                      14m   v1.29.3+rke2r1
ip-172-31-8-35.us-east-2.compute.internal     Ready      <none>                      14m   v1.29.3+rke2r1


CP-1:
$ k get nodes
NAME                                          STATUS     ROLES                       AGE   VERSION
ip-172-31-0-110.us-east-2.compute.internal    NotReady   control-plane,master        20m   v1.29.3+rke2r1
ip-172-31-0-156.us-east-2.compute.internal    Ready      control-plane,master        20m   v1.29.3+rke2r1
ip-172-31-13-128.us-east-2.compute.internal   Ready      etcd                        20m   v1.29.3+rke2r1
ip-172-31-13-65.us-east-2.compute.internal    Ready      control-plane,etcd,master   25m   v1.29.3+rke2r1
ip-172-31-2-129.us-east-2.compute.internal    Ready      <none>                      17m   v1.29.3+rke2r1
ip-172-31-7-162.us-east-2.compute.internal    Ready      etcd                        21m   v1.29.3+rke2r1
ip-172-31-7-255.us-east-2.compute.internal    Ready      <none>                      17m   v1.29.3+rke2r1
ip-172-31-8-35.us-east-2.compute.internal     Ready      <none>                      17m   v1.29.3+rke2r1


