
RKE2 agent changes server in kubelet.kubeconfig on start when control plane unavailable #5364

Closed
vincepower opened this issue Feb 7, 2024 · 4 comments


Environmental Info:
RKE2 Versions Tested:
1.26.4
1.27.10
1.28.5
1.29.1

Node(s) CPU architecture, OS, and Version:
x86_64, tested with the following OSes:
SLES 15 SP5, Rocky Linux 9, Ubuntu 22.04

Cluster Configuration:
Can be reproduced with as little as one control plane node and one worker; it is easiest to demonstrate why it's a problem with three control plane nodes and one worker.

Describe the bug:
When the entire RKE2 cluster is shut down and brought back online, if an agent starts before a quorum of control plane nodes is available, the agent changes the server value in kubelet.kubeconfig to point at the default server that was used when the node was added:
https://control-plane-1:6443 instead of https://127.0.0.1:6443
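
For illustration, this is easy to observe on the worker node (paths are from a default RKE2 install; the commented values are the before/after states described here):

```sh
# Which API server endpoint is the kubelet configured to use?
# Expected (agent routes through its local client-side load balancer):
#   server: https://127.0.0.1:6443
# After the failure scenario below, it instead shows the bootstrap server:
#   server: https://control-plane-1:6443
grep 'server:' /var/lib/rancher/rke2/agent/kubelet.kubeconfig
```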

The configuration will reset to the correct value the next time the worker node's agent is restarted with a quorum of control plane nodes online.

If the worker/agent stays running while the control plane is down, the configuration doesn't change. It only gets changed if everything is stopped and the agents start before the control plane is ready.

This is a problem because, if the bootstrap address was a single node (not a load-balanced address) and that node does not come back online, those worker nodes will stay in a pending state waiting for the hardcoded server to come online until something restarts the agent on the node.

Steps To Reproduce:
This was done using 1.29.1; it can also be reproduced on the versions listed above. (A rough shell sketch of the commands follows the list.)

  • Installed the RKE2 server on the control plane node, started it, and waited for Ready
  • Installed the RKE2 agent on the worker node with config.yaml pointing to "server: https://control-plane-1:9345", started it, and waited for Ready
  • Ran rke2-killall.sh on both nodes (easier than powering them down)
  • Started the agent on the worker node
  • 10+ seconds later, started control-plane-1
  • Checked /var/lib/rancher/rke2/agent/kubelet.kubeconfig on the worker node after the control plane node was Ready; it pointed to https://control-plane-1:6443 instead of https://127.0.0.1:6443
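
A rough shell sketch of the same steps, assuming default install paths and the systemd units created by the RKE2 install script (hostnames, timing, and the rke2-killall.sh location are illustrative and may differ per distro):

```sh
# On the worker node: the agent's config points at the bootstrap server
# (a single node, not a load-balanced address -- the scenario in question).
cat /etc/rancher/rke2/config.yaml
#   server: https://control-plane-1:9345
#   token: <cluster token>

# Simulate the full-cluster outage (run on both nodes; script path may be
# /usr/local/bin or /opt/rke2/bin depending on the distro):
rke2-killall.sh

# Bring the agent up first, then the server 10+ seconds later:
systemctl start rke2-agent        # on the worker node
sleep 15
systemctl start rke2-server       # on control-plane-1

# Once the control plane is Ready, check what the kubelet points at:
grep 'server:' /var/lib/rancher/rke2/agent/kubelet.kubeconfig
```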

Steps to correct configuration:

  • If the agent is restarted after the control plane is up and running, it will correct itself (see the command below)
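
For example, restarting the standard systemd unit on the worker node once a quorum of control plane nodes is back online (unit name per the RKE2 install script):

```sh
# kubelet.kubeconfig is rewritten back to https://127.0.0.1:6443 on restart.
systemctl restart rke2-agent
```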

Expected behavior:
I would expect kubelet.kubeconfig not to be changed to point anywhere else, as https://127.0.0.1:6443 is the right address.

Actual behavior:
If the control plane node used when adding the node is running, then we have a single point of failure, but we're up.
If that control plane node is unavailable, then none of the agents that pointed to it when being added will come online until the agents are restarted.

Additional context / logs:
N/A

brandond (Member) commented Feb 7, 2024

That is interesting. I will say that the agent is not expected to start up successfully without an operational control-plane node.

> easiest to demonstrate why it's a problem with three control planes and one worker

> This is a problem because, if the bootstrap address was a single node (not a load-balanced address) and that node does not come back online, those worker nodes will stay in a pending state waiting for the hardcoded server to come online until something restarts the agent on the node.

You're not supposed to point it directly at a single server. It is expected that, in an HA scenario, you will have multiple servers behind a fixed registration address that the agent's --server flag points at, rather than pointing at an individual server by name. This is covered in the docs: https://docs.rke2.io/install/ha
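
For reference, a minimal sketch of the documented approach, using an illustrative fixed registration address (rke2.example.com stands in for a DNS round-robin name, VIP, or load balancer fronting all server nodes; the token value is a placeholder):

```sh
# /etc/rancher/rke2/config.yaml on each agent: point "server:" at the fixed
# registration address, never at an individual server node by name.
cat /etc/rancher/rke2/config.yaml
#   server: https://rke2.example.com:9345
#   token: <shared cluster token>
```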

We can take a look and see if there's any way we can make this more consistent, but since you are specifically pointing it at a single server, it is not surprising that this server will be a single point of failure for agents.

vincepower (Author) commented Feb 8, 2024

That's fair, and using a load balancer is our recommended way to deploy internally.

We just noticed this behavior which is unexpected.

I would never have guessed the agent process would change the configuration in kubelet.kubeconfig after the node has been successfully added to the cluster, at least not without some kind of command being run by an administrator to tell it to do that.

I don't even see what value changing it from 127.0.0.1, even to the load balancer address, would have.

brandond (Member) commented Feb 8, 2024

Yeah, I'm trying to reproduce this on my side with some additional logging added. I agree that it is weird and probably worth fixing, even if the steps to get it to occur aren't recommended.

github-actions bot (Contributor) commented Apr 5, 2024

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

github-actions bot closed this as not planned on Apr 20, 2024