
RKE2 agent changes server in kubelet.kubeconfig on start when control plane unavailable #5364

Closed
vincepower opened this issue Feb 7, 2024 · 4 comments


Environmental Info:
RKE2 Versions Tested:
1.26.4
1.27.10
1.28.5
1.29.1

Node(s) CPU architecture, OS, and Version:
x86_64, tested with the following OSes:
SLES 15 SP5, Rocky Linux 9, Ubuntu 22.04

Cluster Configuration:
Can be reproduced with as little as one control plane node and one worker; it is easiest to demonstrate why it's a problem with three control plane nodes and one worker.

Describe the bug:
When the entire RKE2 cluster is shut down and brought back online, if an agent starts before a quorum of control plane nodes is available, the agent changes the server value in kubelet.kubeconfig to point at the default server that was used when the node was added:
https://control-plane-1:6443 instead of https://127.0.0.1:6443
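
For illustration, this is easy to observe on the worker node (paths are from a default RKE2 install; the commented values are the before/after states described here):

```sh
# Which API server endpoint is the kubelet configured to use?
# Expected (agent routes through its local client-side load balancer):
#   server: https://127.0.0.1:6443
# After the failure scenario below, it instead shows the bootstrap server:
#   server: https://control-plane-1:6443
grep 'server:' /var/lib/rancher/rke2/agent/kubelet.kubeconfig
```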

The configuration will reset to the correct value the next time the worker node's agent is restarted with a quorum of control plane nodes online.

If the worker/agent stays running while the control plane is down, the configuration doesn't change. It only gets changed if everything is stopped and the agents start before the control plane is ready.

This is a problem because, if the bootstrap address was a single node (not a load-balanced address) and that node does not come back online, those worker nodes will stay in a pending state waiting for the hardcoded server to come online until something restarts the agent on the node.

Steps To Reproduce:
This was done using 1.29.1; it can also be reproduced on the versions listed above. (A rough shell sketch of the commands follows the list.)

  • Installed the RKE2 server on the control plane node, started it, and waited for Ready
  • Installed the RKE2 agent on the worker node with config.yaml pointing to "server: https://control-plane-1:9345", started it, and waited for Ready
  • Ran rke2-killall.sh on both nodes (easier than powering them down)
  • Started the agent on the worker node
  • 10+ seconds later, started control-plane-1
  • Checked /var/lib/rancher/rke2/agent/kubelet.kubeconfig on the worker node after the control plane node was Ready; it pointed to https://control-plane-1:6443 instead of https://127.0.0.1:6443
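
A rough shell sketch of the same steps, assuming default install paths and the systemd units created by the RKE2 install script (hostnames, timing, and the rke2-killall.sh location are illustrative and may differ per distro):

```sh
# On the worker node: the agent's config points at the bootstrap server
# (a single node, not a load-balanced address -- the scenario in question).
cat /etc/rancher/rke2/config.yaml
#   server: https://control-plane-1:9345
#   token: <cluster token>

# Simulate the full-cluster outage (run on both nodes; script path may be
# /usr/local/bin or /opt/rke2/bin depending on the distro):
rke2-killall.sh

# Bring the agent up first, then the server 10+ seconds later:
systemctl start rke2-agent        # on the worker node
sleep 15
systemctl start rke2-server       # on control-plane-1

# Once the control plane is Ready, check what the kubelet points at:
grep 'server:' /var/lib/rancher/rke2/agent/kubelet.kubeconfig
```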

Steps to correct configuration:

  • If the agent is restarted after the control plane is up and running, it will correct itself (see the command below)
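
For example, restarting the standard systemd unit on the worker node once a quorum of control plane nodes is back online (unit name per the RKE2 install script):

```sh
# kubelet.kubeconfig is rewritten back to https://127.0.0.1:6443 on restart.
systemctl restart rke2-agent
```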

Expected behavior:
I would expect kubelet.kubeconfig not to be changed to point anywhere else, as https://127.0.0.1:6443 is the right address.

Actual behavior:
If the control plane node used when adding the node is running, then we have a single point of failure, but we're up.
If that control plane node is unavailable, then none of the agents that pointed to it when being added will come online until the agents are restarted.

Additional context / logs:
N/A

brandond (Member) commented Feb 7, 2024

That is interesting. I will say that the agent is not expected to start up successfully without an operational control-plane node.

> easiest to demonstrate why it's a problem with three control planes and one worker

> This is a problem because, if the bootstrap address was a single node (not a load-balanced address) and that node does not come back online, those worker nodes will stay in a pending state waiting for the hardcoded server to come online until something restarts the agent on the node.

You're not supposed to point it directly at a single server. It is expected that, in an HA scenario, you will have multiple servers behind a fixed registration address that the agent's --server flag points at, rather than pointing at an individual server by name. This is covered in the docs: https://docs.rke2.io/install/ha
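
For reference, a minimal sketch of the documented approach, using an illustrative fixed registration address (rke2.example.com stands in for a DNS round-robin name, VIP, or load balancer fronting all server nodes; the token value is a placeholder):

```sh
# /etc/rancher/rke2/config.yaml on each agent: point "server:" at the fixed
# registration address, never at an individual server node by name.
cat /etc/rancher/rke2/config.yaml
#   server: https://rke2.example.com:9345
#   token: <shared cluster token>
```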

We can take a look and see if there's any way we can make this more consistent, but since you are specifically pointing it at a single server, it is not surprising that this server will be a single point of failure for agents.

vincepower (Author) commented Feb 8, 2024

That's fair, and using a load balancer is our recommended way to deploy internally.

We just noticed this behavior which is unexpected.

I would never have guessed the agent process would change the configuration in kubelet.kubeconfig after the node has been successfully added to the cluster, at least not without some kind of command being run by an administrator to tell it to do that.

I don't even see what value changing it from 127.0.0.1, even to the load balancer address, would have.

brandond (Member) commented Feb 8, 2024

Yeah, I'm trying to reproduce this on my side with some additional logging added. I agree that it is weird and probably worth fixing, even if the steps to get it to occur aren't recommended.

github-actions bot (Contributor) commented Apr 5, 2024

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

github-actions bot closed this as not planned on Apr 20, 2024