RKE2 agent changes server in kubelet.kubeconfig on start when control plane unavailable #5364
Comments
That is interesting. I will say that the agent is not expected to start up successfully without an operational control-plane node.
You're not supposed to point it directly at a single server. It is expected that, in an HA scenario, you will have multiple servers behind a fixed registration address that the agent's server value points at. We can take a look and see if there's any way we can make this more consistent, but since you are specifically pointing it at a single server, it is not surprising that this server will be a single point of failure for agents.
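For context, a minimal sketch of the recommended agent setup, assuming a fixed registration address (DNS round-robin or load balancer) named rke2.example.com fronting all server nodes; the hostname and token are placeholders, not values from this report:

```bash
# Point the agent at a fixed registration address instead of a single server node.
mkdir -p /etc/rancher/rke2
cat <<'EOF' > /etc/rancher/rke2/config.yaml
server: https://rke2.example.com:9345
token: <cluster-token>
EOF
systemctl restart rke2-agent
```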
That's fair, and using a load balancer is our recommended way to deploy internally. We just noticed this behavior, which is unexpected. I would never have guessed the agent process would change the configuration in kubelet.kubeconfig after the node has been successfully added to the cluster, at least not without some kind of command being run by the administrators to tell it to do that. I also don't understand what value there is in changing it from 127.0.0.1 to the load balancer address.
Yeah, I'm trying to reproduce this on my side with some additional logging added. I agree that it is weird and probably worth fixing, even if the steps to get it to occur aren't recommended.
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
Environmental Info:
RKE2 Versions Tested:
1.26.4
1.27.10
1.28.5
1.29.1
Node(s) CPU architecture, OS, and Version:
x86_64, tested with the following OSes:
SLES 15 SP5, Rocky Linux 9, Ubuntu 22.04
Cluster Configuration:
Can be reproduced with as little as one control plane and one worker; it is easiest to demonstrate why it's a problem with three control planes and one worker.
Describe the bug:
When the entire RKE2 cluster is shut down and brought back online, if an agent starts before there is a quorum of control plane nodes available, the agent changes the server value in kubelet.kubeconfig to point at the default server that was used when the node was added.
So https://control-plane-1:6443 instead of https://127.0.0.1:6443.
The configuration resets to the valid value the next time the worker node's agent is restarted with a quorum of control plane nodes online.
If the worker/agent stays running while the control plane is down, the configuration doesn't change. It only gets changed if everything is stopped and the agents start before the control plane is ready.
Why this is a problem: if the bootstrap address was a single node (not a load-balanced address) and it does not come back online, then those worker nodes will stay in a pending state waiting for the hardcoded server to come online until something restarts the agent on the node.
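A quick way to see which endpoint the kubelet is currently configured to use, assuming the default RKE2 data directory:

```bash
# On a healthy agent this prints https://127.0.0.1:6443; after the
# problematic restart it shows the original join server instead.
grep 'server:' /var/lib/rancher/rke2/agent/kubelet.kubeconfig
```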
Steps To Reproduce:
This is using 1.29.1; it can also be reproduced on the versions listed above.
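The exact commands are not in the report; a rough sketch of the reproduction implied by the description above, assuming systemd-managed rke2-server and rke2-agent units:

```bash
# On every control plane node: stop the server service so no quorum exists.
systemctl stop rke2-server

# On the worker node: stop and then start the agent while the
# control plane is still down.
systemctl stop rke2-agent
systemctl start rke2-agent

# kubelet.kubeconfig now points at the join server
# (e.g. https://control-plane-1:6443) instead of https://127.0.0.1:6443.
grep 'server:' /var/lib/rancher/rke2/agent/kubelet.kubeconfig
```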
Steps to correct configuration:
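The report leaves this section blank; a minimal sketch based on the recovery behavior described above (the configuration resets once the agent is restarted with a quorum of control plane nodes online):

```bash
# With a quorum of control plane nodes back online, restart the agent;
# kubelet.kubeconfig is rewritten back to the loopback address.
systemctl restart rke2-agent
grep 'server:' /var/lib/rancher/rke2/agent/kubelet.kubeconfig   # https://127.0.0.1:6443 again
```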
Expected behavior:
I would expect kubelet.kubeconfig not to be changed to point anywhere else, since https://127.0.0.1:6443 is the correct address.
Actual behavior:
If the control plane node used when adding the node is running, then we have a single point of failure, but we're up.
If the control plane node that was used is unavailable, then none of the agents that pointed to that control plane node when being added will come online until the agents are restarted.
Additional context / logs:
N/A