rke2-server concurrency issue with big clusters and rke2-agent-load-balancer misbehavior #4975
Comments
As discussed on Slack, there appear to be a couple of things going on here:
While I do believe that we could perhaps make some improvements to both the client-side load-balancer behavior and the server certificate generation code path, I think the more immediate fix is to avoid overloading the server by starting 300+ agents all at the same time. Stagger them randomly over a period of 1-5 minutes, or in batches of 25-50 with a minute wait in between.
The flame graph shared on Slack suggests that most of the CPU time on the server is being consumed by the constant-time password comparison that is done when validating the node password against the contents of the node password secret. We could probably also use a cache for the node password secrets instead of getting them directly, but that doesn't appear to be the bottleneck, at least not according to this trace.
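For context, this kind of password verification is deliberately CPU-hard. Below is a minimal sketch of scrypt-style verification with a constant-time comparison, assuming `golang.org/x/crypto/scrypt` and made-up cost parameters; it is illustrative only, not the actual k3s/RKE2 code:

```go
package main

import (
	"crypto/subtle"
	"fmt"

	"golang.org/x/crypto/scrypt"
)

// verifyNodePassword re-derives the scrypt hash of the submitted password and
// compares it in constant time against the stored hash. The scrypt derivation
// is intentionally expensive (CPU and memory hard), which is why hundreds of
// concurrent join requests can saturate the server.
func verifyNodePassword(password, salt, storedHash []byte) (bool, error) {
	// Hypothetical cost parameters; the real values are whatever k3s uses.
	derived, err := scrypt.Key(password, salt, 1<<15, 8, 1, 32)
	if err != nil {
		return false, err
	}
	return subtle.ConstantTimeCompare(derived, storedHash) == 1, nil
}

func main() {
	salt := []byte("example-salt")
	stored, _ := scrypt.Key([]byte("node-password"), salt, 1<<15, 8, 1, 32)
	ok, _ := verifyNodePassword([]byte("node-password"), salt, stored)
	fmt.Println("password valid:", ok)
}
```

Note that caching the secret avoids the apiserver round trip, but the scrypt derivation above still has to run once per join request.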
Thank you @brandond. Yes, for now the easiest way is to stagger the startups, although I believe there's room for improvement here. As I see it, there are multiple ways to fix this, and the easiest would be implementing an exponential backoff on the agent's join retries.
Optionally - as you mentioned - the server side could pre-cache and WATCH the node-password secrets. With that, I believe the problem with throttling would mostly be gone. Increasing the default QPS there wouldn't hurt at all though.
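As an illustration of the pre-cache/WATCH idea, here is a generic client-go sketch that serves secrets from a local informer-backed cache instead of fetching them from the apiserver on every request. This is not the wrangler-based code k3s/RKE2 actually use, and the namespace and secret name are assumptions:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newSecretGetter starts a WATCH-backed informer so that node-password
// secrets are read from a local cache instead of hitting the apiserver
// on every join request.
func newSecretGetter(cfg *rest.Config, stop <-chan struct{}) (func(name string) (*corev1.Secret, error), error) {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	// The namespace is an assumption; use wherever the node-password secrets live.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 10*time.Minute, informers.WithNamespace("kube-system"))
	lister := factory.Core().V1().Secrets().Lister().Secrets("kube-system")

	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	return func(name string) (*corev1.Secret, error) {
		// Served from the informer's local store, kept fresh by the WATCH.
		return lister.Get(name)
	}, nil
}

func main() {
	cfg, err := rest.InClusterConfig() // or load a kubeconfig when running outside the cluster
	if err != nil {
		panic(err)
	}
	stop := make(chan struct{})
	defer close(stop)

	getSecret, err := newSecretGetter(cfg, stop)
	if err != nil {
		panic(err)
	}
	// Hypothetical secret name; the real per-node naming is whatever k3s/RKE2 use.
	secret, err := getSecret("example-node.node-password.rke2")
	fmt.Println(secret, err)
}
```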
I'm not sure how an exponential backoff would fix this. I suppose we could time out that request a bit more aggressively, although I'm not sold on changing the retry behavior. I'll have to do some testing. The high CPU utilization is not being caused by retrieval of the node secrets; it is coming from scrypt validation of the node password provided in the request against the version stored in the secret. It would be quite easy to wire up a cache here to pre-load the secrets so that they aren't pulled directly from the apiserver on demand, but this would not do anything to address the load associated with concurrently validating 300+ scrypt hashes. I suspect that pre-caching them will provide a moderate performance improvement, but it's not going to solve the Thundering Herd problem that you've created by attempting to start every node in your cluster all at once.
How the exponential backoff would solve this is that instead of hammering rke2-server every 10 seconds, it would give the server some time to process a batch of requests; if not all requests could be served, it would wait longer before retrying, and so on until everything is completed. Also - without having deep knowledge here - what is the reason for using scrypt for the node password validation?
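A minimal sketch of what an exponential backoff with jitter for the join retry could look like, using the `k8s.io/apimachinery/pkg/util/wait` helpers. This is illustrative only, not the actual agent code; the retry condition is a made-up placeholder:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// tryJoin is a stand-in for the real "join the server / fetch the kubelet cert" attempt.
func tryJoin() bool {
	return false
}

func main() {
	// Start at 5s and roughly double each attempt, with up to 50% jitter,
	// instead of retrying forever at a fixed interval.
	backoff := wait.Backoff{
		Duration: 5 * time.Second,
		Factor:   2.0,
		Jitter:   0.5,
		Steps:    8,
	}

	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		return tryJoin(), nil
	})
	if err != nil {
		fmt.Println("gave up joining:", err)
	}
}
```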
I'm not sure if we want a full exponential backoff, but we can certainly look at changing the timing. Perhaps just an increasing backoff with jitter.
Yes, I think that already would make a huge difference.
This adds a 1.0 jitter factor to the retry, so it will now be a random interval somewhere between 5 and 10 seconds. It also enables a cache for the node password secret retrieval. I'm curious if you see any difference with this. If this doesn't make a significant impact in your environment we can look at an increasing backoff on the retry, but I don't want to make things worse for other users if we can help it.
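For reference, a 1.0 jitter factor in the apimachinery `wait` package stretches the base duration by a random extra of up to 100%, which is where the 5-10 second range comes from. A tiny illustrative sketch (not the actual k3s/RKE2 retry loop):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	base := 5 * time.Second
	// wait.Jitter returns base plus a random extra of up to maxFactor*base,
	// so with maxFactor=1.0 each retry sleeps somewhere between 5s and 10s.
	for i := 0; i < 3; i++ {
		fmt.Println("next retry in", wait.Jitter(base, 1.0))
	}
}
```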
Thank you @brandond! Will you build an RC we could try? (Preferably a 1.25 RC, to not have too many changes in between.)
Not yet. Also 1.25 is end of life, although we may get one last unexpected patch for it. It is time to upgrade. |
With regards to what to test - I think just confirming that the agent retries joins at non-fixed intervals (somewhere between 5-10 seconds, instead of exactly 5 seconds) is sufficient.
Validated using v1.28.4-rc1+rke2r1. Created a cluster with 3 server nodes and 3 agents. Rebooted all agent nodes at once multiple times. They came back up normally. I have not tried to reproduce the original issue at the same scale, given the number of agent nodes involved.
Environmental Info:
RKE2 Version:
v1.25.14+rke2r1 (36d7417)
Node(s) CPU architecture, OS, and Version:
5.15.0-86-generic #96~20.04.1-Ubuntu SMP x86_64 GNU/Linux
Cluster Configuration:
3 control planes (12vCPU), 5 etcd, 318 worker nodes
Describe the bug:
Either during installation or a big-bang reboot of RKE2, the `v1-rke2/serving-kubelet.crt` endpoint gets trapped in some concurrency issue, often consuming all CPUs on certain control-plane nodes. Also, at the same time, the `rke2-agent-load-balancer` on the worker nodes does not consider multiple control-plane nodes and, under certain conditions, even removes working control-plane nodes from the load balancer. The only solution so far is restarting `rke2-server` on the control-plane nodes - often multiple times - until all new kubelet certificates are served and the nodes can come online.
Steps To Reproduce:
Expected behavior:
During installation/reboot, the `v1-rke2/serving-kubelet.crt` endpoint would serve the requested certificate, or, if the request times out on the agent for any reason, the agent should fall back to the already existing certificate - if one exists. At the same time, the API (which, as far as I know, comes pretty much from k3s) should be able to serve 300+ certificates in time; it's not that big of a deal after all.
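For illustration, here is a sketch of the fallback behavior being asked for: reuse an existing on-disk certificate when the request to the server fails. The `getFromServer` callback and certificate path are hypothetical, and this is not current RKE2 behavior:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// fetchServingCert tries to get a fresh serving certificate from the server
// and falls back to an existing on-disk certificate if the request fails and
// a previous certificate is still present.
func fetchServingCert(getFromServer func() ([]byte, error), certPath string) ([]byte, error) {
	cert, err := getFromServer()
	if err == nil {
		if writeErr := os.WriteFile(certPath, cert, 0600); writeErr != nil {
			return nil, writeErr
		}
		return cert, nil
	}
	// Server unreachable or timed out: reuse the existing certificate if we have one.
	if existing, readErr := os.ReadFile(certPath); readErr == nil {
		fmt.Println("server request failed, reusing existing certificate:", err)
		return existing, nil
	}
	return nil, errors.New("no certificate available: " + err.Error())
}

func main() {
	failingServer := func() ([]byte, error) { return nil, errors.New("request timed out") }
	if _, err := fetchServingCert(failingServer, "/tmp/serving-kubelet.crt"); err != nil {
		fmt.Println(err)
	}
}
```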
Actual behavior:
Either during installation or a reboot of a bigger cluster, the certificates are served only to a handful of nodes and then everything gets stuck.
Additional context / logs:
I performed some extensive debugging to understand the situation and concluded that there were multiple problems. Node names are pretty much irrelevant, as the issue is not with one or two nodes but with many of them.
Scenario: Reboot the whole cluster (all nodes, control-planes, etcds, workers) to see how it recovers.
First, a huge churn of such events on the control-plane nodes, and huge CPU usage:
Then it looks like the default k8s client QPS is used here - and I can't configure it - so it starts to throttle on the k8s client side. I saw throttling go on for hours too!
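For context, these throttling messages come from client-go's client-side rate limiter, which defaults to 5 QPS with a burst of 10 and is set on the `rest.Config`. A generic client-go sketch of raising it (not RKE2's actual configuration path; the kubeconfig path is just an example):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig the usual way; the path here is only an example.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/rancher/rke2/rke2.yaml")
	if err != nil {
		panic(err)
	}

	// Raise the client-side rate limit so bursts of secret lookups are not
	// throttled by the default QPS/Burst values.
	cfg.QPS = 100
	cfg.Burst = 200

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println("client ready:", client != nil)
}
```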
On the rke2-agent side the issue manifests like the following (retries every 10s):
Ok, let's see how it behaves using curl, maybe I can see more, but it just hangs:
Although when I check https://127.0.0.1:6444/v1-rke2/readyz it says `ok`, so it should work.
Then the `rke2-agent-load-balancer` issue: this time - for some unknown reason - the agent LB removed the first control-plane node and used a control-plane node that was in the state I described above. Even though I restarted the rke2-agent, it always tried to use the same node and never tried any other control-plane nodes to get the cert. It was stuck until I restarted `rke2-server` on the stuck control-plane node.
My observations:
- the `v1-rke2/serving-kubelet.crt` endpoint should not end up in the stuck state
Please let me know if you need any more information.