
rke2-server concurrency issue with big clusters and rke2-agent-load-balancer misbehavior #4975

Closed
Sea-you opened this issue Nov 1, 2023 · 12 comments

@Sea-you

Sea-you commented Nov 1, 2023

Environmental Info:
RKE2 Version:
v1.25.14+rke2r1 (36d7417)

Node(s) CPU architecture, OS, and Version:
5.15.0-86-generic #96~20.04.1-Ubuntu SMP x86_64 GNU/Linux

Cluster Configuration:
3 control planes (12 vCPU), 5 etcd, 318 worker nodes

Describe the bug:
Either during installation or a full-cluster ("big-bang") reboot of RKE2, the v1-rke2/serving-kubelet.crt endpoint gets stuck in what appears to be a concurrency issue, often consuming all CPUs on certain control-plane nodes.

At the same time, the rke2-agent-load-balancer on the worker nodes does not fall back to other control-plane nodes and, under certain conditions, even removes working control-plane nodes from the load balancer.

The only solution so far is restarting rke2-server on the control-plane nodes - often multiple times - until all new kubelet certificates are served and the nodes can come online.

Steps To Reproduce:

  • Installed RKE2: vanilla install with Calico using Terraform, nothing special.

Expected behavior:
During installation/reboot, the v1-rke2/serving-kubelet.crt endpoint should serve the requested certificate, or, if the request times out on the agent for any reason, the agent should fall back to the already existing certificate if one exists. At the same time, the endpoint (which, as far as I know, comes pretty much straight from k3s) should be able to serve 300+ certificates in a reasonable time; it is not that much work after all.

Actual behavior:
Either during install or reboot of a bigger cluster, the certificates are only served to a handful of nodes before the remaining requests get stuck.

Additional context / logs:
I performed some extensive debugging to understand the situation and concluded that there are multiple problems. Node names are pretty much irrelevant, as the issue is not with one or two nodes but with many of them.

Scenario: Reboot the whole cluster (all nodes, control-planes, etcds, workers) to see how it recovers.

First, a huge churn of events like these on the control-plane nodes, together with very high CPU usage:

...
Nov 01 11:20:46 controlplane-3 rke2[96790]: level=info msg="certificate CN=worker-172 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 11:20:46 +0000 UTC"
Nov 01 11:20:46 controlplane-3 rke2[96790]: level=info msg="certificate CN=worker-287 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 11:20:46 +0000 UTC"
Nov 01 11:20:46 controlplane-3 rke2[96790]: level=info msg="certificate CN=worker-26 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 11:20:46 +0000 UTC"
...

Then it looks like the default Kubernetes client QPS is used here - and I can't configure it - so requests start to get throttled on the client side. I saw requests being throttled for hours, too:

 <control-plane-1> rke2[91299]: 91299 request.go:690] Waited for 46.830122287s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:6443/api/v1/namespaces/kube-system/secrets/worker-196.node-password.rke2
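
For reference, this message comes from client-go's built-in client-side rate limiter. A minimal sketch of how the QPS/burst could be raised when building a client, assuming the rest.Config were exposed as a setting (which, as noted, it currently isn't in rke2 - the values below are hypothetical):

```go
package example

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// newClient builds a clientset with raised rate limits. client-go
// defaults to QPS=5 and Burst=10, which is where the multi-minute
// "client-side throttling" waits come from when 300+ node-password
// secrets are fetched at once.
func newClient(kubeconfig string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	cfg.QPS = 100   // hypothetical value, not an existing rke2 setting
	cfg.Burst = 200 // hypothetical value, not an existing rke2 setting
	return kubernetes.NewForConfig(cfg)
}
```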

On the rke2-agent side the issue manifests like the following (retries every 10s):

worker-5 rke2[51312]: level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
worker-5 rke2[51312]:  level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
worker-5 rke2[51312]:  level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

OK, let's see how it behaves using curl - maybe I can see more - but it just hangs:

# curl -H "rke2-Node-Name: <hostname>" -H "rke2-Node-Password: <password>" https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key -vvv
*   Trying 127.0.0.1:6444...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 6444 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: O=rke2; CN=rke2
*  start date: Oct 19 11:35:18 2023 GMT
*  expire date: Oct 18 11:37:59 2024 GMT
*  issuer: CN=rke2-server-ca@1697715318
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55da0c6b7300)
> GET /v1-rke2/serving-kubelet.crt HTTP/2
> Host: 127.0.0.1:6444
> user-agent: curl/7.68.0
> accept: */*
> rke2-node-name: <hostname>
> rke2-node-password: <password>
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!

However, when I check https://127.0.0.1:6444/v1-rke2/readyz it returns ok, so the endpoint should be working.


Then the rke2-agent-load-balancer issue:

msg="Adding server to load balancer rke2-agent-load-balancer: <control-plane-1>:9345"
msg="Adding server to load balancer rke2-agent-load-balancer: <control-plane-3>:9345"
msg="Removing server from load balancer rke2-agent-load-balancer: <control-plane-1>:9345"
msg="Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [<control-plane-3>:9345] [default: <control-plane-1>:9345]"
msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

This time, for some unknown reason, the agent LB removed the first control-plane node and kept using a control-plane node that was in the stuck state described above.

Even though I restarted the rke2-agent, it always tried the same node and never tried any other control-plane node to get the certificate. It stayed stuck until I restarted rke2-server on the affected control-plane node.

My observations:

  • The agent load balancer should always consider all control-plane nodes in a round-robin fashion; if one times out, it should try the next one (see the sketch after this list).
  • The v1-rke2/serving-kubelet.crt endpoint should not end up in the stuck state.
  • The client QPS should be configurable, or the defaults (QPS 5 with a burst of 10, as far as I remember) should be raised to avoid the throttling.
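
To illustrate the first point, here is a minimal sketch of the kind of failover I have in mind (tryServers and its parameters are made up for illustration; this is not the actual rke2 load-balancer code):

```go
package example

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

// tryServers walks every known control-plane endpoint in order and
// returns the first connection that succeeds within the per-attempt
// timeout, instead of staying pinned to a single cached server.
func tryServers(servers []string, perAttempt time.Duration) (net.Conn, error) {
	for _, addr := range servers {
		dialer := &net.Dialer{Timeout: perAttempt}
		// InsecureSkipVerify only keeps the sketch short; the real agent
		// pins the cluster CA.
		conn, err := tls.DialWithDialer(dialer, "tcp", addr, &tls.Config{InsecureSkipVerify: true})
		if err != nil {
			continue // unreachable or timed out, move on to the next server
		}
		return conn, nil
	}
	return nil, fmt.Errorf("none of the %d control-plane servers responded", len(servers))
}
```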

Please let me know if you need any more information.

@brandond
Member

brandond commented Nov 1, 2023

As discussed on slack, there appear to be a couple things going on here:

  1. Your shutdown sequence has left the agents with only a single server cached in the load-balancer address list (control-plane-3).
  2. When all 300+ agents start up, they attempt to renew their client certificates using the single server that was available at the time they were stopped.
  3. This server becomes overloaded with concurrent requests, and certificate generation requests hang and time out.
  4. Because the server can still be connected to, the load-balancer does not fail over to the default endpoint; and because the agent cannot yet connect to the apiserver (its certs haven't been updated), it can't get an updated list of servers to try anyway.

While I do believe we could make some improvements to both the client-side load-balancer behavior and the server-side certificate generation code path, the more immediate fix is to avoid overloading the server by starting 300+ agents all at the same time. Stagger them randomly over a period of 1-5 minutes, or start them in batches of 25-50 with a minute's wait in between.

@brandond
Member

brandond commented Nov 1, 2023

The flame graph shared on slack suggests that most of the CPU time on the server is being consumed by the constant-time password comparison that is done when validating the node password against the contents of the node password secret:

[flame graph image: profile__1_]

https://github.com/k3s-io/k3s/blob/c7c339f0b7315eb3c017cc68930282e2eb7e8c75/pkg/nodepassword/nodepassword.go#L61C25-L61C25

We could probably also use a cache for the node password secrets instead of getting them directly, but that doesn't appear to be the bottleneck, at least not according to this trace.
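
For illustration only, the hot path amounts to something like the following scrypt check (a simplified sketch, not the actual k3s code; the function name and cost parameters are placeholders):

```go
package example

import (
	"crypto/subtle"

	"golang.org/x/crypto/scrypt"
)

// verifyNodePassword re-derives the scrypt hash of the password sent in
// the join request and compares it in constant time against the hash
// stored in the node-password secret. The cost parameters below are
// placeholders; the point is that every call deliberately burns CPU and
// memory, so 300+ concurrent joins saturate the server even if the
// secret itself comes from a cache.
func verifyNodePassword(password string, salt, storedHash []byte) (bool, error) {
	derived, err := scrypt.Key([]byte(password), salt, 1<<15, 8, 1, 32)
	if err != nil {
		return false, err
	}
	return subtle.ConstantTimeCompare(derived, storedHash) == 1, nil
}
```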

@Sea-you
Author

Sea-you commented Nov 3, 2023

Thank you @brandond.

Yes, for now the easiest workaround is to stagger the startups, although I believe there's room for improvement here.

As I see it, there are multiple ways to fix this, and the easiest is implementing an exponential backoff:

  • During the client certificate renewal on agent startup, there should be an exponential backoff mechanism in the getServingCert function. This would slow down the agent clients, so the agents would not need to be started in batches and everything would come up cleanly (see the sketch below).

Optionally - as you mentioned - the server side could pre-cache and WATCH the node-password secrets. With that, I believe the throttling problem would mostly be gone. Increasing the default QPS there wouldn't hurt either.
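
A rough sketch of what I mean, using the apimachinery wait helpers (requestServingCert and the backoff values are placeholders, not the actual agent code):

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// requestServingCert stands in for the agent's call to
// /v1-rke2/serving-kubelet.crt; the real request is omitted here.
func requestServingCert() error {
	return errors.New("server is not ready")
}

func main() {
	// Retry with exponential backoff and jitter instead of a fixed 10s
	// loop, so hundreds of agents naturally spread their retries out.
	// All values below are placeholders.
	backoff := wait.Backoff{
		Duration: 5 * time.Second, // first retry after roughly 5s
		Factor:   2.0,             // double the interval each attempt
		Jitter:   0.5,             // randomize to avoid synchronized retries
		Steps:    10,              // stop after 10 attempts
		Cap:      5 * time.Minute, // never wait longer than 5 minutes
	}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if reqErr := requestServingCert(); reqErr != nil {
			return false, nil // not served yet, try again after the next interval
		}
		return true, nil // certificate retrieved, stop retrying
	})
	fmt.Println("final result:", err)
}
```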

@brandond
Member

brandond commented Nov 3, 2023

I'm not sure how an exponential backoff would fix this. I suppose we could time out that request a bit more aggressively, although I'm not sold on changing the retry behavior. I'll have to do some testing.

The high CPU utilization is not caused by retrieval of the node secrets; it comes from scrypt validation of the node password provided in the request against the version stored in the secret. It would be quite easy to wire up a cache here to pre-load the secrets so they aren't pulled directly from the apiserver on demand, but that would not do anything to address the load of concurrently validating 300+ scrypt hashes.

I suspect that pre-caching them will provide a moderate performance improvement, but it's not going to solve the Thundering Herd problem that you've created by attempting to start every node in your cluster all at once.

@Sea-you
Author

Sea-you commented Nov 7, 2023

The exponential backoff would solve this because, instead of hammering rke2-server every 10 seconds, it would give the server time to process some of the requests; any agents that were not served would retry after a longer interval, and so on until everything completes.

Also - without having deep knowledge here - what is the reason for using scrypt instead of something else? If I understand correctly, scrypt was designed to be slow.

@brandond
Member

brandond commented Nov 7, 2023

The exponential backoff would solve this because, instead of hammering rke2-server every 10 seconds

I'm not sure we want a full exponential backoff, but we can certainly look at changing the timing. Perhaps just an increasing backoff with jitter.

what is the reason of using scrypt here instead of something else?

k3s-io/k3s#2407 (comment)

@Sea-you
Author

Sea-you commented Nov 8, 2023

I'm not sure we want a full exponential backoff, but we can certainly look at changing the timing. Perhaps just an increasing backoff with jitter.

Yes, I think that already would make a huge difference.

@brandond
Member

brandond commented Nov 16, 2023

This adds a 1.0 jitter factor to the retry, so the interval will now be somewhere random between 5 and 10 seconds. It also enables a cache for node password secret retrieval.

I'm curious if you see any difference with this. If this doesn't make a significant impact in your environment we can look at an increasing backoff on the retry, but I don't want to make things worse for other users if we can help it.
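
Roughly, the retry interval now behaves like this (a sketch of the jitter math only, not the literal change):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// With a 5s base interval and a jitter factor of 1.0, each wait is
	// drawn from [5s, 10s), so agents that booted at the same moment
	// quickly drift apart instead of retrying in lockstep.
	base := 5 * time.Second
	for i := 0; i < 3; i++ {
		fmt.Println("next retry in", wait.Jitter(base, 1.0))
	}
}
```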

@Sea-you
Author

Sea-you commented Nov 16, 2023

Thank you @brandond! Will you build an RC we could try? (Preferably a 1.25 RC, to not have too many changes in between.)

@brandond
Member

Not yet. Also 1.25 is end of life, although we may get one last unexpected patch for it. It is time to upgrade.

@brandond
Member

With regard to what to test - I think just confirming that the agent retries joins at non-fixed intervals (somewhere between 5 and 10 seconds, instead of exactly 5 seconds) is sufficient.

@ShylajaDevadiga
Contributor

Validated using v1.28.4-rc1+rke2r1

Created a cluster with 3 server nodes and 3 agents. Rebooted all agent nodes at once multiple times; they came back up normally. I have not tried to reproduce the original issue, given the number of agent nodes it requires.
