Agent loadbalancer may deadlock when servers are removed #6208
hello @brandond -- I'm thinking that this one is possibly around code that you own. Thanks in advance |
I believe this is a duplicate of #5949 |
Indeed, it seems to be. Thanks! |
@brandond -- I would like to reopen this issue. I've been testing with
Details.... The errors were triggered by doing a node rolling update on control nodes only (with all nodes starting with v1.28.11+rke2r1):
Logs:
None of the IPs above correspond to CP nodes that do not exist anymore.
in these files:
On the worker nodes which are not in error, there are no comparable errors in the logs, and these files list all current CP nodes:
|
@tmmorin can you confirm that
If all of these are true, please run your agent nodes with
The behavior you're describing makes it sound like you're replacing all of the control-plane nodes before the agent's health-check fails and forces it to switch over to a new server, at which point all of the servers it was previously aware of are unavailable and it has nowhere to fall back to. There could possibly be an issue with the agent failing to fall back to the default ServerURL, because there are still servers present in the pool? But I'd need more information from you to confirm that. |
On (1): I'll re-break my environment to check that and report here.
On (2):
(not entirely sure if this is your question, but:) or are you asking about RKE2 cluster members? (but then I'd expect you to ask about :9345, and then I would need to know how to list cluster members via the RKE2 server API: is there some curl I can do to see that?)
On (3):
The rolling update started at 13:12, the first node was up at 13:16, the next at 13:22, the last at 13:27. By default, Cluster API does not wait at all between the last created node becoming Ready and starting the drain of the next one. I need to search a bit; this is possibly tunable.
My questions would be:
|
The server address you are using in your agent configuration
Run
The agent load-balancers use connectivity to the websocket tunnel endpoint on a server as a health-check. When the websocket tunnel is disconnected, or the server is removed from the list, all connections using that server are closed, and clients will reconnect through the load-balancer to another server that IS passing health checks. If the server is being hard powered off by CAPI, and you have to wait for the client to get an IO Timeout from the network stack instead of the connection being actively closed when the server is stopped, then the health-checks may take longer to respond to outages. Ideally you would be stopping RKE2 on the nodes before removing them from the cluster, rather than just unceremoniously dropping them off the network.
You're not running with debug enabled, so we can't see whether or not it is trying to use it. All that your logs show is that it's trying to reconnect to the old server's websocket endpoints, because it hasn't seen them get removed from the kubernetes service's endpoints list. |
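For illustration, here is a minimal Go sketch of the failover behaviour described above -- not the actual k3s/rke2 code; the loadBalancer/backend types and the markFailed/dialContext functions are invented names. The idea: when a server fails its health check or is removed, every connection pinned to it is closed, and later dials skip it in favour of a server that is still passing checks.

package lbsketch

import (
    "context"
    "errors"
    "net"
    "sync"
)

// backend tracks one server address: its last health-check result and the
// client connections currently pinned to it.
type backend struct {
    healthy bool
    conns   []net.Conn
}

// loadBalancer is a toy stand-in for the agent load-balancer.
type loadBalancer struct {
    mu      sync.Mutex
    servers map[string]*backend
}

// markFailed is what a failed health check or a server removal triggers:
// every connection using that server is closed, so clients reconnect
// through the load-balancer and land on a server that is still healthy.
func (lb *loadBalancer) markFailed(address string) {
    lb.mu.Lock()
    defer lb.mu.Unlock()
    if b, ok := lb.servers[address]; ok {
        b.healthy = false
        for _, c := range b.conns {
            c.Close()
        }
        b.conns = nil
    }
}

// dialContext tries servers that are still passing health checks and pins
// the resulting connection to the chosen server.
func (lb *loadBalancer) dialContext(ctx context.Context, network string) (net.Conn, error) {
    lb.mu.Lock()
    defer lb.mu.Unlock()
    var d net.Dialer
    for address, b := range lb.servers {
        if !b.healthy {
            continue
        }
        conn, err := d.DialContext(ctx, network, address)
        if err != nil {
            continue // this server is unreachable, try the next one
        }
        b.conns = append(b.conns, conn)
        return conn, nil
    }
    return nil, errors.New("all servers failed")
}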
Yes, 172.20.129.32 is a virtual IP. It's managed by MetalLB which acts as the loadbalancer class for a few LB services, including one for k8s API itself and RKE2 9345:
MetalLB is essentially handling the VIP failover aspects; then, once the traffic hits the node holding the VIP, the rest is essentially kube-proxy doing the work.
Well, CAPI is (ceremoniously) doing a drain, then deleting the Node from k8s, then the node is shut down via (in my case) the OpenStack API, which I presume triggers a systemd-based system shutdown (but I would need to check that). At this point I think we'll need to loop in the team working on https://github.com/rancher-sandbox/cluster-api-provider-rke2 - they may have ideas on this topic. /cc @alexander-demicev @belgaied2 @Danil-Grigorev @furkatgofurov7
As a side note: a server being brutally shut down is something that can happen, and I would typically expect a health-check mechanism to have some form of keepalive so that it reacts faster than the TCP stack does.
Ah, I get it. So I gather that the error logs I see do not necessarily mean that the agent has no connection to the control plane nodes. But then the next question would be: why does it get stuck in NotReady? |
I just did a run, and here is the information I gathered (I started the watch while one CP node had already been introduced).
So it seems like the changes always add the endpoint for a new node before removing the old one.
One question though: in a baremetal environment, where we typically do not have spare nodes, a control plane node rolling update will possibly happen by first draining/deleting a CP node and only afterwards recreating one (on the same baremetal server). Is this something that would be problematic for RKE2?
Also, on the server that ended up NotReady, I checked what you asked about in a previous comment:
|
Well, debug logs and kubeconfigs should show more... but a couple different things could be happening:
Grab that info when you can. |
Here are the RKE2 logs from a worker node, taken during a CP node rolling update. The rolling update was triggered around 10:19. Before the rolling update, the CP nodes were:
After the rolling update:
Here are the logs of management-cluster-md-md0-be39655a08-q46b9 (172.20.129.240):
After the rolling update, I have this on that node (2 old/invalid IPs, and 2 current/valid ones):
|
This surprises me -- one of the old CP node addresses still appears in the RKE2 logs, even after log lines which indicate that RKE2 saw this CP node go away (172.20.129.145):
So:
It does not seem right at all that 172.20.129.145 would still appear even though it was removed?
EDIT: this seems to possibly be the reason - https://github.com/k3s-io/k3s/blame/aa4794b37223156c5f651d94e23670bd7e581607/pkg/agent/loadbalancer/servers.go#L88
// Don't delete the default server from the server map, in case we need to fall back to it.
if removedServer != lb.defaultServerAddress {
delete(lb.servers, removedServer)
}
Also, I can't help wondering why the agent load-balancer isn't using the VIP as default (contrary to the api-server LB, which has the VIP as "default")...
EDIT (after reading bits of code around https://github.com/k3s-io/k3s/blob/master/pkg/agent/loadbalancer/loadbalancer.go): I'm far from having a good grasp of the k3s proxy/loadbalancer code, but I'm thinking the answer may be related to how loadbalancer.SetDefault and proxy.SetSupervisorDefault are called with per-server addresses.
EDIT: after checking logs:
So the default address of the agent load-balancer was set, from https://github.com/k3s-io/k3s/blob/aa4794b37223156c5f651d94e23670bd7e581607/pkg/agent/tunnel/tunnel.go#L131, to the address of one of the API servers known at the time. |
My recap of the timeline of this log would be:
these ones repeat over and over
|
I also want to share another observation I made, on CP nodes. I fully realize that it may be noise and not related to this issue, but it's sufficiently correlated (in time) and surprising to be worth mentioning. Around the time of the rolling update, I see this in the rke2 logs:
(the times at which these errors pop up match the times when nodes appear/disappear) |
To try to nail down why kubelet isn't able to connect to https://localhost:6443, I tried with curl.
This gives no response, until it hits
I let it run in a terminal.
In another terminal:
This indicates that the rke2 process isn't really consuming the data on this socket. Doing this again while doing a tcpdump capture gives me a packet dump where I see:
This seems to indicate that the connection would be stuck somewhere in |
@tmmorin Hi, there is likely a bug in CAPRKE2. If you're using |
I'm not sure I follow you, @alexander-demicev. We do use
And the
This IP is then properly used by the agent for its initial connections to the cluster. It's only later that the default address of the supervisor agent load-balancer becomes a node-specific address:
This just does not seem right, and the place where this happens seems to be pretty clear:
Now, what I do not get is whether (or how) this would relate to the root of our issue: the fact that kubelet can't talk to https://127.0.0.1:6443 anymore, resulting in the node becoming NotReady soon after. |
I realize that I should have provided more context: this issue is seen in the context of the Sylva project, where we have been successfully testing and deploying RKE2-based clusters and doing node rolling updates. This has been working incrementally better over the past year, and reached a fairly satisfying maturity months ago. Despite sometimes hitting some issues (e.g. side-effects of #5614), in the past months we've been experiencing a lot of successful Cluster API node rolling updates, without ever hitting the issue I describe here.
This issue started to pop up in Sylva with the integration of 1.28.9 (to replace 1.28.8). In this time frame there was no significant change that we could relate to this issue, in how we use Cluster API, how we configure the Cluster API bootstrap provider for RKE2, or how fast we shut down nodes.
As a formal confirmation, I tried today to reproduce the issue with RKE2 1.28.8 on worker nodes -- doing the exact same things, in the exact same setup (control nodes remaining on the RKE2 1.28.11-rc5 that I had in the past days) -- and I can't reproduce it with RKE2 1.28.8 on worker nodes, which leads me to believe the issue was introduced in 1.28.9. |
I took current
I did this the quick'n'dirty way, but I observe many more logs of "before lock" than "after lock".
These places were touched by k3s-io/k3s#9757, which was introduced in 1.28.9. I wanted to share this with you, although it is very possibly (likely!) off the mark. Please take this with a grain of salt. I'll clean up my debug log additions and share complete logs tomorrow. |
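For reference, the "before lock" / "after lock" instrumentation described above boils down to something like the following (a hypothetical, self-contained sketch; the loadBalancer struct and setServers method are stand-ins, not the actual k3s code). A goroutine that never obtains the lock leaves an unmatched "before lock" line in the output, which is exactly the asymmetry reported above.

package main

import (
    "log"
    "sync"
)

// toy stand-in for the structure whose lock is being instrumented
type loadBalancer struct {
    mutex   sync.RWMutex
    servers map[string]bool
}

// setServers wraps the lock acquisition in two log lines: if the process is
// stuck waiting on the mutex, "before lock" is printed but the matching
// "after lock" never appears.
func (lb *loadBalancer) setServers(addresses []string) {
    log.Print("setServers: before lock")
    lb.mutex.Lock()
    log.Print("setServers: after lock")
    defer lb.mutex.Unlock()

    lb.servers = map[string]bool{}
    for _, address := range addresses {
        lb.servers[address] = true
    }
}

func main() {
    lb := &loadBalancer{servers: map[string]bool{}}
    lb.setServers([]string{"172.20.129.145:9345"})
}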
@tmmorin Sorry, I only had a chance to get a better look at the issue now. I don't have much RKE2 internal knowledge, but my impression was that workers would always use the VIP. It's up to the CAPI infra provider to shut down the machine. In my experience, CAPI providers usually terminate instances gracefully. We can make a change in CAPRKE2 and add a systemd service to the user-data that will stop the RKE2 service on shutdown. @brandond Can this help nodes discover the updated endpoint list? |
The NA team is all at an onsite through Wednesday, so I'm tied up with travel and meetings and won't have time to sit down and focus on this until later in the week. Any fixes wouldn't be merged until the July cycle anyway. |
Ok, point taken. For Sylva, we'll downgrade to 1.28.8 and see how it goes. We had been seeing #5614 occur on our rolling updates, so it's hard to quantify how often this issue will trigger, but we're likely to be back to the situation of ~3 months ago. Let's also hope that issues fixed between Kubernetes 1.28.8 and 1.28.9 won't suddenly pop up. |
Can you share the spec for the
If you could also provide full debug-level logs from an rke2 agent from startup, through the period of a control-plane node rotation that disrupts the agent's connection to the server, that would be helpful. I'm not able to follow the full state of the loadbalancer that seems to be triggering this behavior with logs that show only its state during the rotation.
One potential oddity that I am seeing is that we do not update the default address for the apiserver loadbalancer, just the supervisor loadbalancer. I need to do some thinking about the current logic for that. There are a couple different cases that I want to make sure we handle properly:
No, it won't consume any data on this socket until after it is able to dial a connection to the backend and bidirectional pipes are set up to relay data back and forth. If it can't dial any backend the connection should eventually be closed by the server, but not until it tries all the servers in the list and fails - which it seems like might take a while in your environment. The logs don't provide enough information to make it clear how long the dials take before getting a |
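To make the above concrete, here is a hedged sketch of the general accept-then-dial-then-relay pattern (not the actual k3s proxy code; serve and dialBackend are invented names): nothing is read from the accepted client connection until a backend has been dialled, so data sent by the client sits unread in the kernel socket buffer, matching the ss/tcpdump observations earlier in this thread.

package proxysketch

import (
    "io"
    "net"
)

// serve accepts client connections and, for each one, first dials a backend
// and only then starts relaying bytes in both directions. Until the dial
// succeeds (or fails for every server), the client socket is never read.
func serve(l net.Listener, dialBackend func() (net.Conn, error)) {
    for {
        client, err := l.Accept()
        if err != nil {
            return
        }
        go func(client net.Conn) {
            backend, err := dialBackend() // may block for a long time if every server is unreachable
            if err != nil {
                client.Close() // the client only sees the connection drop at this point
                return
            }
            // Bidirectional relay: only now is the client's buffered data consumed.
            go func() {
                io.Copy(backend, client)
                backend.Close()
            }()
            io.Copy(client, backend)
            client.Close()
        }(client)
    }
}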
Sorry for the delayed answer, we were busy preparing a release (based on 1.28.8).
Thanks for your answers on my point about "this indicates that the rke2 process isn't really consuming the data on this socket".
Yes, I will. I could not find time to work on this issue in the past 2 weeks, but I plan to run a reproducer with an RKE2 binary based on the last 1.28.11-rc, with cleaner logs (including logs on the time taken to acquire the lock, just in case this proves useful).
Yes, I noticed this as well. I can't see how it can make sense to set the default of the supervisor loadbalancer to a node IP, though: node IPs are meant to change, so this default address may at some point become stale (and will become stale during a full control plane node rolling update).
The other thing I noticed is that for the default address of an LB, the healthCheck seems to always be the "return true" function. I'm wondering if this may defeat the check of backend health done in dialContext (https://github.com/k3s-io/k3s/blob/1c9cffd97fd5f85a5c2c36e86c976e20820cb084/pkg/agent/loadbalancer/loadbalancer.go#L178). Then there is also the fact that the default address is never removed from the list of servers (https://github.com/k3s-io/k3s/blob/5773a3444740c69b86019d82e6cfb00a76b3e148/pkg/agent/loadbalancer/servers.go#L94).
If we take the three things combined, it seems to me (granted that I don't know much about the k3s code) that this may result in keeping a stale server in the server list, breaking the ability to fall back on another one.
Let me focus on one scenario that you raised:
Two things:
(1) Assuming that the agent keeps getting updates from the k8s API on endpoints, I understand that it should keep populating the loadbalancer server backends with the addresses of new nodes as they come up... so it should never need to fall back to the VIP.
(2) Now, if it ever were to fall back on the VIP (but see below, I did not observe this), then in our setup it would not necessarily always work, because kube-proxy sets up local iptables rules that bypass the normal IP forwarding to the VIP (where the traffic is forwarded to whoever answers ARP queries for the VIP) with DNAT rules pointing to CP node addresses -- this gives working connectivity to the VIP only if these addresses are still valid, but not if these IPs are stale, which might occur if kube-proxy has lost access to k8s (to have such access it relies on RKE2 localhost:6443, so there is a possible deadlock here) ... however, by checking counters of kube-proxy iptables rules (
This last issue would be a robustness problem - we would hit this deadlock if a server is disconnected from the network during a full CP node rolling update and reconnected afterwards: it will have no valid node address, and the VIP forwarding will be broken. Only a flush of these iptables rules will break the deadlock. The solution we have in mind to solve this is to switch to kube-vip.
I'll try to find time tomorrow to give you more logs to chew on. Please be aware that I'll be off for a few weeks starting from tomorrow EOB; my colleague Rémi Le Trocquer has followed the topic with me and will take over. |
Yes, by design the default health-check just returns true, to ensure that endpoints aren't marked failed before their health-check callbacks are registered. The health checks are registered later, when the health check goroutines are started.
If the default server address is a LB VIP and is not associated with an actual node endpoint to provide a health-check, then it is assumed to always be healthy - which is what we want.
Ahhh, that is interesting. Usually when we deploy with an LB for the fixed registration endpoint we use standalone haproxy, an ELB endpoint, or DNS alias; I've not tried using kube-vip which WOULD interact with kube-proxy due to use of a Kubernetes service to expose the endpoint. I suspect that you are probably on to something with this. |
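Going back to the placeholder health-check behaviour described at the start of the comment above, here is a minimal sketch of that pattern (illustrative only; the names are invented, not the k3s API): a server entry starts with a check that always reports healthy, and the real callback is only swapped in when the health-check goroutine for that server starts.

package healthsketch

// server is a toy stand-in for a load-balancer backend entry.
type server struct {
    address     string
    healthCheck func() bool
}

// newServer starts with a placeholder check so the server is not marked
// failed before a real health-check callback has been registered.
func newServer(address string) *server {
    return &server{
        address:     address,
        healthCheck: func() bool { return true },
    }
}

// setHealthCheck registers the real callback once the health-check goroutine
// for this server is started. A default/VIP address that never gets a
// callback therefore always reports healthy.
func (s *server) setHealthCheck(check func() bool) {
    s.healthCheck = check
}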
xref:
You might try setting
As the loadbalancer status is usually managed by the load-balancer controller, this may need to be supported by kube-vip to avoid having kube-vip remove that field from the status. |
OK, so ignore that. With the kube-vip manifest from #6304 I was able to reproduce this by dropping the network on the server that an agent was connected to. The kubelet and rke2 agent process reconnected to another server, but then hung, and all further attempts to connect through the apiserver load-balancer timed out. You were correct that it is a locking issue, caused by reentrant use of a rwmutex read lock. I will try to get a fix in for next week's releases. |
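For anyone curious about the failure mode named in the comment above, here is a small self-contained Go program (a generic demonstration of the pitfall, not the k3s code) showing how reentrant use of an RWMutex read lock deadlocks once a writer is queued: the nested RLock waits behind the pending Lock, and the Lock waits for the outer RLock, which is never released.

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var mu sync.RWMutex
    done := make(chan struct{})

    go func() {
        mu.RLock() // outer read lock (e.g. taken while walking the server list)

        go func() {
            mu.Lock() // writer queued (e.g. a server list update); waits for the reader above
            mu.Unlock()
        }()
        time.Sleep(100 * time.Millisecond) // give the writer time to queue

        // Reentrant read: Go's RWMutex blocks new readers while a writer is
        // waiting, so this RLock never returns, and the writer in turn never
        // gets the lock because the outer RLock is never released.
        mu.RLock()
        mu.RUnlock()
        mu.RUnlock()
        close(done) // never reached
    }()

    select {
    case <-done:
        fmt.Println("no deadlock")
    case <-time.After(2 * time.Second):
        fmt.Println("deadlocked: nested RLock waits on the queued writer, the writer waits on the outer RLock")
    }
}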
@brandond thanks for the efforts! |
Thank you @brandond for nailing this down 👍 I was going to post a reproducer log, but I'll just share this as a confirmation that there is locking involved, even if I understand that at this stage you don't need this:
|
Well, I'm not sure:
On (1) I would like to ask you: could we remove this behavior, at least optionally, so that when our server URL is a VIP we could tell RKE2 to never set the default address to a node address (since those may become stale at the next rolling update)? This is highly related to the code at https://github.com/k3s-io/k3s/blob/5773a3444740c69b86019d82e6cfb00a76b3e148/pkg/agent/loadbalancer/servers.go#L94.
One misunderstanding here: the issue I explain happens with MetalLB because it does rely on k8s services and kube-proxy, but would not happen with kube-vip (which does not)
Interesting! So as said above, in our context that would have to be supported by MetalLB, but after checking code and github issues, I find no trace of support for status.loadBalancer.ingress.ipMode in MetalLB. |
I will take a look at that for the next release cycle. I am not sure that this is as critical an issue, as the supervisor lb isn't actually used for anything after the agent has started. Once everything is up and running, the only thing the agent needs a connection to is the apiserver - the supervisor addresses targeted by the remotedialer tunnel are all seeded based on the endpoints returned by the kubernetes apiserver. When you restart the agent, the VIP is restored as the default server address since that is what the --server value is set to, so the VIP's presence is never really missed.
Ah, interesting to know. Then indeed I understand that this is of relatively low priority. Still, it's not nice to see old unused IPs popping up in logs... ;-) |
I think I can actually get that in now: k3s-io/k3s#10511 |
Closing based on: #6321 |
Environmental Info:
RKE2 Version: 1.28.9+rke2r1
Cluster Configuration:
Context:
The context is an RKE2 cluster deployed and upgraded using Cluster API (using https://github.com/rancher-sandbox/cluster-api-provider-rke2)
This implies that on some operations (RKE2 upgrade, OS upgrade, etc.) all nodes are replaced: old nodes are drained and deleted while new nodes are created (one at a time for CP nodes, possibly more than one at a time for workers). This is called a "node rolling update" and is similar to Pod replacement in a ReplicaSet.
Describe the bug:
After a node rolling update, I observe that some of the old nodes (ones that would be replaced by the rolling update) are stuck in the "NotReady,SchedulingDisabled" state.
The RKE2 logs show the following error repeated over and over:
A more readable extract of the end of the lines:
What is striking is that the IP which is not reachable is not an IP of any node; it's actually an IP of a CP node that has already been drained and deleted (the underlying VM does not exist anymore) but which still exists as a Kubernetes node. (I think I observed occurrences of this issue where connection attempts were made to an IP which wasn't an IP of any currently existing Node.)
On disk I find the following:
The currently existing nodes are:
So the rke2-*-load-balancer.json files have:
When the RKE2 agent is in such a state, the only solution I found was to restart it, and this is sufficient to have it properly join the cluster. Waiting, even for a long time (tens of hours), isn't sufficient.
The questions would be:
rke2-*-load-balancer.json files?
rke2-*-load-balancer.json files?
Expected behavior:
We would need an already existing node to keep working fine even when some or all CP nodes are drained/deleted/recreated during a Cluster API node rolling update.
Additional elements:
Details on node creation times:
(other workers omitted from the list)
Alternative scenario:
I'll report more if I can reproduce it, but I'm pretty sure I observed cases where the rke2-*-load-balancer.json files had no currently existing control plane nodes, and were only listing old/deleted ones. I suspect this would possibly be observed on a worker node that was created before the rolling update, where the CP node rollout completes before the worker node is replaced.
I'll try to reproduce this variant and report back here.
Reference:
I've wondered whether k3s-io/k3s#10241 might be remotely related (in the issue I report, it's also about a single remaining server being attempted).