-
I can't say that we do a lot of random failure testing, such as taking down interfaces to see what happens. I don't feel like this is something that the node is expected to be resilient to. I will say that much of Kubernetes networking, in particular service networking managed by kube-proxy, relies on there being an interface with a default route, in order to get traffic to service ClusterIPs routed properly - even if it will stay within the cluster or node.
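As an aside (my own addition, not part of the reply above): if you want to check which interface currently carries the IPv4 default route that service routing depends on, a minimal sketch like the following could help. It reads /proc/net/route directly; this is not how kube-proxy itself does it, just a quick way to see what disappears when you take an interface down.

```go
// Minimal sketch: report which interface holds the IPv4 default route.
// Reads /proc/net/route directly; illustrative only, not kube-proxy's own logic.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/net/route")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read routing table:", err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	scanner.Scan() // skip the header line
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 8 {
			continue
		}
		// Destination and Mask are hex-encoded; "00000000" for both means
		// 0.0.0.0/0, i.e. the default route.
		if fields[1] == "00000000" && fields[7] == "00000000" {
			fmt.Printf("default route via interface %s (gateway %s)\n", fields[0], fields[2])
			return
		}
	}
	fmt.Println("no IPv4 default route found")
}
```

Running it before and after taking an interface down makes it easy to see whether the node has lost its default route entirely.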
-
I have a 5-node HA cluster running k3s 1.27.4, as mentioned in #9854. While observing the cluster behavior by taking down network interfaces, I noticed the following:
When I take one interface down on a node, that node's k3s log shows several timeouts, as seen below:
Apr 08 09:27:41 node-3 k3s[541236]: E0408 09:27:41.392816 541236 leaderelection.go:327] error retrieving resource lock kube-system/kube-controller-manager: Get "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
There are also timeout messages while updating the node status to 127.0.0.1:6443.
Since this is localhost communication, why is it affected by changes in the network? Does setting an interface down cause the server thread to get stuck so that it cannot handle the request?
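For what it's worth, a small probe like the sketch below (my own addition, using the port and 5s timeout from the log line above; the /healthz path is illustrative) can help distinguish whether the loopback listener is actually unresponsive or only slow. Any HTTP status code, even 401, means the loopback connection itself worked; a client timeout reproduces the error in the log.

```go
// Hypothetical probe: check whether the local endpoint on 127.0.0.1:6444
// (from the log above) answers within the same 5s timeout the failing
// components use. Reachability only; authentication is not the point here.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// The endpoint serves a self-signed certificate; skip verification
			// because we only care about whether the loopback path responds.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	url := "https://127.0.0.1:6444/healthz" // path is an assumption
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		fmt.Printf("request failed after %v: %v\n", time.Since(start), err)
		return
	}
	defer resp.Body.Close()
	fmt.Printf("got HTTP %d in %v: loopback path is responding\n", resp.StatusCode, time.Since(start))
}
```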
In my setup the 5 nodes are connected to each other via two leaf switches; each node has 4 interfaces, two of which are connected to a single leaf switch. When I take down the second interface connected to the same leaf switch as the first, I occasionally get "NodeNotReady" events, and sometimes k3s also gets restarted.
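To correlate those "NodeNotReady" events with the moment an interface goes down, something like the sketch below could be run from another node. This is my own addition, not from the thread; it assumes client-go is available and that the kubeconfig path (the k3s default admin kubeconfig) is readable.

```go
// Sketch: periodically print each node's Ready condition so transitions can be
// timestamped against interface changes. Paths and interval are assumptions.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed path: k3s writes its admin kubeconfig here by default.
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/rancher/k3s/k3s.yaml")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	for {
		nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			fmt.Println("list nodes failed:", err)
		} else {
			for _, node := range nodes.Items {
				for _, cond := range node.Status.Conditions {
					if cond.Type == corev1.NodeReady {
						fmt.Printf("%s %s Ready=%s (%s)\n",
							time.Now().Format(time.RFC3339), node.Name, cond.Status, cond.Reason)
					}
				}
			}
		}
		time.Sleep(10 * time.Second)
	}
}
```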