Unable to join workers with BGP Unnumbered RFC 5549 #2323
thanks for moving this ticket here and sorry for the delay.
i should point out that i'm not familiar with "BGP Unnumbered" and with big portions of RFC 5549. having had an overview here, i think this is a bug that should be fixed for 1.20, yet it's unclear whether we can backport it to older releases. probably not, due to the fact that the bug fix will introduce a change in behavior. to be discussed... one overall problem here is that "kubeadm join" for a worker node should not really fetch the "InitConfiguration" from the cluster, or at least i don't see a reason why it should. for CP nodes this is needed.
Q: in case you are trying an HA setup (more than one CP node), are you not seeing this problem because you are being explicit about the advertiseAddress on such CP nodes via JoinConfiguration.controlPlane?
and another Q: can you use
Thanks for the help here! We do have an HA setup; I'm actually unable to join controllers as well. I'm wondering if the fix mentioned in the PR I linked at the top was only for
Here's the kubeadm-join config on a controller:
Here's a worker join w/
We're currently manually patching + building kubeadm with this in order to use kubeadm: https://patch-diff.githubusercontent.com/raw/kubernetes/kubernetes/pull/69578.diff
@eknudtson this looks like a problem in API Machinery in terms of how the transport library selects the outbound interface for connecting to the API server. We should move this issue to k/k and mark it for SIG API Machinery for resolution. Checking for unicast IP is acceptable for IPv4, but we should allow link local in the case of IPv6.
FWIW, here's an example of our controller network interface configs:
Where we have BGP peering running off of the .300 VLAN subinterfaces. The 10.10.10.40 address is an anycast IP that functions as the control plane endpoint IP. The setup there is anycast IP -> local haproxy running on each controller -> load balancing between controller IPs. Kubeadm should select the first IPv4 on the loopback adapter; even better if we can just tell it which IP to use.
To be sure: IPv6 in this case is only for peering with the connected routers automatically w/ BGP Unnumbered. RFC 5549 then kicks in and we advertise and receive IPv4 routes via the peering.
hm, are you sure this --config is not a control-plane config? the warning above indicates this is trying to join a CP node.
we had a discussion about this in the kubeadm office hours meeting today. so there are a couple of issues here:
even if we make kubeadm tolerate your setup, @randomvariable had concerns that the rest of k8s will fail because they use the same utilities. so possibly we'd have to patch that code too (but that's not a kubeadm issue, per se).
neolit123 pipped me to the post, but going by
So, if I understand it correctly, the IPv4 routes back to the API server migrate to the IPv6 interfaces? That would explain why kubelet actually works. I admit I don't understand the RFC in detail.
cc @aojea do you happen to know if this use case is supported (k8s-wide, see OP)? |
I'll try and clarify as best I can:

On a machine, its global unicast IPv4 addresses are present on the loopback interface as /32s. Each interface that peers with the upstream router has IPv6 running with Router Advertisements and neighbor discovery enabled. Thus, each end of the link knows the fe80 (EUI64 autogenerated) link local address of the other end, and the MAC of the other end.

In FRR (the routing suite we use), you can define BGP peers by their interface instead of by IP address (v4 or v6). This allows peering relationships to form without assigning IP addresses, as FRR will just use the IPv6 link locals present on each interface, since each end of the link is already known via router advertisements + ND.

Once these IPv6 BGP peering relationships are formed, you can then exchange routes between peers. IPv4 routes are exchanged and programmed in using RFC5549. If a host receives an IPv4 route, since the machine already knows the MAC of the nexthop (learned over IPv6 neighbor discovery), FRR simply programs a dummy entry into the ARP table:

Routes for 169.254.0.1 are added to the routing table by FRR for each route learned from a peer:

When a packet leaves the host via IPv4 along those default routes above (learned on each interface from the peers on the other end), it picks a path, consults the ARP table for 169.254.0.1 on the egress interface, finds that the MAC on the other end is already known (allowing us to skip an ARP lookup that would never succeed), and the packet is forwarded on its way. Essentially, this allows us to learn the routes over IPv6 and then create fake ARP entries and routes that let us send the packet to the correct place with IPv4. The IPv4s on the loopback are used as the source address for these packets; the default is to pick the first one present, I believe. Let me know if I can clarify or add anything else!
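The selection the poster is asking for ("pick the first IPv4 on the loopback adapter") can be sketched as a pure filtering policy: prefer the first global unicast IPv4 from a candidate list, which skips both the fe80 peering addresses and the 169.254.0.1 dummy next-hop. Names and the candidate list are illustrative; this is not kubeadm's actual code.

```go
package main

import (
	"fmt"
	"net"
)

// chooseBindIP sketches the desired policy: return the first IPv4
// global unicast address, even if it is a /32 hosted on the loopback
// interface. Link-local entries (fe80::, 169.254.0.0/16) are skipped.
// Hypothetical helper for illustration only.
func chooseBindIP(candidates []string) net.IP {
	for _, s := range candidates {
		ip := net.ParseIP(s)
		if ip != nil && ip.To4() != nil && ip.IsGlobalUnicast() {
			return ip
		}
	}
	return nil
}

func main() {
	// Addresses resembling the setup in this thread: fe80 peering
	// address, FRR's dummy ARP next-hop, and a loopback-hosted /32.
	fmt.Println(chooseBindIP([]string{"fe80::1", "169.254.0.1", "10.10.10.41"}))
}
```

This prints 10.10.10.41, i.e. the loopback-hosted unicast /32 wins once the link-local noise is filtered out.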
I don't have any control plane config present in /etc/kubernetes/kubeadm-join.yaml:
Ah, ok, found the caller path finally: Thankfully, it's not the client code causing this. So, if we remove the download of the init config, then that should fix worker node joins at least. |
FYI I'm also seeing errors on
I watched the office hours video on this, and it would be great to be able to tell kubeadm which egress IPv4 to use if we know autodetection will fail. We have several clusters running with BGP unnumbered, and we currently patch kubeadm with https://patch-diff.githubusercontent.com/raw/kubernetes/kubernetes/pull/69578.diff to make upgrading and joining work. Outside of that, everything works well since we're able to explicitly tell other components (kubelet, apiserver) what IPv4 they should bind to.
yes, during the meeting i was trying to explain that this code is used in multiple places in k8s. but if you pass an explicit IP it should work, and local network clusters are something we should support.
ok, as i suspected the problem in kubeadm is deeper. during any command, the process constructs a configuration object that is passed around for common values (such as "imageRepository" for "images pull"). this object is defaulted with dynamic values from the node via the problematic SetInitDynamicDefaults function. however it does not make sense to call these defaults at all for some commands, including "images pull" or "kubeadm join" (for workers).
similar problem to: where dynamic defaults break flag overrides of --cri-socket over the value in a config. @fabriziopandini i think it's time to make the dynamic defaults apply only on demand and not by default during config fetch from cluster, config load from disk, or even for commands like

maybe i can find time for this in 1.20.
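One hypothetical shape for "dynamic defaults only on demand" is an opt-in option on config loading, so that worker join and images pull never trigger IP autodetection while control-plane commands still do. All names below are illustrative, not kubeadm's actual API.

```go
package main

import "fmt"

// InitConfiguration is a minimal stand-in for the real config object.
type InitConfiguration struct {
	AdvertiseAddress string
}

type loadSettings struct{ dynamicDefaults bool }

// LoadOption configures how config loading behaves.
type LoadOption func(*loadSettings)

// WithDynamicDefaults opts in to node-dependent defaulting (e.g. IP
// autodetection), intended only for commands that genuinely need it,
// such as control-plane init/join.
func WithDynamicDefaults() LoadOption {
	return func(s *loadSettings) { s.dynamicDefaults = true }
}

// LoadConfig sketches on-demand defaulting: the problematic
// SetInitDynamicDefaults-style logic runs only when requested.
func LoadConfig(opts ...LoadOption) InitConfiguration {
	var s loadSettings
	for _, o := range opts {
		o(&s)
	}
	cfg := InitConfiguration{}
	if s.dynamicDefaults {
		// Stand-in for the IP autodetection that fails on BGP
		// Unnumbered nodes; a worker join never reaches this branch.
		cfg.AdvertiseAddress = "10.10.10.41"
	}
	return cfg
}

func main() {
	fmt.Println(LoadConfig().AdvertiseAddress)                      // worker join: no autodetection
	fmt.Println(LoadConfig(WithDynamicDefaults()).AdvertiseAddress) // control-plane path
}
```

With this shape, commands that never need an advertise address simply don't pass the option, and the failing autodetection code is never exercised.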
I'm trying to catch up, so forgive me if I miss something
i had to Google BGP when we logged this ticket so i'm really not familiar with this. |
thanks. the apimachinery logic that fails in this case is located here: @eknudtson showed example output above. so today during the kubeadm office hours we discussed the potential to open a ticket in k/k and ask apimachinery (in particular
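The failure mode being discussed can be sketched in simplified form: a filter that only accepts global unicast addresses finds nothing on an interface that carries only an fe80 link-local, so address selection comes up empty. This is a hypothetical simplification, not the actual k8s.io/apimachinery code.

```go
package main

import (
	"fmt"
	"net"
)

// firstGlobalUnicast mimics, in simplified form, the filter described
// above: only global unicast addresses are considered usable.
func firstGlobalUnicast(addrs []string) net.IP {
	for _, s := range addrs {
		if ip := net.ParseIP(s); ip != nil && ip.IsGlobalUnicast() {
			return ip
		}
	}
	return nil
}

func main() {
	// A BGP Unnumbered peering interface typically carries only an
	// fe80:: link-local address, so the filter finds nothing -> the
	// "unable to select an IP" style failure.
	fmt.Println(firstGlobalUnicast([]string{"fe80::a8c1:abff:fe77:1"}))
	// The loopback-hosted /32 unicast, by contrast, would qualify
	// (127.0.0.1 itself is skipped as loopback).
	fmt.Println(firstGlobalUnicast([]string{"127.0.0.1", "10.10.10.41"}))
}
```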
yeah, I am. so you use a global address on the interface because those are always reachable.
Yes, as said today there are two problems at stake
For 1, @eknudtson provided a possible fix, and I will ask people to have eyes on it (this is a tricky piece of code, so the more eyes, the better). For 2, my suggestion is to avoid rushing a solution, and possibly track the problem in a separate issue
ok, i see it was linked here:
it's a bit of a mess, and sadly the problem is also present in commands such as
Sorry, I didn't mean to assign this to myself, just to watch the issue
why does kubeadm join from a worker have to autodetect the IP address? Can you point me to who is calling
It is called from multiple locations (randomvariable listed one above). I think one question that i have is whether the function should tolerate this use case / setup.
why does kubeadm join from a worker have to autodetect the IP address?

It shouldn't. That is the issue on the kubeadm side, and the interface function is the other problem we are looking at.
I just wanted to point out that BGP Unnumbered + RFC5549 do appear to be explicitly supported via kubernetes/kubernetes#83475. It appears the other functions in kubeadm weren't also fixed to work with it.
ResolveBindAddress is also used by the kube-apiserver - if one does not pass an explicit value to its --bind-address flag, so presumably it will fail there too. i'd like to see what the owners of kube-apiserver think about the change that you did here: so potentially:
also please ping me and @aojea on that ticket. showing the full kubeadm output and the kubeadm use case is not directly relevant, only this part:
your interface setup and the proposed DIFF are relevant, i guess.
this is now here:
/close and sorry for the shuffle of issues, but your report surfaced a number of problems in old code.
@neolit123: Closing this issue.
Is this a BUG REPORT or FEATURE REQUEST?
Choose one: BUG REPORT
Versions

kubeadm version (use kubeadm version): 1.17.12

Environment:
- kubectl version: 1.17.12
- uname -a: 3.10.0-1127.19.1.el7.x86_64

What happened?
Related to the following:
#1156
kubernetes/kubernetes#83475
When I attempt to kubeadm join a worker with routing to the host (unicast IPv4 on the loopback, IPv6 link locals on interfaces), kubeadm fails to join workers with the following output:
verbose error:
What you expected to happen?
It looks like the fix introduced in this PR only works for control plane nodes. I'd like to be able to join workers with the node's unicast address being on the loopback interface.
How to reproduce it (as minimally and precisely as possible)?
On a node with BGP Unnumbered via RFC 5549 + a loopback unicast, run kubeadm join --config=/etc/kubernetes/kubeadm-join.yaml.

kubeadm-join.yaml:
discovery.conf:
Anything else we need to know?