Talos 1.8.3 advertising virtual MAC addresses #9837
Comments
First, please check MetalLB or anything else you're running on your host network. Talos only does forced advertisement for VIPs, but they are advertised with the MAC address of the link. |
The pods running with Managed by Talos:
Added by us:
Via: openebs-dynamic-localpv-provisioner
For ingesting system logs into Loki, as suggested by Talos documentation here: https://www.talos.dev/v1.8/talos-guides/configuration/logging/#vector-example. We are not using MetalLB, as we rely on Cloudflare tunnels for HTTP ingress and NodePort for a pair of Postgres databases. Note that all of this has been running in our cluster since day one, and appears to run fine on Talos 1.8.2 without triggering any MAC abuse reports. |
As we got another round of abuse reports for these servers, as well as one additional worker node which had also been upgraded to Talos 1.8.3, we have now downgraded all nodes to 1.8.2 to see if this prevents more reports from triggering. |
Any update on whether this worked for you? Just stood up a cluster on 1.8.3 in Hetzner to run into this issue... |
Yes, downgrading to 1.8.2 stopped triggering the MAC abuse reports! We have yet to try out 1.9.0 in case that has somehow rectified the issues again. Would be interesting to know. |
@m4xmorris did you try running 1.9.x yet by any chance? We'd like to upgrade but are living in fear 😅 |
Afraid the issue still appears to be present in 1.9.x🙃 Hetzner reports continued to come in after upgrading |
Same Problem here :( |
@m4xmorris @fmei-dm could you also do an inventory of pods running on hostNetwork as I did here? Maybe we'll find a pattern 🤞 |
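An inventory like the one requested above can be produced from the pod list in JSON form. The following is a minimal sketch, assuming the input has the shape of a Kubernetes `PodList` object (an `items` array whose entries carry `metadata` and `spec`); `host_network_pods` is a hypothetical helper name, not part of any tool mentioned in this thread.

```python
from typing import Any, Dict, List


def host_network_pods(pod_list: Dict[str, Any]) -> List[str]:
    """Return sorted 'namespace/name' strings for pods with spec.hostNetwork set to true."""
    found = []
    for pod in pod_list.get("items", []):
        # hostNetwork is optional in the pod spec and defaults to false when absent.
        if pod.get("spec", {}).get("hostNetwork", False):
            meta = pod.get("metadata", {})
            found.append(f"{meta.get('namespace', 'default')}/{meta.get('name', '?')}")
    return sorted(found)
```

With cluster access, the input could come from the output of `kubectl get pods --all-namespaces -o json`, parsed with `json.loads`.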
I did some research but found no explicit root cause. What I found is that between Talos 1.8.2 and 1.8.3 a new version of Flannel came into use (ghcr.io/siderolabs/flannel:v0.25.7); see https://github.com/siderolabs/talos/releases/tag/v1.8.3. This version of Flannel uses a new version of netlink (v1.3.0); see flannel-io/flannel@bfb3669. I scrolled through the diff between the two netlink versions used in Talos 1.8.2 and Talos 1.8.3 (vishvananda/netlink@v1.2.1-beta.2...v1.3.0) but could not find anything that clearly causes the problem. Still, I have a strong suspicion that it could be related to the new version of Flannel. Regarding your question: we do not use host networking at all, except for the standard services that are part of the default Talos installation. Nevertheless, I will have a look and post the results. |
Here are all pods on the affected cluster with host networking enabled. Since worker nodes are also affected, only kube-proxy, kube-flannel or the Datadog agent can be the root cause. Since you are not using the Datadog agent, only kube-proxy or kube-flannel remain as candidates. Does anyone know if it is possible to manually downgrade Flannel on Talos?
|
You can, by disabling the Talos CNI and deploying your own Flannel the way you'd like. As a quick hack you can change versions in the DaemonSet. |
I'm no longer running a cluster in Hetzner so not able to help too much, but, I was having this issue when using Cilium (with |
Many thanks. I think I can do that, but I don't know how to reproduce the error: I got abuse reports for 4 out of 9 K8s nodes, so even if I get no abuse notification, I cannot be sure whether it worked. What I can try is to start a tcpdump on every K8s worker and listen for all Ethernet frames that do not originate from the device's MAC address. Maybe I'll be lucky and capture some of the bad packets. Does anybody have a better idea? |
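The capture idea above can be prototyped offline. The following is a minimal sketch, assuming frames are already available as raw bytes (for example, extracted from a capture file by some other tool); `parse_ethernet_header` and `foreign_source_frames` are hypothetical helper names, not part of Talos or tcpdump.

```python
from typing import Iterable, List, NamedTuple


class EthernetHeader(NamedTuple):
    dst: str
    src: str
    ethertype: int


def format_mac(raw: bytes) -> str:
    """Render 6 raw bytes as a lowercase, colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_ethernet_header(frame: bytes) -> EthernetHeader:
    """Parse the 14-byte Ethernet II header: destination (6), source (6), EtherType (2)."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    return EthernetHeader(
        dst=format_mac(frame[0:6]),
        src=format_mac(frame[6:12]),
        ethertype=int.from_bytes(frame[12:14], "big"),
    )


def foreign_source_frames(frames: Iterable[bytes], nic_mac: str) -> List[EthernetHeader]:
    """Return headers of frames whose source MAC differs from the NIC's own MAC."""
    own = nic_mac.lower()
    return [h for h in map(parse_ethernet_header, frames) if h.src != own]
```

On a live interface the same check only makes sense for outbound traffic, since every inbound frame legitimately carries a different source MAC. The pcap filter primitive `ether src` expresses the same condition at capture time.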
The only way Talos Linux advertises IPs from userspace is a Layer 2 VIP (not Hetzner VIP even), but it would advertise VIP address, not a random IP. My only guess is that these advertisements somehow slip from the pod networking pods, but that's a wild guess, and not sure how they get to the outbound NIC. But even if they do, totally unclear what kind of a problem it is for Hetzner, a switch should be configured to lock by MAC, so who cares what gets advertised. |
I used the following command to start a tcpdump in a screen session on every worker node.
|
Found the following packets sent out on the ethernet interfaces. Findings:
|
Checked whether a process was created on one of the affected Kubernetes nodes around the timestamp of the packets. Not even close. So the packets must have been emitted by a process that was already running. It seems to have nothing to do with a reconfiguration of Flannel networking when creating/evicting pods. |
I was completely wrong - Flannel cannot be the problem. It was not upgraded from 1.8.2 to 1.8.3; somehow I made a mistake when comparing the versions. So I want to check whether the Linux kernel can be the problem. It has been upgraded from 6.6.58 to 6.6.60. I've checked the changelogs - there are many changes which could have that effect, but there are so many that I simply cannot check all of them. @smira is it possible to downgrade the Linux kernel to 6.6.58 but keep the rest of Talos as is? |
You can if you do a custom build downgrading |
Thanks! I now have Talos v1.8.4 running with kernel 6.6.58. Let's see what happens. One tricky part if you don't know it: you have to find the version suffix for the packages in the commit history of Talos (branch 1.8). See sample commit: 9f62fe9. Also, v1.8.3 added 2 kernel modules for block device caching; I had to delete those lines in hack/modules-amd64.txt
|
@fmei-dm also Talos 1.9.x is based on Linux 6.12.x, so the bug might not be there |
@m4xmorris already tried that. Same problem:
|
That's weird... The strange thing is that these packets appear to be simply corrupted somehow (?). |
Absolutely. They seem like garbage. No known MAC address range (for both source and destination MAC), no known EtherType. I've seen an issue where the VLAN tag was placed into the EtherType field by a bug, but that was on Ubuntu, and we don't use VLAN tagging on the outgoing interface of these servers anyway. So even then it must be some kind of bug. Very weird. |
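The "looks like garbage" judgement can be made a little more systematic by checking header fields against well-known values. The following is a minimal sketch, assuming only a handful of common EtherTypes matter; the `KNOWN_ETHERTYPES` table is deliberately incomplete and the function names are hypothetical.

```python
from typing import Dict

# A small, deliberately incomplete table of well-known EtherType values.
KNOWN_ETHERTYPES = {
    0x0800: "IPv4",
    0x0806: "ARP",
    0x8100: "802.1Q VLAN tag",
    0x86DD: "IPv6",
}


def describe_ethertype(value: int) -> str:
    """Classify the 16-bit EtherType/length field of an Ethernet frame."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("EtherType must fit in 16 bits")
    if value <= 1500:
        return "802.3 length field"
    if value < 0x0600:
        return "undefined (1501-1535)"
    return KNOWN_ETHERTYPES.get(value, "unknown")


def mac_flags(mac: str) -> Dict[str, bool]:
    """Decode the two flag bits in the first octet of a colon-separated MAC address."""
    first_octet = int(mac.split(":")[0], 16)
    return {
        "multicast": bool(first_octet & 0x01),             # I/G bit
        "locally_administered": bool(first_octet & 0x02),  # U/L bit
    }
```

A frame whose EtherType is reported as "unknown" and whose source MAC is neither the NIC's address nor locally administered would fit the description of the packets seen in this thread.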
Had another idea: what if this problem is a hardware issue with the new kernel version? So I checked the network driver in use in dmesg. It's using the pretty common r8169 module. Exactly this module was patched in kernel version 6.6.59. There was a discussion about "issues under heavy load" in https://bugzilla.kernel.org/show_bug.cgi?id=219388. From https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.6.59:
|
Yes, this is a great find! In general, Realtek drivers are problematic. Talos 1.10 should come with Ethernet low-level configuration, which might allow you to disable e.g. checksum offloading, if that helps to work around the issue (?). |
Can't be r8169, same problem on other cluster with e1000e interfaces. meh. |
The problem seems to be gone when using the 6.6.58 kernel, so it must be a change in 6.6.59 or 6.6.60. What now? @smira: Do you have any idea how to identify the relevant patch? I've searched the kernel changelogs for changes that belong to the network stack but are not tied to specific network hardware. I identified one which might be relevant, but have no idea if I am right. Do you guys (from Talos) have contact with kernel developers for the networking stack? The patch which might be the problem is:
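Narrowing a changelog down to candidate commits can be partly automated. The following is a minimal sketch, assuming the changelog text is in plain `git log` layout (a `commit <id>` line followed by an indented message whose first non-empty line is the subject); the function names and the prefix list in the usage note are illustrative assumptions, not the method actually used in this thread.

```python
import re
from typing import Iterable, List, Tuple


def parse_changelog(text: str) -> List[Tuple[str, str]]:
    """Split git-log style text into (commit id, subject line) pairs."""
    entries = []
    commit_id = None
    for line in text.splitlines():
        match = re.match(r"^commit ([0-9a-f]{7,40})\b", line)
        if match:
            commit_id = match.group(1)
        elif commit_id and line.startswith("    ") and line.strip():
            # The first non-empty indented line after a commit header is its subject.
            entries.append((commit_id, line.strip()))
            commit_id = None
    return entries


def filter_by_prefix(entries: Iterable[Tuple[str, str]], prefixes: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep entries whose subject starts with one of the given subsystem prefixes."""
    wanted = tuple(p.lower() for p in prefixes)
    return [(c, s) for c, s in entries if s.lower().startswith(wanted)]
```

Calling `filter_by_prefix(parse_changelog(text), ["net:", "tcp:", "udp:"])` would keep only subjects tagged with those subsystems. Subject prefixes are a convention rather than a guarantee, so a relevant commit could still be missed.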
|
Created Kernel Bug Report: https://bugzilla.kernel.org/show_bug.cgi?id=219766 |
@fmei-dm it looks like they have already identified the issue: https://lore.kernel.org/netdev/[email protected]/T/ I also had a Hetzner claim today; I gave them all the details and they accepted it. I hope the fix lands quickly so I can upgrade to 1.10 or similar. Or is there an easy way to be on a modern version and keep the kernel version, @smira? |
Pretty straightforward, if you know how. The only tricky part was finding out which kernel package to use. After building, the image is pushed to the Docker registry of your choice. You can then reference the InstallImage in talosctl upgrade. I made the Docker image public - I don't know whether Talos can pull from a password-protected registry.
|
Like I said above, Talos 1.9 is using Linux 6.12, and it's not clear to me whether it is affected or not |
@smira m4xmorris already tested it. Same problem. See linked comment. |
@smira it seems the bugfix will make it into kernel 6.14. Do you know which Talos version will use this kernel? Is it possible to backport this little patch to kernel 6.12 in the official Talos release? |
yes, as long as you have a commit ref to Linux |
Awesome! Below you can find the requested links:
|
PS: is it safe to upgrade from 1.8.2 to the latest 1.9 version? I ask because I can not upgrade to the latest minor because of this problem. |
yes |
See siderolabs/talos#9837 This causes invalid Ethernet packets to be sent out, which might trigger unrelated issues in some environments. See: * git.kernel.org/netdev/net/c/0e4427f8f587 * lore.kernel.org/netdev/[email protected]/T Signed-off-by: Andrey Smirnov <[email protected]>
Ok, the fix will be backported to the next 1.9.x release. Thanks for finding the patch! |
Bug Report
After upgrading some of our nodes from Talos 1.8.2 to 1.8.3 we've received MAC address abuse reports from Hetzner (we're deploying on Hetzner dedicated servers).
The reports state:
And proceed to give a list of unallowed MAC addresses used, a link to execute a re-check after resolving the issue, and a link to make a statement of why the MAC abuse occurred.
We've been running this Talos cluster on Hetzner for about two years and have only gotten reports like this in the last week after upgrading to Talos 1.8.3. So this upgrade is the only discrepancy we have to go on right now.
Description
Timeline:
control-plane-1
control-plane-1 from Hetzner
control-plane-2
control-plane-1 from Hetzner
mixed-2
mixed-2 from Hetzner
Our network config is extremely simple. We run the default flannel CNI, get a public IP from Hetzner via DHCP and enable KubeSpan, e.g.:
We've introspected the Talos network via `talosctl get links`, which does show a lot of `veth` devices; however, none of them have matched the MAC addresses reported by Hetzner. We assume that the `veth` devices are intended to stay internal to the cluster network and not be advertised on the physical network.
While our "mixed" worker is running all kinds of workloads, the 2 control plane nodes are tainted as control planes and are not running anything out of the ordinary.
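The comparison between link MAC addresses and the reported addresses can be sketched as follows. This is a minimal sketch, assuming both lists are available as plain strings in possibly different notations; `normalize_mac` and `matching_macs` are hypothetical helper names, and no particular `talosctl` output format is assumed.

```python
import re
from typing import Iterable, Set


def normalize_mac(mac: str) -> str:
    """Normalize notations such as 'AA-BB-CC-DD-EE-FF' or 'aabb.ccdd.eeff' to 'aa:bb:cc:dd:ee:ff'."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def matching_macs(link_macs: Iterable[str], reported_macs: Iterable[str]) -> Set[str]:
    """Return the reported MAC addresses that also appear on a local link."""
    links = {normalize_mac(m) for m in link_macs}
    reported = {normalize_mac(m) for m in reported_macs}
    return links & reported
```

An empty result corresponds to the observation above that none of the local devices matched the reported addresses.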
When clicking the "re-check" link, the report we get back is that the issue has been resolved. So these issues appear to have been transient. It's unclear if they can occur again, e.g. when rebooting the nodes or similar. We don't know how to reproduce, or even how to monitor if unallowed MAC addresses are being advertised.
We don't mind spending time further troubleshooting this if someone can guide us in what to do.
For now we'll send a statement to Hetzner about the little we know, including a link to this issue, and hold off upgrading any other nodes for the time being.
Environment
Talos version: 1.8.3
Kubernetes version: v1.30.6