systemctl restart systemd-sysctl (e.g. as part of cis-1.6 enablement) with Cilium CNI breaks outgoing communication of coredns #2021
Comments
Does it do this if you only restart the service on the node where coredns is running, or does this only happen if you restart the service on ALL the nodes? Is access to other endpoints outside the cluster also affected, or does this only affect DNS? Can you attach RKE2 logs from the affected nodes?
I just reproduced it - I have a coredns pod running on master1 and did a systemctl restart rke2-server on that machine. coredns then logs:
[ERROR] plugin/errors: 2 rancher-logging-fluentd.cattle-logging-system.svc. A: read tcp 172.27.0.174:39330->10.101.1.1:53: i/o timeout
And 10.101.1.1 is the DNS server "outside".
I am just focusing on DNS at the moment, as without DNS in the cluster many other things break as a secondary issue. Logs from rke2-server attached.
After waiting a while - here are the full logs.
Hey Martin, I was about to try reproducing the issue and realized that your reported version (RKE2 Version: 1.21.11-rke2r2) does not exist. Did you perhaps mean 1.20.11-rke2r2?
Sorry - typo - 1.20.11-rke2r2
Unfortunately, I don't have access to a vmware env. When trying in AWS with one server and one worker (Ubuntu 20 or SLES15-SP2), I can't reproduce the issue. Perhaps it only happens in a vmware env? When the issue happens, are you able to resolve DNS hostnames from the host?
Yes - on the host I can resolve. Do you have cilium / cis-1.6 active and adjusted / non-default CIDRs? Just had another observation - I did an rke2-server restart on "master1" and the coredns pod on worker02 stopped working / I get the "read tcp 172.27.3.127:33168->10.101.1.1:53: i/o timeout" there. Something seems to break in cilium.
The issue does not appear when restarting rke2-server or rke2-agent. It appears when executing systemctl restart systemd-sysctl. I am not sure if reloading kernel parameters is supported; it might break something in the BPF code. I will ask in the Cilium community - @vadorovsky, do you know?
Cilium is setting sysctls like:
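Roughly, for illustration (the exact set depends on the agent configuration and datapath mode; these are typical values, not a verbatim list):

```sh
# Illustrative sketch only - the kind of parameters cilium-agent adjusts at
# startup: relaxed reverse-path filtering on its interfaces and forwarding on.
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.cilium_host.rp_filter=0
sysctl -w net.ipv4.ip_forward=1
```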
And so on. I guess that restarting systemd-sysctl re-applies the static host configuration and overwrites those values.
Restarting the cilium pod on the affected node fixes the issue too.
I would suggest just simply not doing systemctl restart systemd-sysctl on a running cluster.
https://docs.rke2.io/security/hardening_guide/ -> there we mention this, so if we must not do such a restart I think we should also document that.
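For context, the step referenced there boils down to roughly the following (paraphrased; the file name and exact values may differ between guide versions):

```sh
# Paraphrased from the hardening guide's "Set Kernel Parameters" step
# (file name and values may differ between releases):
cat > /etc/sysctl.d/60-rke2-cis.conf <<'EOF'
vm.panic_on_oom=0
vm.overcommit_memory=1
kernel.panic=10
kernel.panic_on_oops=1
EOF
systemctl restart systemd-sysctl
```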
Oh, then we need to tweak the rke2 sysctl conf to be compatible with Cilium sysctls. I'll take care of that.
I have forwarding enabled "globally" on container/k8s/docker/podman hosts, in /etc/sysctl.d/70-yast.conf and /etc/sysctl.d/99-salt.conf.
So I assume it might be one of these that gets lost:
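For example, entries of this kind (illustrative values, not the verbatim contents of those files):

```sh
# Illustrative only - the sort of forwarding entries such drop-ins carry:
grep -h -i forward /etc/sysctl.d/70-yast.conf /etc/sysctl.d/99-salt.conf
# net.ipv4.ip_forward = 1
# net.ipv6.conf.all.forwarding = 1
```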
Do we know what parameters cilium sets to which values, so that we can add them directly to our /etc/sysctl.d/....conf?
@Martin-Weiss can you update the issue title and steps to correctly reflect the action you were taking that caused the issue? If you're running a script that takes a bunch of actions, let's be more clear about that; we were trying to reproduce this by just restarting the service as you reported and couldn't see how that would cause it.
After looking at the Cilium code, I must say that the sysctls set by cilium-agent are quite dynamic - they depend on configuration - so I think it's best if we just let Cilium do its job of setting them and not modify them. That means I think you should remove the Cilium-managed parameters from the static sysctl configuration.
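One rough way to see which of these values a systemd-sysctl restart actually overwrites on a node (a sketch, assuming shell access to the node):

```sh
# Snapshot rp_filter/forwarding values before and after the restart and
# diff them to see what the static host configuration overwrote.
sysctl -a 2>/dev/null | grep -E 'rp_filter|forward' | sort > /tmp/sysctl-before
systemctl restart systemd-sysctl
sysctl -a 2>/dev/null | grep -E 'rp_filter|forward' | sort > /tmp/sysctl-after
diff -u /tmp/sysctl-before /tmp/sysctl-after
```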
I believe the command should be changed to …
(#2113) Make it clear that setting sysctls and using systemd-sysctl should be done only after RKE2 installation and before actual Kubernetes deployment, because Kubernetes components or CNI plugins might modify some sysctls on their own. Ref: #2021 Signed-off-by: Michal Rostecki <[email protected]>
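In practice the documented ordering amounts to something like this on a server node (a sketch; exact paths and commands depend on the install method):

```sh
# Sketch of the recommended ordering (assumed commands, tarball install):
curl -sfL https://get.rke2.io | sh -                 # 1. install RKE2
cp -f /usr/local/share/rke2/rke2-cis-sysctl.conf \
      /etc/sysctl.d/60-rke2-cis.conf                 # 2. stage the CIS sysctls
systemctl restart systemd-sysctl                     # 3. apply them now
systemctl enable --now rke2-server.service           # 4. only then start RKE2
```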
Hello, I know this is an old thread, but it seems we are hitting the same problem.
Environmental Info:
Node(s) CPU architecture, OS, and Version:
CNI provider is calico v3.25.0
Situation:
That works fine for all servers, except for the rancher nodes. Pods start crashing after Chef applies the settings with …
Testing Steps
Manual Execution: SLES SP3
Configuration
Preconditions
Steps
Results/Observations:
Conclusion: No reported network issues
Accidentally closed the issue.
Update 27.10.2021 - it is not the rke2-server/rke2-agent restart breaking cilium network communication - it is systemctl restart systemd-sysctl!
Environmental Info:
RKE2 Version: 1.20.11-rke2r2
Node(s) CPU architecture, OS, and Version:
SLES 15 SP3 x86_64 within VMware ESXi
Cluster Configuration:
3 servers, 4 agents, Cilium as CNI, cis-1.6 profile
non-default CIDR:
cluster-cidr: "172.27.0.0/16"
service-cidr: "172.28.0.0/16"
cluster-dns: "172.28.0.10"
Describe the bug:
Have the cluster up and running well, then restart systemd-sysctl on all servers.
Then check pod status and logs, especially of coredns, and observe coredns failing to communicate with the DNS server specified in resolv.conf on the host.
Steps To Reproduce:
(restart based on https://docs.rke2.io/security/hardening_guide/#set-kernel-parameters)
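In short (a sketch of the reproduction; the coredns deployment name is assumed and may differ by chart version):

```sh
# On every node of the previously healthy cluster, as per the hardening guide:
systemctl restart systemd-sysctl

# Then watch coredns - lookups of external names start failing with
# "read tcp ...:53: i/o timeout" towards the upstream DNS server:
kubectl -n kube-system get pods | grep coredns
kubectl -n kube-system logs deploy/rke2-coredns-rke2-coredns --tail=20
```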
Expected behavior:
coredns keeps resolving external names after the restart.
Actual behavior:
coredns lookups to the upstream DNS server time out (read tcp ...:53: i/o timeout).
Some more details - this is SLES 15 SP3 having these sysctl settings:
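To dump the full set of values that a systemd-sysctl restart re-applies on such a host, something like this works (assuming a reasonably recent systemd, as shipped with SLES 15 SP3):

```sh
# Print the merged configuration systemd-sysctl would (re)apply, i.e. the
# combined drop-ins from /etc/sysctl.d, /run/sysctl.d and /usr/lib/sysctl.d:
/usr/lib/systemd/systemd-sysctl --cat-config
```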