
systemctl restart systemd-sysctl i.e. as part of cis-1.6 enablement with cilium CNI breaks outgoing communication of coredns #2021

Open
Martin-Weiss opened this issue Oct 25, 2021 · 20 comments

@Martin-Weiss

Martin-Weiss commented Oct 25, 2021

Update 27.10.2021 - it is not the rke2-server/rke2-agent restart that breaks cilium network communication - it is systemctl restart systemd-sysctl!

Environmental Info:
RKE2 Version: 1.20.11-rke2r2

Node(s) CPU architecture, OS, and Version:
SLES 15 SP3 x86_64 within VMware ESXi

Cluster Configuration:
3 server, 4 agents, Cilium as CNI, cis-1.6 profile
non-default CIDR:
cluster-cidr: "172.27.0.0/16"
service-cidr: "172.28.0.0/16"
cluster-dns: "172.28.0.10"

Describe the bug:
Bring the cluster up and running, then restart systemd-sysctl on all servers.
Then check pod status and logs, especially of coredns, and observe coredns failing to reach the DNS server specified in resolv.conf on the host.
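A minimal sketch of this check, assuming kubectl access from a workstation (the k8s-app=kube-dns label is taken from the coredns pod descriptions later in this issue):

# on every server node, re-apply sysctl configuration the same way the hardening steps do
$ sudo systemctl restart systemd-sysctl

# then check coredns status and logs for upstream DNS i/o timeouts
$ kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
$ kubectl -n kube-system logs -l k8s-app=kube-dns --tail=20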

Steps To Reproduce:

Expected behavior:

  • network communication works even after restarting systemd-sysctl

Actual behavior:

  • coredns and other pods are stuck in failures, not being able to communicate outside the cluster
  • rebooting a node fixes the problem

Some more details - this is SLES 15 SP3 with these sysctl settings:

/etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6 = 1

/etc/sysctl.d/60-rke2-cis.conf
vm.panic_on_oom=0
vm.overcommit_memory=1
kernel.panic=10
kernel.panic_on_oops=1

/etc/sysctl.d/70-yast.conf
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 0
net.ipv6.conf.all.disable_ipv6 = 1

/etc/sysctl.d/99-salt.conf
#
# Kernel sysctl configuration
#
net.ipv4.ip_forward = 1
@brandond
Member

brandond commented Oct 25, 2021

Does it do this if you only restart the service on the node where coredns is running, or does this only happen if you restart the service on ALL the nodes? Is access to other endpoints outside the cluster also affected, or does this only affect DNS? Can you attach RKE2 logs from the affected nodes?

@Martin-Weiss
Author

Does it do this if you only restart the service on the node where coredns is running,

I just reproduced it - I have a coredns pod running on master1 and did a systemctl restart rke2-server on that machine.
Right after this I can see these messages in the coredns pod on that host:

[ERROR] plugin/errors: 2 rancher-logging-fluentd.cattle-logging-system.svc. A: read tcp 172.27.0.174:39330->10.101.1.1:53: i/o timeout
[ERROR] plugin/errors: 2 rancher-logging-fluentd.cattle-logging-system.svc. A: read tcp 172.27.0.174:39332->10.101.1.1:53: i/o timeout

And 10.101.1.1 is the DNS server "outside".

or does this only happen if you restart the service on ALL the nodes? Is access to other endpoints outside the cluster also affected, or does this only affect DNS? Can you attach RKE2 logs from the affected nodes?

I am focusing on DNS at the moment, as without DNS in the cluster many other things break as a secondary issue.
It seems that ./rancher2_logs_collector.sh also gets stuck in this case at "Collecting rke2 system pod logs", so which logs should I collect manually?

Logs from rke2-server attached..
rke2-server-syslog.gz

@Martin-Weiss
Author

After waiting a while - here are the full logs

rke-test-master-01-2021-10-25_08_58_39.tar.gz

@manuelbuil
Contributor

Hey Martin, I was about to try reproducing the issue and realized that your reported version (RKE2 Version: 1.21.11-rke2r2) does not exist. Did you perhaps mean v1.21.5+rke2r2?

@Martin-Weiss
Author

Hey Martin, I was about to try reproducing the issue and realized that your reported version (RKE2 Version: 1.21.11-rke2r2) does not exist. Did you perhaps mean v1.21.5+rke2r2?

Sorry - typo - 1.20.11-rke2r2

@manuelbuil
Contributor

Unfortunately, I don't have access to a vmware env. When trying in AWS with one server and one worker (Ubuntu 20 or SLES15-SP2), I can't reproduce the issue. Perhaps it only happens in vmware env? When the issue happens, are you able to resolve DNS hostnames from the host?

@Martin-Weiss
Author

Martin-Weiss commented Oct 25, 2021

Unfortunately, I don't have access to a vmware env. When trying in AWS with one server and one worker (Ubuntu 20 or SLES15-SP2), I can't reproduce the issue. Perhaps it only happens in vmware env? When the issue happens, are you able to resolve DNS hostnames from the host?

Yes - on the host I can resolve DNS names.

Do you have cilium / cis-1.6 active and adjusted / non-default CIDRs?

Just had another observation - I did an rke2-server restart on "master1" and the coredns pod on worker02 stopped working / I get "read tcp 172.27.3.127:33168->10.101.1.1:53: i/o timeout" there. Something seems to break in cilium.

@manuelbuil
Contributor

The issue does not appear when restarting rke2-server or rke2-agent. It appears when executing sudo systemctl restart systemd-sysctl. I have run your script without that line and things don't break; if I execute that line, things start to break.

I am not sure if reloading kernel parameters is supported; it might break something in the BPF code. I will ask in the Cilium community. @vadorovsky do you know?

@vadorovsky
Contributor

Cilium is setting sysctls like:

  • net.ipv*.conf.default.forwarding
  • net.ipv*.conf.all.forwarding
  • net.ipv*.ip_local_port_range
  • net.core.fb_tunnels_only_for_init_net

And so on. I guess that systemctl restart systemd-sysctl is breaking those. Cilium sets up those sysctls only when it starts.
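For reference, a quick read-only way to check the current values of the sysctls named above on an affected node (IPv4 variants shown; this only reads values, it does not change anything):

$ sysctl net.ipv4.conf.all.forwarding net.ipv4.conf.default.forwarding
$ sysctl net.ipv4.ip_local_port_range
$ sysctl net.core.fb_tunnels_only_for_init_net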

@manuelbuil
Contributor

Cilium is setting sysctls like:

  • `net.ipv*.conf.default.forwarding`
  • `net.ipv*.conf.all.forwarding`
  • `net.ipv*.ip_local_port_range`
  • `net.core.fb_tunnels_only_for_init_net`

And so on. I guess that systemctl restart systemd-sysctl is breaking those. Cilium sets up those sysctls only when it starts.

Restarting the cilium pod on the affected node fixes the issue too
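For anyone needing that workaround, a rough sketch assuming the Cilium agent DaemonSet uses the usual k8s-app=cilium label in kube-system (adjust to your deployment):

$ kubectl -n kube-system get pods -l k8s-app=cilium -o wide
# delete the agent pod running on the affected node; the DaemonSet recreates it and it re-applies its sysctls
$ kubectl -n kube-system delete pod <cilium-pod-on-affected-node>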

@vadorovsky
Contributor

I would suggest just simply not doing systemctl restart systemd-sysctl, as I see absolutely no reason to do that, especially if the goal is just to restart rke2.

@Martin-Weiss
Author

I would suggest just simply not doing systemctl restart systemd-sysctl, as I see absolutely no reason to do that, especially if the goal is just to restart rke2.

https://docs.rke2.io/security/hardening_guide/ -> we mention this step there, so if we must not do such a restart, I think we should also document that.

@vadorovsky
Contributor

Oh, then we need to tweak the rke2 sysctl conf to be compatible with Cilium sysctls. I'll take care of that.

@vadorovsky vadorovsky self-assigned this Oct 26, 2021
@Martin-Weiss
Author

Martin-Weiss commented Oct 26, 2021

I have forwarding enabled "globally" on container/k8s/docker/podman hosts in /etc/sysctl.d/70-yast.conf and /etc/sysctl.d/99-salt.conf:

net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 0

So I assume it might be one of these that gets lost:

  • `net.ipv*.ip_local_port_range`
  • `net.core.fb_tunnels_only_for_init_net`

Do we know what parameters cilium sets to which values so that we can add them directly to our /etc/sysctl.d/....conf ?
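One rough way to find out empirically (a sketch; the temp file names are arbitrary): snapshot all sysctls on a node before and after the cilium agent pod (re)starts there, and diff the two snapshots.

$ sudo sysctl -a | sort > /tmp/sysctl-before.txt
# restart the cilium agent pod on this node and wait for it to become Ready
$ sudo sysctl -a | sort > /tmp/sysctl-after.txt
$ diff /tmp/sysctl-before.txt /tmp/sysctl-after.txt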

@brandond
Member

brandond commented Oct 26, 2021

@Martin-Weiss can you update the issue title and steps to correctly reflect the action you were taking that caused the issue? If you're running a script that takes a bunch of actions, let's be more clear about that; we were trying to reproduce this by just restarting the service as you reported and couldn't see how that would cause it.

@Martin-Weiss Martin-Weiss changed the title Restarting rke2-server or rke2-agent with cilium CNI breaks outgoing communication of coredns systemctl restart systemd-sysctl i.e. as part of cis-1.6 enablement with cilium CNI breaks outgoing communication of coredns Oct 27, 2021
@vadorovsky
Contributor

After looking at the Cilium code, I must say that the sysctls set by cilium-agent are quite dynamic; they depend on the configuration. So I think it's best if we just let Cilium do its job of setting them and not modify them.

That means I think you should remove systemctl restart systemd-sysctl from your scripts. The documentation (https://docs.rke2.io/security/hardening_guide/) doesn't present this step as something to be done on rke2 restart, but rather as an installation step. I will make a change to the docs anyway to highlight that it's only an initial step and shouldn't be done on a running Kubernetes cluster.

@Martin-Weiss
Author

I believe the command should be changed to sysctl --system instead of systemctl restart systemd-sysctl in general.
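A narrower alternative, already used in the hardening steps quoted later in this issue, is to re-apply only the RKE2 CIS file instead of re-reading every sysctl drop-in, which avoids touching anything a CNI may have set at runtime:

$ sudo sysctl -p /etc/sysctl.d/60-rke2-cis.conf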

vadorovsky added a commit to vadorovsky/rke2 that referenced this issue Nov 11, 2021
Make it clear that setting sysctls and using systemd-sysctl should be
done only after RKE2 installation and before actual Kubernetes
deployment, because Kubernetes components or CNI plugins might modify
some sysctls on their own.

Ref: rancher#2021
Signed-off-by: Michal Rostecki <[email protected]>
vadorovsky added a commit to vadorovsky/rke2 that referenced this issue Nov 15, 2021
Make it clear that setting sysctls and using systemd-sysctl should be
done only after RKE2 installation and before actual Kubernetes
deployment, because Kubernetes components or CNI plugins might modify
some sysctls on their own.

Ref: rancher#2021
Signed-off-by: Michal Rostecki <[email protected]>
dweomer pushed a commit that referenced this issue Nov 15, 2021
(#2113)

Make it clear that setting sysctls and using systemd-sysctl should be
done only after RKE2 installation and before actual Kubernetes
deployment, because Kubernetes components or CNI plugins might modify
some sysctls on their own.

Ref: #2021
Signed-off-by: Michal Rostecki <[email protected]>
@vanfroda

vanfroda commented Jun 7, 2023

Hello,

I know this is an old thread, but it seems we are hitting the same problem.
In our case we are running

Environmental Info:
RKE2 Version: v1.24.13+rke2r1

Node(s) CPU architecture, OS, and Version:
Red Hat Enterprise Linux 8.7 x86_64 within VMware ESXi

CNI provider is calico v3.25.0

Situation:
All linux hosts in our environment are controlled by chef.
If we push new sysctl settings to our servers, in this case:

net.ipv4.tcp_keepalive_time = 45
net.ipv4.tcp_keepalive_intvl = 45
net.ipv4.tcp_keepalive_probes = 9

That works fine for all servers, except for the rancher nodes. Pods start crashing after chef applies the settings with sysctl --system.
However, on AWS we also have some rancher nodes running; those are using Canal. For some reason those are not impacted.
(AWS versions: rancher 2.7.3 with RKE2 v1.24.13+rke2r1)
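A rough sketch of the sequence that triggers it in a setup like this (the drop-in file name is hypothetical; the values are the ones listed above):

$ cat /etc/sysctl.d/99-keepalive.conf
net.ipv4.tcp_keepalive_time = 45
net.ipv4.tcp_keepalive_intvl = 45
net.ipv4.tcp_keepalive_probes = 9
$ sudo sysctl --system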

@endawkins

Testing Steps

Manual Execution:

SLES SP3

ami-0f7cb53c916a75006

Configuration

  • 3 servers
  • 4 agents
server1 config.yaml

Contents:
write-kubeconfig-mode: 644
cni: calico (optional)
token: <token>
profile: cis-1.23 or just cis
node-external-ip: <public_ip4_address>
server2&3 config.yaml

Contents:
write-kubeconfig-mode: 644
cni: calico (optional)
token: <token>
profile: cis-1.23 or just cis
server: https://<server1_IP>:9345
agent(s) config.yaml

Contents:
token: <token>
profile: cis-1.23 or just cis
server: https://<server1_IP>:9345

Preconditions

  • the following files are added to the instances (if not already added)
$ sudo vi /etc/sysctl.conf

Contents: 
net.ipv6.conf.all.disable_ipv6 = 1
$ sudo vi /etc/sysctl.d/60-rke2-cis.conf

Contents:
vm.panic_on_oom=0
vm.overcommit_memory=1
kernel.panic=10
kernel.panic_on_oops=1
$ sudo vi /etc/sysctl.d/70-yast.conf

Contents:
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 0
net.ipv6.conf.all.disable_ipv6 = 1
$ sudo vi /etc/sysctl.d/99-salt.conf

Contents:
#
# Kernel sysctl configuration
#
net.ipv4.ip_forward = 1
  • run the following commands for a hardened cluster
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd; sudo modprobe ip_vs_rr; sudo modprobe ip_vs_wrr; sudo modprobe ip_vs_sh; sudo systemctl restart systemd-sysctl; sudo sysctl -p /etc/sysctl.d/60-rke2-cis.conf
$ sudo mkdir -p /etc/rancher/rke2/ && sudo cp config.yaml /etc/rancher/rke2/ && cat /etc/rancher/rke2/config.yaml
  • RKE2 Installed
$ curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION=v1.28.2+rke2r1 INSTALL_RKE2_CHANNEL=stable INSTALL_RKE2_TYPE="server" sh -
$ curl -sfL https://get.rke2.io | sudo RKE2_URL=https://3.15.165.154:9345 INSTALL_RKE2_VERSION=v1.28.2+rke2r1 INSTALL_RKE2_CHANNEL=stable INSTALL_RKE2_TYPE="agent" sh -

Steps

  1. Capture the coredns pods
  2. Run the following command on each of the nodes sequentially and watch the status of the coredns pods (a watch command sketch follows this list):
$ sudo sysctl --system [go through steps 3 and 4, then run the next command]
$ sudo systemctl restart systemd-sysctl
  3. Run the following command to verify the state of the coredns pod:
$ kubectl describe pod <pod_name> -n kube-system
  4. Check the logs of the coredns pod:
$ kubectl logs pod/<pod_name> -n kube-system
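
A convenient way to keep an eye on the coredns pods while running the commands above (label taken from the pod descriptions below):

$ kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide -w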

Results/Observations:

$ kubectl describe pod rke2-coredns-rke2-coredns-67f86d96c-6sx76 -n kube-system
Name:                 rke2-coredns-rke2-coredns-67f86d96c-6sx76
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      coredns
Node:                 ip-<IP_ADDRESS>/<IP_ADDRESS>
Start Time:           Fri, 06 Oct 2023 17:42:47 +0000
Labels:               app.kubernetes.io/instance=rke2-coredns
                      app.kubernetes.io/name=rke2-coredns
                      k8s-app=kube-dns
                      pod-template-hash=67f86d96c
Annotations:          checksum/config: [REDACTED]
                      cni.projectcalico.org/containerID: [REDACTED]
                      cni.projectcalico.org/podIP: [REDACTED]/32
                      cni.projectcalico.org/podIPs: [REDACTED]/32
Status:               Running
IP:                   [REDACTED]
IPs:
  IP:           [REDACTED]
Controlled By:  ReplicaSet/rke2-coredns-rke2-coredns-67f86d96c
Containers:
  coredns:
    Container ID:  containerd://[REDACTED]
    Image:         rancher/hardened-coredns:v1.10.1-build20230607
    Image ID:      docker.io/rancher/hardened-coredns@sha256:[REDACTED]
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Running
      Started:      Fri, 06 Oct 2023 17:43:14 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:        100m
      memory:     128Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=30s timeout=5s period=10s #success=1 #failure=5
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l87tc (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rke2-coredns-rke2-coredns
    Optional:  false
  kube-api-access-l87tc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 CriticalAddonsOnly op=Exists
                             node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/etcd:NoExecute op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
$ kubectl describe pod rke2-coredns-rke2-coredns-67f86d96c-d7bpc -n kube-system
Name:                 rke2-coredns-rke2-coredns-67f86d96c-d7bpc
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      coredns
Node:                 ip-<IP_ADDRESS>/<IP_ADDRESS>
Start Time:           Fri, 06 Oct 2023 18:26:21 +0000
Labels:               app.kubernetes.io/instance=rke2-coredns
                      app.kubernetes.io/name=rke2-coredns
                      k8s-app=kube-dns
                      pod-template-hash=67f86d96c
Annotations:          checksum/config: [REDACTED]
                      cni.projectcalico.org/containerID: [REDACTED]
                      cni.projectcalico.org/podIP: [REDACTED]/32
                      cni.projectcalico.org/podIPs: [REDACTED]/32
Status:               Running
IP:                   [REDACTED]
IPs:
  IP:           [REDACTED]
Controlled By:  ReplicaSet/rke2-coredns-rke2-coredns-67f86d96c
Containers:
  coredns:
    Container ID:  containerd://[REDACTED]
    Image:         rancher/hardened-coredns:v1.10.1-build20230607
    Image ID:      docker.io/rancher/hardened-coredns@sha256:[REDACTED]
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Running
      Started:      Fri, 06 Oct 2023 18:26:26 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:        100m
      memory:     128Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=30s timeout=5s period=10s #success=1 #failure=5
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5j69j (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rke2-coredns-rke2-coredns
    Optional:  false
  kube-api-access-5j69j:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 CriticalAddonsOnly op=Exists
                             node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/etcd:NoExecute op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
$ kubectl logs pod/rke2-coredns-rke2-coredns-67f86d96c-6sx76 -n kube-system
.:53
[INFO] plugin/reload: Running configuration SHA512 = [REDACTED]
CoreDNS-1.10.1
linux/amd64, go1.20.4 X:boringcrypto, 055b2c31
$ kubectl logs pod/rke2-coredns-rke2-coredns-67f86d96c-d7bpc -n kube-system
.:53
[INFO] plugin/reload: Running configuration SHA512 = [REDACTED]
CoreDNS-1.10.1
linux/amd64, go1.20.4 X:boringcrypto, 055b2c31
$ sudo sysctl --system
* Applying /boot/sysctl.conf-5.3.18-59.37-default ...
kernel.hung_task_timeout_secs = 0
kernel.msgmax = 65536
kernel.msgmnb = 65536
kernel.shmmax = 0xffffffffffffffff
kernel.shmall = 0x0fffffffffffff00
vm.dirty_ratio = 20
* Applying /usr/lib/sysctl.d/50-default.conf ...
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.promote_secondaries = 1
net.ipv4.conf.all.promote_secondaries = 1
net.ipv6.conf.default.use_tempaddr = 1
net.ipv4.ping_group_range = 0 2147483647
fs.inotify.max_user_watches = 65536
kernel.sysrq = 184
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
kernel.kptr_restrict = 1
* Applying /usr/lib/sysctl.d/51-network.conf ...
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv6.conf.default.accept_redirects = 0
* Applying /etc/sysctl.d/60-rke2-cis.conf ...
vm.panic_on_oom = 0
vm.overcommit_memory = 1
kernel.panic = 10
kernel.panic_on_oops = 1
* Applying /etc/sysctl.d/70-yast.conf ...
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 0
net.ipv6.conf.all.disable_ipv6 = 1
* Applying /etc/sysctl.d/99-salt.conf ...
net.ipv4.ip_forward = 1
* Applying /usr/lib/sysctl.d/99-sysctl.conf ...
net.ipv6.conf.all.disable_ipv6 = 1
* Applying /etc/sysctl.conf ...
net.ipv6.conf.all.disable_ipv6 = 1

Conclusion:

No reported network issues

@endawkins endawkins reopened this Oct 6, 2023
@endawkins

accidentally closed the issue
