
etcd.yaml manifest not respecting etcd-arg parameters #6849

Closed

carpepraedam opened this issue Sep 20, 2024 · 1 comment

Comments

@carpepraedam
Environmental Info:
RKE2 Version:
v1.30.4+rke2r1

Node(s) CPU architecture, OS, and Version:
Linux kube-svc-m1 6.8.0-45-generic #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
3 masters, 3 workers. Cluster exists on a network where static IPs are not available. All nodes get IPs via DHCP. Nodes will receive new IPs regularly due to patching and various reboots.

Describe the bug:
etcd in RKE2 does not work with fully qualified domain names. This causes the control plane to break when control plane members have different IPs than when the cluster was originally configured. A cluster reset procedure fixes this; however, I would expect the RKE2 configuration to support etcd over FQDNs, as this is a fully supported feature of etcd itself.
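For reference, the reset flow mentioned above is roughly the following, a sketch assuming the documented rke2 server --cluster-reset procedure:

# stop rke2-server on every server node first
systemctl stop rke2-server
# on one surviving server, reset etcd membership to that single node
rke2 server --cluster-reset
# restart, then rejoin the remaining servers
systemctl start rke2-server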

Steps To Reproduce:

  1. Create the rke2 config file:
mkdir -p /etc/rancher/rke2/
cat << EOF > /etc/rancher/rke2/config.yaml
# server: https://kube-svc-m1.domain.net:9345
token: my-cluster-token
tls-san:
  - kube-svc-m1.domain.net
  - kube-svc-m2.domain.net
  - kube-svc-m3.domain.net
etcd-expose-metrics: true
etcd-snapshot-retention: 5
etcd-arg:
  - "initial-cluster=etcd1=https://kube-svc-m1.domain.net:2380,etcd2=https://kube-svc-m2.domain.net:2380,etcd3=https://kube-svc-m3.domain.net:2380"
  - "initial-advertise-peer-urls=https://kube-svc-m1.domain.net:2380"
  - "listen-peer-urls=https://0.0.0.0:2380"
  - "listen-client-urls=https://0.0.0.0:2379"
  - "advertise-client-urls=https://kube-svc-m1.domain.net:2379"
EOF
  2. Install and enable rke2-server:
curl -sfL https://get.rke2.io | sh -
systemctl enable rke2-server.service && systemctl start rke2-server.service
  3. The server enters a state where it is waiting for etcd. Check the etcd logs (a path-glob variant of this command is sketched after these steps):
tail -f /var/log/containers/etcd-kube-svc-m1_kube-system_etcd-6f6e22bb993a1b5c9595107a98992f844058bcddf37aa30b2007a5793a22984a.log
...snip
2024-09-20T16:41:22.462391761Z stderr F {"level":"info","ts":"2024-09-20T16:41:22.462298Z","caller":"embed/etcd.go:375","msg":"closing etcd server","name":"kube-svc-m1-15686fed","data-dir":"/var/lib/rancher/rke2/server/db/etcd","advertise-peer-urls":["https://kube-svc-m1.domain.net:2380"],"advertise-client-urls":["https://kube-svc-m1.domain.net:2379"]}
2024-09-20T16:41:22.462518325Z stderr F {"level":"info","ts":"2024-09-20T16:41:22.462425Z","caller":"embed/etcd.go:377","msg":"closed etcd server","name":"kube-svc-m1-15686fed","data-dir":"/var/lib/rancher/rke2/server/db/etcd","advertise-peer-urls":["https://kube-svc-m1.domain.net:2380"],"advertise-client-urls":["https://kube-svc-m1.domain.net:2379"]}
2024-09-20T16:41:22.462655433Z stderr F {"level":"fatal","ts":"2024-09-20T16:41:22.462461Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"couldn't find local name \"kube-svc-m1-15686fed\" in the initial cluster configuration","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/server/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/main.go:31\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
...snip
  4. Noticing that there is an etcd member "kube-svc-m1-15686fed" that is not recognized by etcd, we check the etcd configuration:
# /var/lib/rancher/rke2/agent/pod-manifests/etcd.yaml
root@kube-svc-m1:/var/lib/rancher/rke2/agent/pod-manifests# cat etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    etcd.k3s.io/initial: '{"initial-advertise-peer-urls":"https://192.168.0.40:2380","initial-cluster":"kube-svc-m1-15686fed=https://192.168.0.40:2380","initial-cluster-state":"new"}'
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
  uid: 0201d9cc86c1c557a40dc8c8004e4eaf
spec:
  containers:
  - args:
    - --config-file=/var/lib/rancher/rke2/server/db/etcd/config
    command:
    - etcd
...snip
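As an aside, the container log filename in step 3 embeds a node-specific pod UID and container hash, so the exact path differs per node. A glob, assuming the standard kubelet log layout (<pod>_<namespace>_<container>-<hash>.log), avoids hard-coding it:

tail -f /var/log/containers/etcd-*_kube-system_etcd-*.log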

The root cause is that the etcd.yaml pod manifest does not seem to respect the etcd-arg initial-advertise-peer-urls setting. This causes the etcd pod to fail because kube-svc-m1-15686fed is not a member defined in the etcd-arg parameters. Manually changing the manifest does not work, as the file is recreated every time rke2 restarts.
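One way to confirm what etcd actually receives is to read the file that the manifest's --config-file flag points at. A quick check, assuming the usual etcd YAML config key names:

# values here are generated by RKE2, not taken from etcd-arg
grep -E 'initial-(cluster|advertise)' /var/lib/rancher/rke2/server/db/etcd/config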

Expected behavior:
I expect to be able to set etcd-arg so that etcd members can talk over FQDNs, which avoids control plane failures when master nodes change IP addresses. While static IPs are nice, there is no reason RKE2 cannot support a feature that etcd itself already provides.

Actual behavior:
RKE2 generates etcd member names and addresses that do not match those set in /etc/rancher/rke2/config.yaml.

Additional context / logs:

@brandond
Member

brandond commented Sep 20, 2024

etcd-arg:
  - "initial-cluster=etcd1=https://kube-svc-m1.domain.net:2380,etcd2=https://kube-svc-m2.domain.net:2380,etcd3=https://kube-svc-m3.domain.net:2380"
  - "initial-advertise-peer-urls=https://kube-svc-m1.domain.net:2380"
  - "listen-peer-urls=https://0.0.0.0:2380"
  - "listen-client-urls=https://0.0.0.0:2379"
  - "advertise-client-urls=https://kube-svc-m1.domain.net:2379"

Don't do that. Etcd cluster membership and advertised addresses are managed by RKE2 and you should not attempt to override these CLI args. If you want to manage your own etcd cluster, you should do so using standalone etcd, installed as a systemd service.
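For reference, a minimal sketch of that standalone alternative, assuming an etcd binary at /usr/local/bin/etcd and reusing the FQDNs from this issue; the unit path, member name, and the elided TLS flags are placeholders:

# /etc/systemd/system/etcd.service (hypothetical)
[Unit]
Description=etcd key-value store
After=network-online.target

[Service]
# TLS flags (--cert-file, --key-file, --peer-cert-file, ...) omitted here;
# they are required when using https URLs
ExecStart=/usr/local/bin/etcd \
  --name etcd1 \
  --initial-cluster etcd1=https://kube-svc-m1.domain.net:2380,etcd2=https://kube-svc-m2.domain.net:2380,etcd3=https://kube-svc-m3.domain.net:2380 \
  --initial-advertise-peer-urls https://kube-svc-m1.domain.net:2380 \
  --listen-peer-urls https://0.0.0.0:2380 \
  --listen-client-urls https://0.0.0.0:2379 \
  --advertise-client-urls https://kube-svc-m1.domain.net:2379
Restart=always

[Install]
WantedBy=multi-user.target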

I would expect RKE2 configuration to be able to support ETCD over FQDN as this is a fully supported feature of ETCD itself.

We intentionally use node internal IP addresses instead of DNS names for cluster endpoints to ensure that the cluster works reliably without requiring users to have functional DNS for their node addresses. We are not currently planning on supporting use of hostnames or external IPs for managed cluster member endpoints.
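Given that behavior, a practical workaround, assuming DHCP reservations can keep each node's address stable, is to pin the address RKE2 advertises via the node-ip option so the managed endpoints never drift:

# append to /etc/rancher/rke2/config.yaml on this node
# (192.168.0.40 is the reserved address from the manifest annotation above)
node-ip: 192.168.0.40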
