
v1.28.3+rke2r2,rke2-server start failed #5052

Closed
liyang516 opened this issue Nov 17, 2023 · 5 comments

liyang516 commented Nov 17, 2023

Environmental Info:
RKE2 Version:
v1.28.3+rke2r2

Node(s) CPU architecture, OS, and Version:
OS: Ubuntu 22.04.1
Arch: x86_64

Cluster Configuration:

# cat /etc/rancher/rke2/config.yaml 
token: demo-server
cni: none
cluster-cidr: 100.65.0.0/16
service-cidr: 100.64.0.0/16
service-node-port-range: 30000-45000
kube-apiserver-arg:
  - max-requests-inflight=2000
  - max-mutating-requests-inflight=500
  - default-not-ready-toleration-seconds=100
  - default-unreachable-toleration-seconds=100
  - tls-cipher-suites=TLS_AES_128_GCM_SHA256,TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_256_GCM_SHA384
kube-controller-manager-arg:
  - "attach-detach-reconcile-sync-period=30s"

tls-san:
  - xcloud.io

data-dir: /var/lib/rancher/rke2
disable: rke2-ingress-nginx
kubelet-arg:
  - --max-pods=200
  - "root-dir=/var/lib/kubelet"
# cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml
[plugins.opt]
  path = "/var/lib/rancher/rke2/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10020"
  enable_selinux = false
  device_ownership_from_security_context = true
  sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins.cri.containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins.cri.registry.mirrors]

[plugins.cri.registry.mirrors."*"]
  endpoint = ["http://ecr.xcloud.io"]

[plugins.cri.registry.mirrors."harbor.xcloud.io"]
  endpoint = ["http://harbor.xcloud.io"]

[plugins.cri.registry.configs."ecr.xcloud.io".auth]
  username = "admin"
  password = "YgXc1@udQ1ax%2o23"
# cat /etc/rancher/rke2/registries.yaml
mirrors:
  "*":
    endpoint:
      - "http://ecr.xcloud.io"
  "harbor.xcloud.io":
    endpoint:
      - "http://harbor.xcloud.io"

configs:
  "ecr.xcloud.io":
    auth:
      username: admin # this is the registry username
      password: YgXc1@udQ1ax%2o23 # this is the registry password
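One thing worth noting about this config: the "*" wildcard mirror routes pulls for every registry, including the default system images from index.docker.io, through http://ecr.xcloud.io. If that endpoint is unreachable or does not mirror the rancher/* images, pulls of the etcd static pod image may stall or slow while each attempt times out. A narrower variant to rule this out might look like the following (a hypothetical troubleshooting sketch, not the reporter's actual config):

```yaml
# Hypothetical troubleshooting variant: mirror only docker.io explicitly,
# instead of the "*" wildcard, to narrow which pulls depend on ecr.xcloud.io.
mirrors:
  docker.io:
    endpoint:
      - "http://ecr.xcloud.io"
  "harbor.xcloud.io":
    endpoint:
      - "http://harbor.xcloud.io"
```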

Describe the bug:
rke2-server cannot start; the log repeatedly outputs "Waiting for etcd server to become available":

Nov 17 15:22:39 vm1 rke2[4120]: time="2023-11-17T15:22:39+08:00" level=info msg="Waiting for etcd server to become available"
Nov 17 15:22:39 vm1 rke2[4120]: time="2023-11-17T15:22:39+08:00" level=info msg="Waiting for API server to become available"
Nov 17 15:22:40 vm1 rke2[4120]: time="2023-11-17T15:22:40+08:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Nov 17 15:22:45 vm1 rke2[4120]: {"level":"warn","ts":"2023-11-17T15:22:45.563759+0800","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a51a40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Nov 17 15:22:45 vm1 rke2[4120]: time="2023-11-17T15:22:45+08:00" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Nov 17 15:22:45 vm1 rke2[4120]: time="2023-11-17T15:22:45+08:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Nov 17 15:22:50 vm1 rke2[4120]: time="2023-11-17T15:22:50+08:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Nov 17 15:22:54 vm1 rke2[4120]: {"level":"warn","ts":"2023-11-17T15:22:54.561536+0800","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a51a40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Nov 17 15:22:54 vm1 rke2[4120]: {"level":"info","ts":"2023-11-17T15:22:54.561634+0800","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Nov 17 15:22:55 vm1 rke2[4120]: time="2023-11-17T15:22:55+08:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Nov 17 15:23:00 vm1 rke2[4120]: {"level":"warn","ts":"2023-11-17T15:23:00.564626+0800","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a51a40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Nov 17 15:23:00 vm1 rke2[4120]: time="2023-11-17T15:23:00+08:00" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Nov 17 15:23:00 vm1 rke2[4120]: time="2023-11-17T15:23:00+08:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
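The "connection refused" on 127.0.0.1:2379 means nothing is listening on the etcd client port, while 9345 is up but returning 500 until etcd becomes healthy. A quick way to confirm this from the node (a sketch using bash's /dev/tcp; ports are the RKE2 defaults):

```shell
# Sketch: probe the ports rke2 is waiting on, with no extra tools needed.
check_port() {
  host=$1; port=$2
  # bash opens /dev/tcp/<host>/<port> as a TCP connection; failure => closed
  if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port closed"
  fi
}
check_port 127.0.0.1 2379   # etcd client port ("connection refused" in the log)
check_port 127.0.0.1 9345   # rke2 supervisor (listening, but 500 until etcd is up)
```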

Steps To Reproduce:

  • Installed RKE2:
INSTALL_RKE2_ARTIFACT_PATH=/root/v1.28.3+rke2r2/rke2-artifacts sh install.sh

Expected behavior:
The rke2-server service starts successfully.

Actual behavior:
The rke2-server service fails to start:

systemctl start rke2-server.service
Job for rke2-server.service failed because the control process exited with error code.
See "systemctl status rke2-server.service" and "journalctl -xeu rke2-server.service" for details.
+ systemctl status rke2-server.service --no-page
● rke2-server.service - Rancher Kubernetes Engine v2 (server)
     Loaded: loaded (/usr/local/lib/systemd/system/rke2-server.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Fri 2023-11-17 15:24:09 CST; 25ms ago
       Docs: https://github.com/rancher/rke2#readme
    Process: 4112 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
    Process: 4114 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 4115 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 4120 ExecStart=/usr/local/bin/rke2 server (code=exited, status=1/FAILURE)
    Process: 5955 ExecStopPost=/bin/sh -c systemd-cgls /system.slice/rke2-server.service | grep -Eo '[0-9]+ (containerd|kubelet)' | awk '{print $1}' | xargs -r kill (code=exited, status=0/SUCCESS)
   Main PID: 4120 (code=exited, status=1/FAILURE)
      Tasks: 12
     Memory: 3.7G
        CPU: 17min 23.993s
     CGroup: /system.slice/rke2-server.service
             └─5921 /var/lib/rancher/rke2/data/v1.28.3-rke2r2-0599290799e6/bin/containerd-shim-runc-v2 -namespace k8s.io -id 473017937fb9c99bc74aa02f39f130bc76be173df57d3d44db05bbb0b115f391 -address /run/k3s/containerd/containerd.sock

Other:
It seems that etcd's automatic endpoint sync is failing ("Auto sync endpoints failed").

@liyang516 liyang516 changed the title v1.28.3+rke2r2 rke2-server start failed v1.28.3+rke2r2,rke2-server start failed Nov 17, 2023
brandond (Member) commented Nov 17, 2023

Check the pod logs under /var/log/pods and containerd log at /var/lib/rancher/rke2/agent/containerd/containerd.log to see why etcd isn't starting.

If this is airgapped, confirm that the etcd image is available on your mirror.
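On a node with default paths, those checks might look like this (a sketch; the exact pod directory name varies per node):

```shell
# Sketch: locate the etcd pod logs and scan containerd's own log for errors.
CONTAINERD_LOG=/var/lib/rancher/rke2/agent/containerd/containerd.log

# etcd pod logs live under /var/log/pods/kube-system_etcd-<node>_<uid>/
ls -d /var/log/pods/kube-system_etcd-* 2>/dev/null \
  || echo "no etcd pod log directory found"

# Recent errors logged by containerd itself
if [ -f "$CONTAINERD_LOG" ]; then
  grep -i 'level=error' "$CONTAINERD_LOG" | tail -n 20
else
  echo "containerd log not found at $CONTAINERD_LOG"
fi
```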

Braekpo1nt commented:

I'm getting the same set of errors when I run the install script according to the RKE2 quickstart guide (even with the firewall turned off to see if that's the issue). I've been trying to figure out what's going on for days, but I'm stuck because this is my first time trying to run anything related to docker/kubernetes/rke2/rancher. I'll be following this thread to see what the solution is.

journalctl -u rke2-server logs:

Nov 17 15:53:56 scotts-lenovo rke2[162370]: {"level":"warn","ts":"2023-11-17T15:53:56.709478Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00075b880/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Nov 17 15:53:56 scotts-lenovo rke2[162370]: {"level":"info","ts":"2023-11-17T15:53:56.709565Z","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Nov 17 15:53:59 scotts-lenovo rke2[162370]: time="2023-11-17T15:53:59Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Nov 17 15:53:59 scotts-lenovo rke2[162370]: time="2023-11-17T15:53:59Z" level=error msg="Kubelet exited: exit status 1"

containerd log

cat /var/lib/rancher/rke2/agent/containerd/containerd.log | grep error
time="2023-11-16T21:54:45.496612330Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"modprobe: FATAL: Module aufs not found in directory /lib/modules/5.15.0-88-generic\\n\"): skip plugin" type=io.containerd.snapshotter.v1
time="2023-11-16T21:54:45.496894360Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.btrfs\"..." error="path /var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.btrfs (ext4) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2023-11-16T21:54:45.497067543Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.blockfile\"..." error="no scratch file generator: skip plugin" type=io.containerd.snapshotter.v1
time="2023-11-16T21:54:45.539304587Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
time="2023-11-16T21:54:45.539514118Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2023-11-16T21:54:45.539589144Z" level=warning msg="could not use snapshotter devmapper in metadata plugin" error="devmapper not configured"
time="2023-11-16T21:54:45.659093626Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
time="2023-11-16T21:54:45.659143336Z" level=info msg="skip loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." error="no OpenTelemetry endpoint: skip plugin" type=io.containerd.tracing.processor.v1
time="2023-11-16T21:54:45.659161243Z" level=info msg="skipping tracing processor initialization (no tracing plugin)" error="no OpenTelemetry endpoint: skip plugin"
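(Note: the "no network config found in /etc/cni/net.d" error above is normal until a CNI plugin has initialized, and with `cni: none` it persists until a CNI is installed manually. A quick check, as a sketch:)

```shell
# Sketch: see whether any CNI plugin has written its config yet.
CNI_DIR=/etc/cni/net.d
if ls "$CNI_DIR"/*.conf "$CNI_DIR"/*.conflist >/dev/null 2>&1; then
  echo "CNI config present:"
  ls "$CNI_DIR"
else
  echo "no CNI config in $CNI_DIR yet"
fi
```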

Braekpo1nt commented:

Solution:

  • Reinstall Ubuntu 20.04 LTS on my baremetal machine
  • Disable ufw and apparmor:
    sudo systemctl disable --now ufw
    sudo systemctl disable --now apparmor.service
  • Follow the RKE2 quickstart guide again

This works as expected, and things work fine. It took me five days straight to figure that out.

liyang516 (Author) commented Nov 20, 2023

@brandond Hi, thanks for your reply.
These are my etcd and containerd logs; I didn't find any useful information in them.
etcd log

# tail -f 1.log 
2023-11-20T10:54:57.308352068+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.308285Z","caller":"fileutil/purge.go:44","msg":"started to purge file","dir":"/var/lib/rancher/rke2/server/db/etcd/member/snap","suffix":"snap","max":5,"interval":"30s"}
2023-11-20T10:54:57.308399627+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.308308Z","caller":"fileutil/purge.go:44","msg":"started to purge file","dir":"/var/lib/rancher/rke2/server/db/etcd/member/wal","suffix":"wal","max":5,"interval":"30s"}
2023-11-20T10:54:57.311396105+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.311245Z","caller":"embed/etcd.go:726","msg":"starting with client TLS","tls-info":"cert = /var/lib/rancher/rke2/server/tls/etcd/server-client.crt, key = /var/lib/rancher/rke2/server/tls/etcd/server-client.key, client-cert=, client-key=, trusted-ca = /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
2023-11-20T10:54:57.311418482+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.311319Z","caller":"embed/etcd.go:597","msg":"serving peer traffic","address":"127.0.0.1:2380"}
2023-11-20T10:54:57.311423555+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.311373Z","caller":"embed/etcd.go:569","msg":"cmux::serve","address":"127.0.0.1:2380"}
2023-11-20T10:54:57.311427353+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.311314Z","caller":"embed/etcd.go:597","msg":"serving peer traffic","address":"10.57.131.130:2380"}
2023-11-20T10:54:57.311434672+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.31139Z","caller":"embed/etcd.go:569","msg":"cmux::serve","address":"10.57.131.130:2380"}
2023-11-20T10:54:57.311619926+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.311501Z","caller":"embed/etcd.go:278","msg":"now serving peer/client/metrics","local-member-id":"1652804ea3b95f3f","initial-advertise-peer-urls":["https://10.57.131.130:2380"],"listen-peer-urls":["https://10.57.131.130:2380","https://127.0.0.1:2380"],"advertise-client-urls":["https://10.57.131.130:2379"],"listen-client-urls":["https://10.57.131.130:2379","https://127.0.0.1:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"]}
2023-11-20T10:54:57.311691718+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.311538Z","caller":"embed/etcd.go:855","msg":"serving metrics","address":"http://127.0.0.1:2381"}
2023-11-20T10:55:12.309345393+08:00 stderr F {"level":"warn","ts":"2023-11-20T02:55:12.308926Z","caller":"etcdserver/server.go:2085","msg":"failed to publish local member to cluster through raft","local-member-id":"1652804ea3b95f3f","local-member-attributes":"{Name:vm1-470fdc6d ClientURLs:[https://10.57.131.130:2379]}","request-path":"/0/members/1652804ea3b95f3f/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}



# tail 2.log 
2023-11-20T10:10:56.4017914+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:56.401601Z","caller":"embed/serve.go:250","msg":"serving client traffic securely","traffic":"grpc","address":"127.0.0.1:2379"}
2023-11-20T10:10:56.401810415+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:56.401602Z","caller":"embed/serve.go:250","msg":"serving client traffic securely","traffic":"grpc","address":"10.57.131.130:2379"}
2023-11-20T10:10:58.974717526+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:58.974378Z","caller":"osutil/interrupt_unix.go:64","msg":"received signal; shutting down","signal":"terminated"}
2023-11-20T10:10:58.974770941+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:58.974526Z","caller":"embed/etcd.go:376","msg":"closing etcd server","name":"vm1-acc0de71","data-dir":"/var/lib/rancher/rke2/server/db/etcd","advertise-peer-urls":["https://10.57.131.130:2380"],"advertise-client-urls":["https://10.57.131.130:2379"]}
2023-11-20T10:10:59.779775767+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:59.77964Z","caller":"etcdserver/server.go:1465","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"1652804ea3b95f3f","current-leader-member-id":"1652804ea3b95f3f"}
2023-11-20T10:10:59.786814959+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:59.786622Z","caller":"embed/etcd.go:579","msg":"stopping serving peer traffic","address":"10.57.131.130:2380"}
2023-11-20T10:10:59.78694604+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:59.786807Z","caller":"embed/etcd.go:584","msg":"stopped serving peer traffic","address":"10.57.131.130:2380"}
2023-11-20T10:10:59.786963102+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:59.786849Z","caller":"embed/etcd.go:579","msg":"stopping serving peer traffic","address":"127.0.0.1:2380"}
2023-11-20T10:10:59.787017621+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:59.786928Z","caller":"embed/etcd.go:584","msg":"stopped serving peer traffic","address":"127.0.0.1:2380"}
2023-11-20T10:10:59.787038907+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:59.786947Z","caller":"embed/etcd.go:378","msg":"closed etcd server","name":"vm1-acc0de71","data-dir":"/var/lib/rancher/rke2/server/db/etcd","advertise-peer-urls":["https://10.57.131.130:2380"],"advertise-client-urls":["https://10.57.131.130:2379"]}

brandond (Member) commented:

Nov 17 15:23:00 vm1 rke2[4120]: time="2023-11-17T15:23:00+08:00" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
2023-11-20T10:55:12.309345393+08:00 stderr F {"level":"warn","ts":"2023-11-20T02:55:12.308926Z","caller":"etcdserver/server.go:2085","msg":"failed to publish local member to cluster through raft","local-member-id":"1652804ea3b95f3f","local-member-attributes":"{Name:vm1-470fdc6d ClientURLs:[https://10.57.131.130:2379]}","request-path":"/0/members/1652804ea3b95f3f/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}

Please confirm that you have disabled firewalld, ufw, or any other iptables-based host firewalls. It seems that both rke2 and etcd itself cannot communicate with the etcd ports.
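As a sketch, the firewall state can be checked like this on a systemd host (the firewalld/ufw service names are assumptions; adapt to your distro):

```shell
# Sketch: check for active host firewalls that could block 2379/2380/9345.
for svc in firewalld ufw; do
  if systemctl is-active --quiet "$svc" 2>/dev/null; then
    echo "$svc is active -- consider: systemctl disable --now $svc"
  else
    echo "$svc is not active"
  fi
done
# Raw iptables DROP/REJECT rules can also interfere:
iptables -S 2>/dev/null | grep -E 'DROP|REJECT' \
  || echo "no DROP/REJECT iptables rules visible (or iptables not available)"
```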
