
v1.28.3+rke2r2,rke2-server start failed #5052

Closed
liyang516 opened this issue Nov 17, 2023 · 5 comments

liyang516 commented Nov 17, 2023

Environmental Info:
RKE2 Version:
v1.28.3+rke2r2

Node(s) CPU architecture, OS, and Version:
OS: Ubuntu 22.04.1
Arch: x86_64

Cluster Configuration:

# cat /etc/rancher/rke2/config.yaml 
token: demo-server
cni: none
cluster-cidr: 100.65.0.0/16
service-cidr: 100.64.0.0/16
service-node-port-range: 30000-45000
kube-apiserver-arg:
  - max-requests-inflight=2000
  - max-mutating-requests-inflight=500
  - default-not-ready-toleration-seconds=100
  - default-unreachable-toleration-seconds=100
  - tls-cipher-suites=TLS_AES_128_GCM_SHA256,TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_256_GCM_SHA384
kube-controller-manager-arg:
  - "attach-detach-reconcile-sync-period=30s"

tls-san:
  - xcloud.io

data-dir: /var/lib/rancher/rke2
disable: rke2-ingress-nginx
kubelet-arg:
  - --max-pods=200
  - "root-dir=/var/lib/kubelet"
# cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml
[plugins.opt]
  path = "/var/lib/rancher/rke2/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10020"
  enable_selinux = false
  device_ownership_from_security_context = true
  sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins.cri.containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins.cri.registry.mirrors]

[plugins.cri.registry.mirrors."*"]
  endpoint = ["http://ecr.xcloud.io"]

[plugins.cri.registry.mirrors."harbor.xcloud.io"]
  endpoint = ["http://harbor.xcloud.io"]

[plugins.cri.registry.configs."ecr.xcloud.io".auth]
  username = "admin"
  password = "YgXc1@udQ1ax%2o23"
# cat /etc/rancher/rke2/registries.yaml
mirrors:
  "*":
    endpoint:
      - "http://ecr.xcloud.io"
  "harbor.xcloud.io":
    endpoint:
      - "http://harbor.xcloud.io"

configs:
  "ecr.xcloud.io":
    auth:
      username: admin # this is the registry username
      password: YgXc1@udQ1ax%2o23 # this is the registry password
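One thing worth noting about this config: the "*" wildcard mirror routes pulls for every registry, including the default system images from index.docker.io, through http://ecr.xcloud.io. If that endpoint is unreachable or does not mirror the rancher/* images, pulls of the etcd static pod image may stall or slow while each attempt times out. A narrower variant to rule this out might look like the following (a hypothetical troubleshooting sketch, not the reporter's actual config):

```yaml
# Hypothetical troubleshooting variant: mirror only docker.io explicitly,
# instead of the "*" wildcard, to narrow which pulls depend on ecr.xcloud.io.
mirrors:
  docker.io:
    endpoint:
      - "http://ecr.xcloud.io"
  "harbor.xcloud.io":
    endpoint:
      - "http://harbor.xcloud.io"
```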

Describe the bug:
rke2-server cannot start; the log repeatedly outputs "Waiting for etcd server to become available":

Nov 17 15:22:39 vm1 rke2[4120]: time="2023-11-17T15:22:39+08:00" level=info msg="Waiting for etcd server to become available"
Nov 17 15:22:39 vm1 rke2[4120]: time="2023-11-17T15:22:39+08:00" level=info msg="Waiting for API server to become available"
Nov 17 15:22:40 vm1 rke2[4120]: time="2023-11-17T15:22:40+08:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Nov 17 15:22:45 vm1 rke2[4120]: {"level":"warn","ts":"2023-11-17T15:22:45.563759+0800","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a51a40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Nov 17 15:22:45 vm1 rke2[4120]: time="2023-11-17T15:22:45+08:00" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Nov 17 15:22:45 vm1 rke2[4120]: time="2023-11-17T15:22:45+08:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Nov 17 15:22:50 vm1 rke2[4120]: time="2023-11-17T15:22:50+08:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Nov 17 15:22:54 vm1 rke2[4120]: {"level":"warn","ts":"2023-11-17T15:22:54.561536+0800","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a51a40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Nov 17 15:22:54 vm1 rke2[4120]: {"level":"info","ts":"2023-11-17T15:22:54.561634+0800","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Nov 17 15:22:55 vm1 rke2[4120]: time="2023-11-17T15:22:55+08:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Nov 17 15:23:00 vm1 rke2[4120]: {"level":"warn","ts":"2023-11-17T15:23:00.564626+0800","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a51a40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Nov 17 15:23:00 vm1 rke2[4120]: time="2023-11-17T15:23:00+08:00" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Nov 17 15:23:00 vm1 rke2[4120]: time="2023-11-17T15:23:00+08:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
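The "connection refused" on 127.0.0.1:2379 means nothing is listening on the etcd client port, while 9345 is up but returning 500 until etcd becomes healthy. A quick way to confirm this from the node (a sketch using bash's /dev/tcp; ports are the RKE2 defaults):

```shell
# Sketch: probe the ports rke2 is waiting on, with no extra tools needed.
check_port() {
  host=$1; port=$2
  # bash opens /dev/tcp/<host>/<port> as a TCP connection; failure => closed
  if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port closed"
  fi
}
check_port 127.0.0.1 2379   # etcd client port ("connection refused" in the log)
check_port 127.0.0.1 9345   # rke2 supervisor (listening, but 500 until etcd is up)
```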

Steps To Reproduce:

  • Installed RKE2:
INSTALL_RKE2_ARTIFACT_PATH=/root/v1.28.3+rke2r2/rke2-artifacts sh install.sh

Expected behavior:
The rke2-server service starts successfully.

Actual behavior:
The rke2-server service fails to start:

systemctl start rke2-server.service
Job for rke2-server.service failed because the control process exited with error code.
See "systemctl status rke2-server.service" and "journalctl -xeu rke2-server.service" for details.
+ systemctl status rke2-server.service --no-page
● rke2-server.service - Rancher Kubernetes Engine v2 (server)
     Loaded: loaded (/usr/local/lib/systemd/system/rke2-server.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Fri 2023-11-17 15:24:09 CST; 25ms ago
       Docs: https://github.com/rancher/rke2#readme
    Process: 4112 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
    Process: 4114 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 4115 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 4120 ExecStart=/usr/local/bin/rke2 server (code=exited, status=1/FAILURE)
    Process: 5955 ExecStopPost=/bin/sh -c systemd-cgls /system.slice/rke2-server.service | grep -Eo '[0-9]+ (containerd|kubelet)' | awk '{print $1}' | xargs -r kill (code=exited, status=0/SUCCESS)
   Main PID: 4120 (code=exited, status=1/FAILURE)
      Tasks: 12
     Memory: 3.7G
        CPU: 17min 23.993s
     CGroup: /system.slice/rke2-server.service
             └─5921 /var/lib/rancher/rke2/data/v1.28.3-rke2r2-0599290799e6/bin/containerd-shim-runc-v2 -namespace k8s.io -id 473017937fb9c99bc74aa02f39f130bc76be173df57d3d44db05bbb0b115f391 -address /run/k3s/containerd/containerd.sock

Other:
It seems that etcd's automatic endpoint sync is failing ("Auto sync endpoints failed").

@liyang516 liyang516 changed the title v1.28.3+rke2r2 rke2-server start failed v1.28.3+rke2r2,rke2-server start failed Nov 17, 2023
brandond (Member) commented Nov 17, 2023

Check the pod logs under /var/log/pods and containerd log at /var/lib/rancher/rke2/agent/containerd/containerd.log to see why etcd isn't starting.

If this is airgapped, confirm that the etcd image is available on your mirror.
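On a node with default paths, those checks might look like this (a sketch; the exact pod directory name varies per node):

```shell
# Sketch: locate the etcd pod logs and scan containerd's own log for errors.
CONTAINERD_LOG=/var/lib/rancher/rke2/agent/containerd/containerd.log

# etcd pod logs live under /var/log/pods/kube-system_etcd-<node>_<uid>/
ls -d /var/log/pods/kube-system_etcd-* 2>/dev/null \
  || echo "no etcd pod log directory found"

# Recent errors logged by containerd itself
if [ -f "$CONTAINERD_LOG" ]; then
  grep -i 'level=error' "$CONTAINERD_LOG" | tail -n 20
else
  echo "containerd log not found at $CONTAINERD_LOG"
fi
```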

Braekpo1nt commented:

I'm getting the same set of errors when I run the install script according to the RKE2 quickstart guide (even with the firewall turned off to see if that's the issue). I've been trying to figure out what's going on for days, but I'm stuck because this is my first time trying to run anything related to docker/kubernetes/rke2/rancher. I'll be following this thread to see what the solution is.

journalctl -u rke2-server logs:

Nov 17 15:53:56 scotts-lenovo rke2[162370]: {"level":"warn","ts":"2023-11-17T15:53:56.709478Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00075b880/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Nov 17 15:53:56 scotts-lenovo rke2[162370]: {"level":"info","ts":"2023-11-17T15:53:56.709565Z","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Nov 17 15:53:59 scotts-lenovo rke2[162370]: time="2023-11-17T15:53:59Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Nov 17 15:53:59 scotts-lenovo rke2[162370]: time="2023-11-17T15:53:59Z" level=error msg="Kubelet exited: exit status 1"

containerd log

cat /var/lib/rancher/rke2/agent/containerd/containerd.log | grep error
time="2023-11-16T21:54:45.496612330Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"modprobe: FATAL: Module aufs not found in directory /lib/modules/5.15.0-88-generic\\n\"): skip plugin" type=io.containerd.snapshotter.v1
time="2023-11-16T21:54:45.496894360Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.btrfs\"..." error="path /var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.btrfs (ext4) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2023-11-16T21:54:45.497067543Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.blockfile\"..." error="no scratch file generator: skip plugin" type=io.containerd.snapshotter.v1
time="2023-11-16T21:54:45.539304587Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
time="2023-11-16T21:54:45.539514118Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2023-11-16T21:54:45.539589144Z" level=warning msg="could not use snapshotter devmapper in metadata plugin" error="devmapper not configured"
time="2023-11-16T21:54:45.659093626Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
time="2023-11-16T21:54:45.659143336Z" level=info msg="skip loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." error="no OpenTelemetry endpoint: skip plugin" type=io.containerd.tracing.processor.v1
time="2023-11-16T21:54:45.659161243Z" level=info msg="skipping tracing processor initialization (no tracing plugin)" error="no OpenTelemetry endpoint: skip plugin"
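(Note: the "no network config found in /etc/cni/net.d" error above is normal until a CNI plugin has initialized, and with `cni: none` it persists until a CNI is installed manually. A quick check, as a sketch:)

```shell
# Sketch: see whether any CNI plugin has written its config yet.
CNI_DIR=/etc/cni/net.d
if ls "$CNI_DIR"/*.conf "$CNI_DIR"/*.conflist >/dev/null 2>&1; then
  echo "CNI config present:"
  ls "$CNI_DIR"
else
  echo "no CNI config in $CNI_DIR yet"
fi
```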

Braekpo1nt commented:

Solution:

  • Reinstall Ubuntu 20.04 LTS on my baremetal machine
  • Disable ufw and apparmor:
    sudo systemctl disable --now ufw
    sudo systemctl disable --now apparmor.service
  • Follow the RKE2 quickstart guide again

This works as expected, and things work fine. It took me five days straight to figure that out.

liyang516 (Author) commented Nov 20, 2023

@brandond Hi, thanks for your reply.
These are my etcd and containerd logs; I didn't find any useful information in them.
etcd log

# tail -f 1.log 
2023-11-20T10:54:57.308352068+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.308285Z","caller":"fileutil/purge.go:44","msg":"started to purge file","dir":"/var/lib/rancher/rke2/server/db/etcd/member/snap","suffix":"snap","max":5,"interval":"30s"}
2023-11-20T10:54:57.308399627+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.308308Z","caller":"fileutil/purge.go:44","msg":"started to purge file","dir":"/var/lib/rancher/rke2/server/db/etcd/member/wal","suffix":"wal","max":5,"interval":"30s"}
2023-11-20T10:54:57.311396105+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.311245Z","caller":"embed/etcd.go:726","msg":"starting with client TLS","tls-info":"cert = /var/lib/rancher/rke2/server/tls/etcd/server-client.crt, key = /var/lib/rancher/rke2/server/tls/etcd/server-client.key, client-cert=, client-key=, trusted-ca = /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
2023-11-20T10:54:57.311418482+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.311319Z","caller":"embed/etcd.go:597","msg":"serving peer traffic","address":"127.0.0.1:2380"}
2023-11-20T10:54:57.311423555+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.311373Z","caller":"embed/etcd.go:569","msg":"cmux::serve","address":"127.0.0.1:2380"}
2023-11-20T10:54:57.311427353+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.311314Z","caller":"embed/etcd.go:597","msg":"serving peer traffic","address":"10.57.131.130:2380"}
2023-11-20T10:54:57.311434672+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.31139Z","caller":"embed/etcd.go:569","msg":"cmux::serve","address":"10.57.131.130:2380"}
2023-11-20T10:54:57.311619926+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.311501Z","caller":"embed/etcd.go:278","msg":"now serving peer/client/metrics","local-member-id":"1652804ea3b95f3f","initial-advertise-peer-urls":["https://10.57.131.130:2380"],"listen-peer-urls":["https://10.57.131.130:2380","https://127.0.0.1:2380"],"advertise-client-urls":["https://10.57.131.130:2379"],"listen-client-urls":["https://10.57.131.130:2379","https://127.0.0.1:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"]}
2023-11-20T10:54:57.311691718+08:00 stderr F {"level":"info","ts":"2023-11-20T02:54:57.311538Z","caller":"embed/etcd.go:855","msg":"serving metrics","address":"http://127.0.0.1:2381"}
2023-11-20T10:55:12.309345393+08:00 stderr F {"level":"warn","ts":"2023-11-20T02:55:12.308926Z","caller":"etcdserver/server.go:2085","msg":"failed to publish local member to cluster through raft","local-member-id":"1652804ea3b95f3f","local-member-attributes":"{Name:vm1-470fdc6d ClientURLs:[https://10.57.131.130:2379]}","request-path":"/0/members/1652804ea3b95f3f/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}



# tail 2.log 
2023-11-20T10:10:56.4017914+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:56.401601Z","caller":"embed/serve.go:250","msg":"serving client traffic securely","traffic":"grpc","address":"127.0.0.1:2379"}
2023-11-20T10:10:56.401810415+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:56.401602Z","caller":"embed/serve.go:250","msg":"serving client traffic securely","traffic":"grpc","address":"10.57.131.130:2379"}
2023-11-20T10:10:58.974717526+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:58.974378Z","caller":"osutil/interrupt_unix.go:64","msg":"received signal; shutting down","signal":"terminated"}
2023-11-20T10:10:58.974770941+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:58.974526Z","caller":"embed/etcd.go:376","msg":"closing etcd server","name":"vm1-acc0de71","data-dir":"/var/lib/rancher/rke2/server/db/etcd","advertise-peer-urls":["https://10.57.131.130:2380"],"advertise-client-urls":["https://10.57.131.130:2379"]}
2023-11-20T10:10:59.779775767+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:59.77964Z","caller":"etcdserver/server.go:1465","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"1652804ea3b95f3f","current-leader-member-id":"1652804ea3b95f3f"}
2023-11-20T10:10:59.786814959+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:59.786622Z","caller":"embed/etcd.go:579","msg":"stopping serving peer traffic","address":"10.57.131.130:2380"}
2023-11-20T10:10:59.78694604+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:59.786807Z","caller":"embed/etcd.go:584","msg":"stopped serving peer traffic","address":"10.57.131.130:2380"}
2023-11-20T10:10:59.786963102+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:59.786849Z","caller":"embed/etcd.go:579","msg":"stopping serving peer traffic","address":"127.0.0.1:2380"}
2023-11-20T10:10:59.787017621+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:59.786928Z","caller":"embed/etcd.go:584","msg":"stopped serving peer traffic","address":"127.0.0.1:2380"}
2023-11-20T10:10:59.787038907+08:00 stderr F {"level":"info","ts":"2023-11-20T02:10:59.786947Z","caller":"embed/etcd.go:378","msg":"closed etcd server","name":"vm1-acc0de71","data-dir":"/var/lib/rancher/rke2/server/db/etcd","advertise-peer-urls":["https://10.57.131.130:2380"],"advertise-client-urls":["https://10.57.131.130:2379"]}

brandond (Member) commented:

Nov 17 15:23:00 vm1 rke2[4120]: time="2023-11-17T15:23:00+08:00" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
2023-11-20T10:55:12.309345393+08:00 stderr F {"level":"warn","ts":"2023-11-20T02:55:12.308926Z","caller":"etcdserver/server.go:2085","msg":"failed to publish local member to cluster through raft","local-member-id":"1652804ea3b95f3f","local-member-attributes":"{Name:vm1-470fdc6d ClientURLs:[https://10.57.131.130:2379]}","request-path":"/0/members/1652804ea3b95f3f/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}

Please confirm that you have disabled firewalld, ufw, or any other iptables-based host firewalls. It seems that both rke2 and etcd itself cannot communicate with the etcd ports.
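As a sketch, the firewall state can be checked like this on a systemd host (the firewalld/ufw service names are assumptions; adapt to your distro):

```shell
# Sketch: check for active host firewalls that could block 2379/2380/9345.
for svc in firewalld ufw; do
  if systemctl is-active --quiet "$svc" 2>/dev/null; then
    echo "$svc is active -- consider: systemctl disable --now $svc"
  else
    echo "$svc is not active"
  fi
done
# Raw iptables DROP/REJECT rules can also interfere:
iptables -S 2>/dev/null | grep -E 'DROP|REJECT' \
  || echo "no DROP/REJECT iptables rules visible (or iptables not available)"
```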
