Etcd is consistently crashing on RKE2 deployment #6824

Closed
agaxprex opened this issue Sep 16, 2024 · 1 comment

@agaxprex

Environmental Info:
RKE2 Version:
rke2 version v1.29.0+rke2r1 (4fd30c2)
go version go1.21.5 X:boringcrypto

Node(s) CPU architecture, OS, and Version:
Linux yellowtail 4.18.0-477.10.1.el8_8.x86_64 #1 SMP Wed Apr 5 13:35:01 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
1 RKE2 Server node
SELinux Disabled

Describe the bug:
The etcd container for RKE2 starts up, but shortly afterwards the rke2-server process panics with a segfault (nil pointer dereference) and is restarted. The cycle repeats on every restart, so the cluster never comes up.

Steps To Reproduce:
Installed RKE2 traditionally, with the following config:

---
write-kubeconfig-mode: 666
token: fqzsqm1jsbrgjhb83xad
data-dir: /var/lib/rancher/rke2
cni: ['multus', 'canal']
tls-san:
  - localhost
  - cluster.local
  - 192.168.10.2
disable: ['rke2-ingress-nginx']
node-label:
  - cs=server
etcd-arg: ''
etcd-snapshot-dir: /var/lib/rancher/rke2/server/db/snapshots
etcd-snapshot-schedule-cron: '0 */12 * * *'
etcd-snapshot-retention: 5
etcd-snapshot-compress: true
snapshotter: native
node-name: yellowtail
profile: cis
debug: true
disable-cloud-controller: true

Expected behavior:
Cluster starts up normally

Actual behavior:
Containerd and the kubelet start fine and the etcd container launches normally, but a segfault follows. The etcd container is then relaunched and the same segfault occurs.

This is the relevant portion of the output from journalctl -u rke2-server:

Dec 31 22:18:04 yellowtail rke2[37955]: time="2021-12-31T22:18:04-05:00" level=info msg="containerd is now running"
Dec 31 22:18:04 yellowtail rke2[37955]: time="2021-12-31T22:18:04-05:00" level=debug msg="Deleting existing lease: {rke2 2022-01-01 03:17:17.103412157 +0000 UTC map[]}"
Dec 31 22:18:04 yellowtail rke2[37955]: time="2021-12-31T22:18:04-05:00" level=info msg="Importing images from /var/lib/rancher/rke2/agent/images/rke2-images-canal.linux-amd64.tar.gz"
Dec 31 22:18:21 yellowtail rke2[37955]: time="2021-12-31T22:18:21-05:00" level=info msg="Pod for etcd is synced"
Dec 31 22:18:21 yellowtail rke2[37955]: time="2021-12-31T22:18:21-05:00" level=info msg="Pod for kube-apiserver is synced"
Dec 31 22:18:21 yellowtail rke2[37955]: time="2021-12-31T22:18:21-05:00" level=info msg="ETCD server is now running"
Dec 31 22:18:21 yellowtail rke2[37955]: time="2021-12-31T22:18:21-05:00" level=info msg="rke2 is up and running"
Dec 31 22:18:21 yellowtail systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Dec 31 22:18:25 yellowtail rke2[37955]: {"level":"warn","ts":"2021-12-31T22:18:25.995336-0500","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00086f500/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Dec 31 22:18:25 yellowtail rke2[37955]: {"level":"info","ts":"2021-12-31T22:18:25.995483-0500","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Dec 31 22:18:34 yellowtail rke2[37955]: time="2021-12-31T22:18:34-05:00" level=info msg="Failed to get existing traefik HelmChart" error="helmcharts.helm.cattle.io \"traefik\" not found"
Dec 31 22:18:34 yellowtail rke2[37955]: time="2021-12-31T22:18:34-05:00" level=info msg="Reconciling ETCDSnapshotFile resources"
Dec 31 22:18:34 yellowtail rke2[37955]: time="2021-12-31T22:18:34-05:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
Dec 31 22:18:34 yellowtail rke2[37955]: panic: runtime error: invalid memory address or nil pointer dereference
Dec 31 22:18:34 yellowtail rke2[37955]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x28d75cd]
Dec 31 22:18:34 yellowtail rke2[37955]: goroutine 518 [running]:
Dec 31 22:18:34 yellowtail rke2[37955]: github.com/k3s-io/k3s/pkg/etcd.(*ETCD).listLocalSnapshots.func1({0xc0008a0654, 0x29}, {0x0, 0x0}, {0x3bbc700, 0xc001700150})
Dec 31 22:18:34 yellowtail rke2[37955]:         /go/pkg/mod/github.com/k3s-io/[email protected]/pkg/etcd/snapshot.go:439 +0x4d
Dec 31 22:18:34 yellowtail rke2[37955]: path/filepath.Walk({0xc0008a0654, 0x29}, 0xc0007f1038)
Dec 31 22:18:34 yellowtail rke2[37955]:         /usr/local/go/src/path/filepath/path.go:570 +0x4a
Dec 31 22:18:34 yellowtail rke2[37955]: github.com/k3s-io/k3s/pkg/etcd.(*ETCD).listLocalSnapshots(0xc000b0d0e0)
Dec 31 22:18:34 yellowtail rke2[37955]:         /go/pkg/mod/github.com/k3s-io/[email protected]/pkg/etcd/snapshot.go:438 +0xa7
Dec 31 22:18:34 yellowtail rke2[37955]: github.com/k3s-io/k3s/pkg/etcd.(*ETCD).ReconcileSnapshotData(0xc000b0d0e0, {0x3bf3f70, 0xc0009449b0})
Dec 31 22:18:34 yellowtail rke2[37955]:         /go/pkg/mod/github.com/k3s-io/[email protected]/pkg/etcd/snapshot.go:735 +0xd6
Dec 31 22:18:34 yellowtail rke2[37955]: github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Start.func1()
Dec 31 22:18:34 yellowtail rke2[37955]:         /go/pkg/mod/github.com/k3s-io/[email protected]/pkg/cluster/cluster.go:110 +0x9e
Dec 31 22:18:34 yellowtail rke2[37955]: created by github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Start in goroutine 1
Dec 31 22:18:34 yellowtail rke2[37955]:         /go/pkg/mod/github.com/k3s-io/[email protected]/pkg/cluster/cluster.go:101 +0x6ad
Dec 31 22:18:34 yellowtail systemd[1]: rke2-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 31 22:18:34 yellowtail systemd[1]: rke2-server.service: Failed with result 'exit-code'.
Dec 31 22:18:39 yellowtail systemd[1]: rke2-server.service: Service RestartSec=5s expired, scheduling restart.
Dec 31 22:18:39 yellowtail systemd[1]: rke2-server.service: Scheduled restart job, restart counter is at 20.
Dec 31 22:18:39 yellowtail systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).

Additional context / logs:

etcd config:


advertise-client-urls: https://192.168.10.2:2379
client-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
data-dir: /var/lib/rancher/rke2/server/db/etcd
election-timeout: 5000
experimental-initial-corrupt-check: true
heartbeat-interval: 500
initial-advertise-peer-urls: https://192.168.10.2:2380
initial-cluster: yellowtail-99ea3b8b=https://192.168.10.2:2380
initial-cluster-state: new
listen-client-http-urls: https://127.0.0.1:2382
listen-client-urls: https://127.0.0.1:2379,https://192.168.10.2:2379
listen-metrics-urls: http://127.0.0.1:2381
listen-peer-urls: https://127.0.0.1:2380,https://192.168.10.2:2380
log-outputs:
- stderr
logger: zap
name: yellowtail-99ea3b8b
peer-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/peer-ca.crt
snapshot-count: 10000
@brandond
Member

rke2 version v1.29.0+rke2r1

This version of RKE2 was released in December of 2023. Please upgrade to a newer version - the latest v1.29 patch release at the very least.
