failed to recover v3 backend from snapshot #5214

sachinshakya507 · 2024-01-07T04:41:51Z

Environmental Info:
RKE2 Version: v1.24.4+rke2r1

Node(s) CPU architecture, OS, and Version:
Linux 5.10.0-27-cloud-amd64 #1 SMP Debian 5.10.205-2 (2023-12-31) x86_64 GNU/Linux

Cluster Configuration:
3 masters 5 agents

Describe the bug:
By default, the etcd backend quota is 2 GB. I raised it to 4 GB on all of my master nodes by editing /etc/rancher/rke2/config.yaml:

etcd-arg:
  - "quota-backend-bytes=4294967296"
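
For anyone checking the value above: a minimal sketch of the arithmetic, assuming the quota is expressed in binary gigabytes (GiB, 1024^3 bytes). The helper name `gib_to_bytes` is made up for illustration and is not part of RKE2 or etcd.

```python
# Size of one GiB in bytes (binary gigabyte: 1024^3).
GIB = 1024 ** 3


def gib_to_bytes(gib: int) -> int:
    """Convert a size in GiB to a byte count (hypothetical helper)."""
    return gib * GIB


# 4 GiB, the value used in the etcd-arg entry above.
quota_bytes = gib_to_bytes(4)
print(f"quota-backend-bytes={quota_bytes}")  # quota-backend-bytes=4294967296
```

The same number shows up as "quota-size-bytes":4294967296 in the etcd startup log, which suggests the argument was picked up on this node.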

Steps To Reproduce:

Installed RKE2:

What Happened:
After updating config.yaml, two of the three master nodes came up, but one master node now fails to start.

Additional context / logs:
This is the log from the failing master node, taken from the directory /var/log/pods/kube-system_etcd:

2024-01-07T04:25:34.916035393Z stderr F {"level":"info","ts":"2024-01-07T04:25:34.913Z","caller":"etcdmain/config.go:339","msg":"loaded server configuration, other configuration command line flags and environment variables will be ignored if provided","path":"/var/lib/rancher/rke2/server/db/etcd/config"}
2024-01-07T04:25:34.916101706Z stderr F {"level":"info","ts":"2024-01-07T04:25:34.913Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd","--config-file=/var/lib/rancher/rke2/server/db/etcd/config"]}
2024-01-07T04:25:34.916107708Z stderr F {"level":"warn","ts":"2024-01-07T04:25:34.913Z","caller":"etcdmain/etcd.go:446","msg":"found invalid file under data directory","filename":"config","data-dir":"/var/lib/rancher/rke2/server/db/etcd"}
2024-01-07T04:25:34.916112106Z stderr F {"level":"warn","ts":"2024-01-07T04:25:34.913Z","caller":"etcdmain/etcd.go:446","msg":"found invalid file under data directory","filename":"name","data-dir":"/var/lib/rancher/rke2/server/db/etcd"}
2024-01-07T04:25:34.916116384Z stderr F {"level":"info","ts":"2024-01-07T04:25:34.913Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/lib/rancher/rke2/server/db/etcd","dir-type":"member"}
2024-01-07T04:25:34.916123317Z stderr F {"level":"info","ts":"2024-01-07T04:25:34.913Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":[,"https://127.0.0.1:2380"]}
2024-01-07T04:25:34.916127906Z stderr F {"level":"info","ts":"2024-01-07T04:25:34.913Z","caller":"embed/etcd.go:479","msg":"starting with peer TLS","tls-info":"cert = /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.crt, key = /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.key, client-cert=, client-key=, trusted-ca = /var/lib/rancher/rke2/server/tls/etcd/peer-ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
2024-01-07T04:25:34.931013231Z stderr F {"level":"info","ts":"2024-01-07T04:25:34.930Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":[,"https://127.0.0.1:2379"]}
2024-01-07T04:25:34.931230737Z stderr F {"level":"info","ts":"2024-01-07T04:25:34.931Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.4","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.16.10b7","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"rke2-dev-s1-5da27fff","data-dir":"/var/lib/rancher/rke2/server/db/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/rancher/rke2/server/db/etcd/member","force-new-cluster":false,"heartbeat-interval":"500ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["https://127.0.0.1:2380"],"advertise-client-urls":[],"listen-client-urls":[,"https://127.0.0.1:2379"],"listen-metrics-urls":,"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"existing","initial-cluster-token":"","quota-size-bytes":4294967296,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
2024-01-07T04:25:35.472023137Z stderr F {"level":"info","ts":"2024-01-07T04:25:35.471Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/rancher/rke2/server/db/etcd/member/snap/db","took":"540.147935ms"}
2024-01-07T04:25:36.02201363Z stderr F {"level":"info","ts":"2024-01-07T04:25:36.021Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":1843155874,"snapshot-size":"408 kB"}
2024-01-07T04:25:36.065593705Z stderr F {"level":"warn","ts":"2024-01-07T04:25:36.065Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":1843155874,"snapshot-file-path":"/var/lib/rancher/rke2/server/db/etcd/member/snap/000000006ddc53a2.snap.db","error":"snap: snapshot file doesn't exist"}
2024-01-07T04:25:36.065666Z stderr F {"level":"panic","ts":"2024-01-07T04:25:36.065Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/server/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/main.go:32\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}
2024-01-07T04:25:36.067989744Z stderr F panic: failed to recover v3 backend from snapshot
2024-01-07T04:25:36.068003881Z stderr F 
2024-01-07T04:25:36.06800898Z stderr F goroutine 1 [running]:
2024-01-07T04:25:36.06801401Z stderr F go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0005103c0, 0xc0000bca80, 0x1, 0x1)
2024-01-07T04:25:36.068018357Z stderr F 	/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:234 +0x58d
2024-01-07T04:25:36.068022506Z stderr F go.uber.org/zap.(*Logger).Panic(0xc00002a0f0, 0x13c0efe, 0x2a, 0xc0000bca80, 0x1, 0x1)
2024-01-07T04:25:36.068026624Z stderr F 	/go/pkg/mod/go.uber.org/[email protected]/logger.go:227 +0x85
2024-01-07T04:25:36.068031723Z stderr F go.etcd.io/etcd/server/v3/etcdserver.NewServer(0xc00018b890, 0x14, 0x0, 0x0, 0x0, 0x0, 0xc000325290, 0x1, 0x1, 0xc000324360, ...)
2024-01-07T04:25:36.068035761Z stderr F 	/go/src/go.etcd.io/etcd/server/etcdserver/server.go:515 +0x1656
2024-01-07T04:25:36.068040519Z stderr F go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000030000, 0xc000030600, 0x0, 0x0)
2024-01-07T04:25:36.068044797Z stderr F 	/go/src/go.etcd.io/etcd/server/embed/etcd.go:245 +0xef8
2024-01-07T04:25:36.068048875Z stderr F go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000030000, 0x1394c9c, 0x6, 0xc00017ac01, 0x2)
2024-01-07T04:25:36.068053123Z stderr F 	/go/src/go.etcd.io/etcd/server/etcdmain/etcd.go:228 +0x32
2024-01-07T04:25:36.068057171Z stderr F go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003c060, 0x2, 0x2)
2024-01-07T04:25:36.068061218Z stderr F 	/go/src/go.etcd.io/etcd/server/etcdmain/etcd.go:123 +0x257a
2024-01-07T04:25:36.068065186Z stderr F go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003c060, 0x2, 0x2)
2024-01-07T04:25:36.068069113Z stderr F 	/go/src/go.etcd.io/etcd/server/etcdmain/main.go:40 +0x13f
2024-01-07T04:25:36.068073872Z stderr F main.main()
2024-01-07T04:25:36.068078721Z stderr F 	go.etcd.io/etcd/server/main.go:32 +0x45
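
To pick the relevant entries out of a log like the one above, here is a rough sketch that assumes each line has the shape `<timestamp> <stream> <tag> <message>` and that the message is usually, but not always, a JSON object. The function names are illustrative only; the sample lines are copied from the log above.

```python
import json


def parse_cri_line(line):
    """Return (timestamp, payload) for one log line, or None if it is too short.

    payload is a dict when the message part is a JSON object, else None.
    """
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) < 4:
        return None
    timestamp, _stream, _tag, message = parts
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        # Plain-text lines (e.g. the Go panic trace) are not JSON.
        payload = None
    if not isinstance(payload, dict):
        payload = None
    return timestamp, payload


def severe_entries(lines, levels=("warn", "error", "panic", "fatal")):
    """Yield (timestamp, level, msg) for structured entries at the given levels."""
    for line in lines:
        parsed = parse_cri_line(line)
        if parsed is None:
            continue
        timestamp, payload = parsed
        if payload is not None and payload.get("level") in levels:
            yield timestamp, payload["level"], payload.get("msg", "")


SAMPLE = """\
2024-01-07T04:25:36.02201363Z stderr F {"level":"info","ts":"2024-01-07T04:25:36.021Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":1843155874,"snapshot-size":"408 kB"}
2024-01-07T04:25:36.065593705Z stderr F {"level":"warn","ts":"2024-01-07T04:25:36.065Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":1843155874,"snapshot-file-path":"/var/lib/rancher/rke2/server/db/etcd/member/snap/000000006ddc53a2.snap.db","error":"snap: snapshot file doesn't exist"}
2024-01-07T04:25:36.067989744Z stderr F panic: failed to recover v3 backend from snapshot
"""

results = list(severe_entries(SAMPLE.splitlines()))
for timestamp, level, msg in results:
    print(timestamp, level, msg)
```

Lines whose message is not valid JSON are skipped rather than raising, so redacted or plain-text lines do not stop the scan.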

There are no snapshots in my snapshot directory:
/var/lib/rancher/rke2/server/db/snapshots
But I do have this particular file:
/var/lib/rancher/rke2/server/db/etcd/member/snap/000000006ddc53a2.snap.db
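
The file name in the log lines up with the reported snapshot index: 1843155874 written as 16 zero-padded hexadecimal digits is 000000006ddc53a2. A small sketch of that mapping follows; the function name `snap_db_name` is made up for illustration, and the directory is the one from the log.

```python
from pathlib import PurePosixPath


def snap_db_name(snapshot_index: int) -> str:
    """Format a snapshot index as a 16-digit, zero-padded hex .snap.db file name."""
    return f"{snapshot_index:016x}.snap.db"


# Values taken from the log above.
snap_dir = PurePosixPath("/var/lib/rancher/rke2/server/db/etcd/member/snap")
expected = snap_dir / snap_db_name(1843155874)
print(expected)
# /var/lib/rancher/rke2/server/db/etcd/member/snap/000000006ddc53a2.snap.db
```

This only shows how the index and the file name correspond. It does not explain why etcd reports the file as missing on this node.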

What could have gone wrong, and how can I save my failing master node?

brandond · 2024-01-07T06:09:49Z

You are on a very old release of rke2 that may not honor that argument properly at all times. Please upgrade to the latest v1.24 release at the very least, but preferably to a minor that is not end of life.

It also appears that you may have some corruption in your etcd datastore on one of the nodes; you may need to remove it from the cluster and rejoin it.

github-actions · 2024-03-01T20:11:22Z

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

github-actions bot added the status/stale label Mar 1, 2024

github-actions bot closed this as not planned Apr 6, 2024
