Segmentation Fault on running rke2 etcd-snapshot save #4942
Comments
It looks like something is deleting files out from under the directory walk while it is iterating over the listing. I suspect this requires simultaneously taking and pruning snapshots to reproduce? |
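For illustration, here is a minimal Go sketch of a filepath.Walk callback written defensively against entries being pruned mid-walk. This is not the actual k3s listLocalSnapshots code; listSnapshots and the directory path are placeholders.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// listSnapshots is a hypothetical helper showing the defensive pattern: Walk passes
// a non-nil err (and a nil info) when it cannot lstat an entry, for example because
// a concurrent prune deleted the file after the directory listing was read.
func listSnapshots(snapshotDir string) (map[string]int64, error) {
	snapshots := map[string]int64{}
	err := filepath.Walk(snapshotDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			if os.IsNotExist(err) {
				return nil // entry vanished mid-walk; skip it instead of panicking
			}
			return err
		}
		if info.IsDir() {
			return nil
		}
		snapshots[info.Name()] = info.Size()
		return nil
	})
	return snapshots, err
}

func main() {
	s, err := listSnapshots("/var/lib/rancher/rke2/server/db/snapshots")
	fmt.Println(s, err)
}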
I think the prune was happening because the cron was set to take a snapshot every minute, and a prune ran every 2 minutes because we configured it to retain only 2 snapshots.
In the meantime, I ran 5 on-demand saves in a loop, and the 4th save failed with the seg fault above. Yes, this looks like a timing issue between a simultaneous prune and snapshot save. |
There is a lock that prevents multiple snapshots from running at the same time within a single process, but it doesn't prevent multiple CLI invocations from stepping on each other, or the CLI from stepping on the service. We should handle that better. |
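One way to extend that in-process lock across separate invocations (a sketch only, assuming Linux and a hypothetical lock-file path; rke2 does not necessarily do this) is an advisory flock held for the duration of the snapshot operation:

package main

import (
	"fmt"
	"os"
	"syscall"
)

// acquireSnapshotLock takes an exclusive advisory lock on a well-known file, which
// also serializes other processes (e.g. a CLI invocation vs. the running service)
// that try to lock the same file. The path used here is hypothetical.
func acquireSnapshotLock(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	// LOCK_EX blocks until any other holder releases the lock.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	lock, err := acquireSnapshotLock("/var/lib/rancher/rke2/server/db/snapshots/.lock")
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to acquire snapshot lock:", err)
		os.Exit(1)
	}
	defer lock.Close() // closing the descriptor releases the flock
	// ... take or prune the snapshot while holding the lock ...
}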
We were getting a similar crash of the rancher agent on v1.25.16+rke2r1 when the |
Here is the full stack trace from our logs. I think this should be prioritized higher than medium.
|
I don't believe this is something we've tested, and it should be considered unsupported. Is the exact same NFS path shared by all the nodes? I would ensure that they're not all sharing the same path; otherwise you'll get weirdness like duplicate snapshots in the list, nodes pruning other nodes' snapshots out from underneath each other, and so on. This is not intended to be a shared filesystem. |
This particular issue wasn't caused by three controllers writing to the same shared folder. Two of the three mounts weren't mounted, so the path on two of the servers was an invalid directory, and that caused the stack trace and the main process to exit. The logs did say |
I'm not sure what you mean by "was an invalid directory". Even if you intended to have an NFS export mounted there, the directory would need to exist. Did the target snapshot directory not exist at all? I can confirm that I can reproduce a crash when setting the snapshot dir to something that does not exist:
rke2 server '--etcd-snapshot-dir=/does/not/exist' '--etcd-snapshot-schedule-cron=* * * * *'
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x28c028f]
goroutine 259 [running]:
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).listLocalSnapshots.func1({0xc000080044, 0xf}, {0x0, 0x0}, {0x3ad1ac0, 0xc001a7e6f0})
        ***@***.***/pkg/etcd/snapshot.go:439 +0x8f
path/filepath.Walk({0xc000080044, 0xf}, 0xc000bab048)
        /usr/local/go/src/path/filepath/path.go:562 +0x50
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).listLocalSnapshots(0xc0004ce140)
        ***@***.***/pkg/etcd/snapshot.go:438 +0xad
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).ReconcileSnapshotData(0xc0004ce140, {0x3b00f48, 0xc000459180})
        ***@***.***/pkg/etcd/snapshot.go:735 +0xe9
github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Start.func1()
        ***@***.***/pkg/cluster/cluster.go:110 +0xa4
created by github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Start
        ***@***.***/pkg/cluster/cluster.go:101 +0x6ca
|
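The nil-pointer panic is consistent with how the standard library behaves when the walk root does not exist: filepath.Walk still invokes the callback once, with a nil os.FileInfo and a non-nil error. A standalone sketch (illustrative only, not the k3s code) that reproduces the same class of panic:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// When the root cannot be lstat'd, Walk calls the callback with info == nil and
	// err describing the failure; using info without checking err dereferences nil.
	filepath.Walk("/does/not/exist", func(path string, info os.FileInfo, err error) error {
		fmt.Println(info.Name()) // panics: nil pointer dereference, as in the trace above
		return nil
	})
}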
Correct, the folder did not exist at all (because the NFS mount didn't work). Since the folder normally did exist (when the NFS mount was working), the configuration was previously correct, and it took us longer than I would like to figure out what the problem was. I am just asking for a better error message and/or an exit of the process (with a message) when the folder does not exist. I don't think it is unreasonable for a program that takes a path argument to check whether the path is valid and report it as invalid rather than crashing.
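A minimal sketch of the kind of pre-flight check being asked for here, assuming a hypothetical validateSnapshotDir helper run before any snapshot operation:

package main

import (
	"fmt"
	"os"
)

// validateSnapshotDir verifies that the configured snapshot directory exists and is
// a directory, returning a readable error instead of letting a later walk panic.
func validateSnapshotDir(dir string) error {
	info, err := os.Stat(dir)
	if os.IsNotExist(err) {
		return fmt.Errorf("etcd snapshot directory %q does not exist; is the backing mount present?", dir)
	}
	if err != nil {
		return fmt.Errorf("cannot access etcd snapshot directory %q: %w", dir, err)
	}
	if !info.IsDir() {
		return fmt.Errorf("etcd snapshot path %q is not a directory", dir)
	}
	return nil
}

func main() {
	if err := validateSnapshotDir("/does/not/exist"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}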
|
I mean, I'm guessing the mount didn't work because the folder didn't exist. The target directory has to exist for you to mount something there; it doesn't get created by the mount. But yes, we shouldn't crash if you ask to back up to a path that doesn't exist. |
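An alternative to failing fast would be to create the directory when it is missing. A sketch of that option, with the caveat that silently creating it would mask a broken mount like the one described in this thread (the path is a placeholder):

package main

import (
	"fmt"
	"os"
)

func main() {
	dir := "/does/not/exist" // placeholder for the configured --etcd-snapshot-dir
	// Creating the directory avoids the crash, but it can hide the fact that the
	// intended NFS mount is not actually present, so a loud warning (or refusing
	// to proceed) may be the better trade-off.
	if err := os.MkdirAll(dir, 0o700); err != nil {
		fmt.Fprintf(os.Stderr, "unable to create snapshot directory %s: %v\n", dir, err)
		os.Exit(1)
	}
}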
In this case I believe the mount point was something like /mnt/backup, which did exist, but the config was pointing to a directory a few levels down from that. /mnt/backup probably existed, but the mount didn't work for whatever reason, so the folder referenced in the config didn't exist. NFS is not really relevant to the error handling of the snapshot code; I probably shouldn't have mentioned it.
|
Issue found on master branch with commit 992194b
Environment Details
Infrastructure:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
or
Config.yaml:
Steps to reproduce:
Expected behavior:
Step 4: Perform etcd-snapshot operations (save, prune, list, delete) - none of them should hit a seg fault; each should exit gracefully.
Reproducing Results/Observations:
Prune operation seg faults:
|
Moving this out to the next release to fix the |
Validated on master branch with commit c7cd05b
Environment Details
Infrastructure:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
Config.yaml:
Testing Steps
Validation Results:
List:
Save:
Prune:
Delete:
|
Environmental Info:
RKE2 Version:
Node(s) CPU architecture, OS, and Version:
Ubuntu 22.04
Cluster Configuration:
1 server/1 agent
Describe the bug:
Hit a segmentation fault in one of the etcd-snapshot save operations:
Steps To Reproduce:
Hit the seg fault at loop 4 of the save operation above.
Also, the cron in config.yaml is set to take a snapshot every minute and to retain only 2 snapshots, so a snapshot save and prune happen every minute.