Segmentation Fault on running rke2 etcd-snapshot save #4942
Comments
It looks like something is deleting files out from under the directory walk while it is iterating over the listing. I suspect this requires simultaneously taking and pruning snapshots to reproduce? |
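For illustration, here is a minimal Go sketch of a filepath.Walk callback written defensively against entries being pruned mid-walk. This is not the actual k3s listLocalSnapshots code; listSnapshots and the directory path are placeholders.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// listSnapshots is a hypothetical helper showing the defensive pattern: Walk passes
// a non-nil err (and a nil info) when it cannot lstat an entry, for example because
// a concurrent prune deleted the file after the directory listing was read.
func listSnapshots(snapshotDir string) (map[string]int64, error) {
	snapshots := map[string]int64{}
	err := filepath.Walk(snapshotDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			if os.IsNotExist(err) {
				return nil // entry vanished mid-walk; skip it instead of panicking
			}
			return err
		}
		if info.IsDir() {
			return nil
		}
		snapshots[info.Name()] = info.Size()
		return nil
	})
	return snapshots, err
}

func main() {
	s, err := listSnapshots("/var/lib/rancher/rke2/server/db/snapshots")
	fmt.Println(s, err)
}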
I think the prune was happening because the cron was set to take a snapshot every minute, and a prune ran every 2 minutes because we configured it to retain only 2 snapshots.
In the meantime, I ran 5 on-demand saves in a loop, and the 4th save failed with the seg fault above. Yes, this looks like a timing issue between a simultaneous prune and snapshot save. |
There is a lock that prevents multiple snapshots from running at the same time within a single process, but it doesn't prevent multiple CLI invocations from stepping on each other, or the CLI from stepping on the service. We should handle that better. |
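One way to extend that in-process lock across separate invocations (a sketch only, assuming Linux and a hypothetical lock-file path; rke2 does not necessarily do this) is an advisory flock held for the duration of the snapshot operation:

package main

import (
	"fmt"
	"os"
	"syscall"
)

// acquireSnapshotLock takes an exclusive advisory lock on a well-known file, which
// also serializes other processes (e.g. a CLI invocation vs. the running service)
// that try to lock the same file. The path used here is hypothetical.
func acquireSnapshotLock(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	// LOCK_EX blocks until any other holder releases the lock.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	lock, err := acquireSnapshotLock("/var/lib/rancher/rke2/server/db/snapshots/.lock")
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to acquire snapshot lock:", err)
		os.Exit(1)
	}
	defer lock.Close() // closing the descriptor releases the flock
	// ... take or prune the snapshot while holding the lock ...
}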
We were getting a similar crash of the rancher agent on v1.25.16+rke2r1 when the |
Here is the full stack trace from our logs. I think this should be prioritized higher than medium.
|
I don't believe this is something we've tested, and it should be considered unsupported. Is the exact same NFS path shared by all the nodes? I would ensure that they're not all sharing the same path; otherwise you'll get weirdness like duplicate snapshots in the list, nodes pruning other nodes' snapshots out from underneath each other, and so on. This is not intended to be a shared filesystem. |
This particular issue wasn't caused by three controllers writing to the same shared folder. Two of the three mounts weren't mounted, so the path on two of the servers was an invalid directory, and that caused the stack trace and the main process to exit. The logs did say |
I'm not sure what you mean by "was an invalid directory". Even if you intended to have an NFS export mounted there, the directory would need to exist. Did the target snapshot directory not exist at all? I can confirm that I can reproduce a crash when setting the snapshot dir to something that does not exist:
rke2 server '--etcd-snapshot-dir=/does/not/exist' '--etcd-snapshot-schedule-cron=* * * * *'
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x28c028f]
goroutine 259 [running]:
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).listLocalSnapshots.func1({0xc000080044, 0xf}, {0x0, 0x0}, {0x3ad1ac0, 0xc001a7e6f0})
        ***@***.***/pkg/etcd/snapshot.go:439 +0x8f
path/filepath.Walk({0xc000080044, 0xf}, 0xc000bab048)
        /usr/local/go/src/path/filepath/path.go:562 +0x50
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).listLocalSnapshots(0xc0004ce140)
        ***@***.***/pkg/etcd/snapshot.go:438 +0xad
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).ReconcileSnapshotData(0xc0004ce140, {0x3b00f48, 0xc000459180})
        ***@***.***/pkg/etcd/snapshot.go:735 +0xe9
github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Start.func1()
        ***@***.***/pkg/cluster/cluster.go:110 +0xa4
created by github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Start
        ***@***.***/pkg/cluster/cluster.go:101 +0x6ca
|
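The nil-pointer panic is consistent with how the standard library behaves when the walk root does not exist: filepath.Walk still invokes the callback once, with a nil os.FileInfo and a non-nil error. A standalone sketch (illustrative only, not the k3s code) that reproduces the same class of panic:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// When the root cannot be lstat'd, Walk calls the callback with info == nil and
	// err describing the failure; using info without checking err dereferences nil.
	filepath.Walk("/does/not/exist", func(path string, info os.FileInfo, err error) error {
		fmt.Println(info.Name()) // panics: nil pointer dereference, as in the trace above
		return nil
	})
}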
Correct, the folder did not exist at all (because the NFS mount didn't work). Since the folder normally did exist (when the NFS mount was working), the configuration was previously correct, and it took us longer than I would like to figure out what the problem was. I am just asking for a better error message and/or an exit of the process (with a message) when the folder does not exist. I don't think it is unreasonable for a program that takes a path argument to check whether the path is valid and report it as invalid rather than crashing.
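A minimal sketch of the kind of pre-flight check being asked for here, assuming a hypothetical validateSnapshotDir helper run before any snapshot operation:

package main

import (
	"fmt"
	"os"
)

// validateSnapshotDir verifies that the configured snapshot directory exists and is
// a directory, returning a readable error instead of letting a later walk panic.
func validateSnapshotDir(dir string) error {
	info, err := os.Stat(dir)
	if os.IsNotExist(err) {
		return fmt.Errorf("etcd snapshot directory %q does not exist; is the backing mount present?", dir)
	}
	if err != nil {
		return fmt.Errorf("cannot access etcd snapshot directory %q: %w", dir, err)
	}
	if !info.IsDir() {
		return fmt.Errorf("etcd snapshot path %q is not a directory", dir)
	}
	return nil
}

func main() {
	if err := validateSnapshotDir("/does/not/exist"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}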
|
I mean, I'm guessing the mount didn't work because the folder didn't exist. The target directory has to exist for you to mount something there; it doesn't get created by the mount. But yes, we shouldn't crash if you ask to back up to a path that doesn't exist. |
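An alternative to failing fast would be to create the directory when it is missing. A sketch of that option, with the caveat that silently creating it would mask a broken mount like the one described in this thread (the path is a placeholder):

package main

import (
	"fmt"
	"os"
)

func main() {
	dir := "/does/not/exist" // placeholder for the configured --etcd-snapshot-dir
	// Creating the directory avoids the crash, but it can hide the fact that the
	// intended NFS mount is not actually present, so a loud warning (or refusing
	// to proceed) may be the better trade-off.
	if err := os.MkdirAll(dir, 0o700); err != nil {
		fmt.Fprintf(os.Stderr, "unable to create snapshot directory %s: %v\n", dir, err)
		os.Exit(1)
	}
}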
In this case I believe the mount point was something like /mnt/backup, which did exist, but the config was pointing to a directory a few levels down from that. /mnt/backup probably existed, but the mount didn't work for whatever reason, so the folder referenced in the config didn't exist. NFS is not really relevant to the error handling of the snapshot code; I probably shouldn't have mentioned it.
|
Issue found on master branch with commit 992194b
Environment Details
Infrastructure:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
or
Config.yaml:
Steps to reproduce:
Expected behavior:
Step 4: Perform etcd-snapshot operations (save, prune, list, delete) - none of them should hit a seg fault; each should exit gracefully.
Reproducing Results/Observations:
Prune operation seg faults:
|
Moving this out to the next release to fix the |
Validated on master branch with commit c7cd05b
Environment Details
Infrastructure:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
Config.yaml:
Testing Steps
Validation Results:
List:
Save:
Prune:
Delete:
|
Environmental Info:
RKE2 Version:
Node(s) CPU architecture, OS, and Version:
Ubuntu 22.04
Cluster Configuration:
1 server/1 agent
Describe the bug:
Hit a segmentation fault in one of the etcd-snapshot save operations:
Steps To Reproduce:
Hit the seg fault at loop 4 of the save operation above.
Also, the cron in config.yaml is set to take a snapshot every minute and to retain only 2 snapshots, so a snapshot save and prune happen every minute.