Omni erroneously removes a live etcd member #750

smira · 2024-12-02T13:56:30Z

justyns · 2024-12-24T22:25:12Z

FWIW, I also ran into this issue when upgrading from Talos 1.8.1 to 1.8.4. I didn't do anything special: I just updated the cluster template and ran `omnictl cluster template sync`. The first node upgraded without issue, but the second control plane node got stuck in an etcd restart loop with errors like:

100.64.0.77: {"level":"info","ts":"2024-12-24T22:14:15.262169Z","caller":"etcdserver/server.go:864","msg":"starting etcd server","local-member-id":"66de3babc0abe198","local-server-version":"3.5.17","cluster-id":"78fa62a055012012","cluster-version":"3.5"}
100.64.0.77: {"level":"info","ts":"2024-12-24T22:14:15.262330Z","caller":"etcdserver/server.go:773","msg":"starting initial election tick advance","election-ticks":5}
100.64.0.77: {"level":"warn","ts":"2024-12-24T22:14:15.262417Z","caller":"etcdserver/server.go:1154","msg":"server error","error":"the member has been permanently removed from the cluster"}
100.64.0.77: {"level":"warn","ts":"2024-12-24T22:14:15.262494Z","caller":"etcdserver/server.go:1155","msg":"data-dir used by this member must be removed"}

To "fix" it, I rebooted and selected the "Reset installation and reboot into maintenance mode" option for the 100.64.0.77 node. It eventually came back up.

I created a support bundle with `omnictl` before resetting the node. Would it be helpful to send it somewhere?
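For anyone trying to confirm the same symptom, the log lines quoted above can be checked mechanically. The following is a minimal sketch, assuming each line has the `<node>: <json>` shape shown in this comment; the type `etcdLogEntry` and the function `memberRemoved` are hypothetical names made up for illustration and are not part of Talos, etcd, or Omni.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// etcdLogEntry holds only the JSON fields this sketch looks at.
type etcdLogEntry struct {
	Msg   string `json:"msg"`
	Error string `json:"error"`
}

// memberRemoved (hypothetical helper) splits a "<node>: <json>" log line and
// reports whether the entry carries etcd's "permanently removed" error.
func memberRemoved(line string) (string, bool) {
	node, payload, ok := strings.Cut(line, ": ")
	if !ok {
		return "", false
	}

	var entry etcdLogEntry
	if err := json.Unmarshal([]byte(strings.TrimSpace(payload)), &entry); err != nil {
		// Not a JSON log entry; treat it as "no evidence of removal".
		return node, false
	}

	return node, strings.Contains(entry.Error, "permanently removed from the cluster")
}

func main() {
	line := `100.64.0.77: {"level":"warn","ts":"2024-12-24T22:14:15.262417Z","caller":"etcdserver/server.go:1154","msg":"server error","error":"the member has been permanently removed from the cluster"}`

	if node, removed := memberRemoved(line); removed {
		fmt.Printf("%s: etcd member was removed from the cluster\n", node)
	}
}
```

The match is on the error text quoted in the logs above, so it would need adjusting if etcd words that message differently in another version.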

Unix4ever · 2024-12-25T11:39:17Z

Did you have any extensions installed on the nodes?

The etcd audit logic is outdated for more recent Talos versions. `apid` now runs in states where it wasn't available before, so the check for the etcd member might lead to false positives. Also reorder the `auditMember` check sequence to be more correct.
Fixes: siderolabs#750
Signed-off-by: Artem Chernyshev <[email protected]>

The etcd audit logic is outdated for more recent Talos versions. `apid` now runs in states where it wasn't available before, so the check for the etcd member might lead to false positives. Also reorder the `auditMember` check sequence to be more correct.
Fixes: #750
Signed-off-by: Artem Chernyshev <[email protected]>
(cherry picked from commit 82da2f4)
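The guard named in the linked fix (never remove an etcd member whose ID has been discovered at least once) can be illustrated with a small sketch. This is not Omni's actual code: `auditTracker`, `observe`, and `orphans` are hypothetical names, the second member ID is made up, and the real `auditMember` logic performs checks that are not shown in this thread.

```go
package main

import "fmt"

// auditTracker (hypothetical) remembers every etcd member ID that a cluster
// machine has reported at least once.
type auditTracker struct {
	seen map[string]struct{}
}

func newAuditTracker() *auditTracker {
	return &auditTracker{seen: map[string]struct{}{}}
}

// observe records the member IDs discovered from machines in one audit pass.
func (a *auditTracker) observe(ids ...string) {
	for _, id := range ids {
		a.seen[id] = struct{}{}
	}
}

// orphans returns the IDs from the etcd member list that were never observed.
// An ID seen even once is kept, even if its machine is silent in this pass.
func (a *auditTracker) orphans(memberList []string) []string {
	var result []string
	for _, id := range memberList {
		if _, ok := a.seen[id]; !ok {
			result = append(result, id)
		}
	}
	return result
}

func main() {
	tracker := newAuditTracker()

	// Pass 1: the machine reports its member ID.
	tracker.observe("66de3babc0abe198")

	// Pass 2: the machine reports nothing (for example, mid-upgrade), but the
	// member is still in the etcd member list and must not be removed.
	members := []string{"66de3babc0abe198", "00000000deadbeef"} // second ID is made up
	fmt.Println(tracker.orphans(members)) // prints [00000000deadbeef]
}
```

The design choice sketched here trades cleanup speed for safety: a member that is truly gone but was seen once would need another removal path, which is outside what this thread describes.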

Unix4ever · 2024-12-25T17:03:12Z

I've identified the root cause. The fix has landed in 0.45.1.

smira assigned Unix4ever Dec 23, 2024

Unix4ever mentioned this issue Dec 25, 2024

fix: never remove etcd members which ID is discovered at least once #804

Merged

talos-bot closed this as completed in 82da2f4 Dec 25, 2024
