Omni erroneously removes a live etcd member #750

Closed
smira opened this issue Dec 2, 2024 · 3 comments · Fixed by #804

@smira
Member

smira commented Dec 2, 2024

#745

@justyns

justyns commented Dec 24, 2024

FWIW, I also ran into this issue when upgrading from Talos 1.8.1 to 1.8.4. I didn't do anything special: I just updated the cluster template and ran `omnictl cluster template sync`. The first node upgraded without issue, but the second control plane node got stuck in an etcd restart loop with errors like:

100.64.0.77: {"level":"info","ts":"2024-12-24T22:14:15.262169Z","caller":"etcdserver/server.go:864","msg":"starting etcd server","local-member-id":"66de3babc0abe198","local-server-version":"3.5.17","cluster-id":"78fa62a055012012","cluster-version":"3.5"}
100.64.0.77: {"level":"info","ts":"2024-12-24T22:14:15.262330Z","caller":"etcdserver/server.go:773","msg":"starting initial election tick advance","election-ticks":5}
100.64.0.77: {"level":"warn","ts":"2024-12-24T22:14:15.262417Z","caller":"etcdserver/server.go:1154","msg":"server error","error":"the member has been permanently removed from the cluster"}
100.64.0.77: {"level":"warn","ts":"2024-12-24T22:14:15.262494Z","caller":"etcdserver/server.go:1155","msg":"data-dir used by this member must be removed"}
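
For anyone checking whether their own nodes hit the same condition, here is a minimal sketch that scans log lines in the format shown above (a node address, `": "`, then one JSON object per line) and reports nodes whose etcd says it was permanently removed. The function and type names (`find_removed_members`, `RemovedMember`) are hypothetical and not part of Omni or Talos; the only things taken from the logs above are the `error`, `ts`, and `local-member-id` fields and the error text.

```python
import json
from typing import Dict, Iterable, Iterator, NamedTuple, Optional

# Error text copied from the etcd log lines quoted in this comment.
REMOVED_MSG = "the member has been permanently removed from the cluster"


class RemovedMember(NamedTuple):
    node: str                 # prefix before the first ": ", e.g. "100.64.0.77"
    member_id: Optional[str]  # last "local-member-id" seen for that node, if any
    timestamp: str            # "ts" field of the matching log entry


def find_removed_members(lines: Iterable[str]) -> Iterator[RemovedMember]:
    """Yield one record per log entry whose error says the member was removed."""
    last_member_id: Dict[str, str] = {}
    for raw in lines:
        # Split "<node>: <json>"; skip anything that does not have that shape.
        node, sep, payload = raw.strip().partition(": ")
        if not sep or not payload.startswith("{"):
            continue
        try:
            entry = json.loads(payload)
        except json.JSONDecodeError:
            continue
        # The "starting etcd server" entry carries the local member ID.
        if "local-member-id" in entry:
            last_member_id[node] = entry["local-member-id"]
        if entry.get("error") == REMOVED_MSG:
            yield RemovedMember(node, last_member_id.get(node), entry.get("ts", ""))
```

The function takes any iterable of strings, so it can be fed a saved log file or lines read from standard input. It only detects the symptom; it does not say why the member was removed.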

To "fix" it, I rebooted and selected the "Reset installation and reboot into maintenance mode" option for the 100.64.0.77 node. It eventually came back up.

I made a support bundle with omnictl before resetting the node. Would it be helpful to send it somewhere?

@Unix4ever
Member

Did you have any extensions installed on the nodes?

Unix4ever added a commit to Unix4ever/omni that referenced this issue Dec 25, 2024
The logic of the etcd audit got outdated with the more recent Talos
versions. `apid` now runs in the states where it wasn't available
before, so the check for the etcd member might lead to the
false-positives.
Also reorder the `auditMember` check sequence to be more correct.

Fixes: siderolabs#750

Signed-off-by: Artem Chernyshev <[email protected]>
Unix4ever added a commit that referenced this issue Dec 25, 2024
The logic of the etcd audit got outdated with the more recent Talos
versions. `apid` now runs in the states where it wasn't available
before, so the check for the etcd member might lead to the
false-positives.
Also reorder the `auditMember` check sequence to be more correct.

Fixes: #750

Signed-off-by: Artem Chernyshev <[email protected]>
(cherry picked from commit 82da2f4)
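
The commit message describes a membership check that can now run in machine states where it could not before, and so may report a live member as missing. The sketch below illustrates one conservative way to guard against that class of false positive. It is not Omni's `auditMember` code, and every name in it (`Membership`, `MachineReport`, `orphaned_members`) is hypothetical. The idea is to distinguish "this machine positively reports no etcd member" from "this machine's etcd state could not be determined", and to remove nothing while any machine is in the second state.

```python
from enum import Enum
from typing import Iterable, List, NamedTuple, Optional


class Membership(Enum):
    PRESENT = "present"  # machine positively reports an etcd member ID
    ABSENT = "absent"    # machine positively reports that it runs no etcd member
    UNKNOWN = "unknown"  # machine answered, but its etcd state is not known yet


class MachineReport(NamedTuple):
    state: Membership
    member_id: Optional[str] = None  # set only when state is PRESENT


def orphaned_members(
    member_ids: Iterable[str], reports: Iterable[MachineReport]
) -> List[str]:
    """Return member IDs that no machine claims, or [] if any state is UNKNOWN."""
    reports = list(reports)
    # A machine in an unknown state might own any unclaimed member,
    # so treat the whole audit as inconclusive rather than guess.
    if any(r.state is Membership.UNKNOWN for r in reports):
        return []
    claimed = {r.member_id for r in reports if r.state is Membership.PRESENT}
    return [m for m in member_ids if m not in claimed]
```

Under this assumption, a control plane node that is mid-upgrade and reachable but not yet running etcd would be reported as `UNKNOWN`, and the audit would skip removal until a later pass.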
@Unix4ever
Member

Unix4ever commented Dec 25, 2024

I've identified the root cause. The fix has landed in 0.45.1.
