This error is coming from etcd, not k3s itself. It appears pretty clear that the cluster ID across your nodes does not match. I'm not really sure how you might get into this state, where replication is working but the cluster ID is out of sync; you might ask at https://github.com/etcd-io/etcd/discussions. In order to fix this, you will probably need to remove the servers from the cluster one by one until there is only one etcd cluster member, and then re-add them, ensuring that they all join against the same initial node with the same cluster ID.
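For reference, a rough sketch of that remove-and-rejoin cycle, assuming the servers were installed with the get.k3s.io script as shown in the Steps To Reproduce below; node names, the token, and the IP are placeholders:

# From a healthy server: remove one of the other servers from the cluster.
# k3s's etcd controller should also drop the matching etcd member.
kubectl delete node <server-to-remove>

# On the removed host: wipe k3s and its data dir (including local etcd state)
# so it cannot rejoin with the old member/cluster ID.
/usr/local/bin/k3s-uninstall.sh

# Repeat one server at a time until only one etcd member remains, then re-join
# each host against that surviving server so they all adopt its cluster ID:
curl -sfL https://get.k3s.io | K3S_TOKEN=<TOKEN_REDACTED> sh -s - server \
  --server https://<surviving-server-ip>:6443 \
  --disable servicelb,traefik,local-storage \
  --write-kubeconfig-mode 644 \
  --disable-cloud-controller \
  --flannel-backend=wireguard-native \
  --prefer-bundled-bin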
-
Environmental Info:
k3s version v1.27.5+k3s1 (8d074ec)
go version go1.20.7
Node(s) CPU architecture, OS, and Version:
Linux 5.15.0-84-generic #93-Ubuntu SMP Tue Sep 5 17:16:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
3 servers
3 agents
Bare-metal x86
Describe the bug:
I've got a K3S cluster with 3 masters and 3 workers that's been running just fine for the past several months. However, my first etcd master node (node1) is spamming my journalctl log multiple times a second with this pair of similar entries:
The same error does not occur on the other two masters. Other than the spamming, everything appears to be fine. I can use etcdctl to change leader to/from any of the 3 nodes. Bringing down any one node has no ill effect on the overall cluster.
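For reference, this is roughly the etcdctl invocation used to check status and move the leader; a sketch assuming etcdctl is installed on a server node and the default embedded-etcd certificate paths under /var/lib/rancher/k3s/server/tls/etcd:

export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
export ETCDCTL_CERT=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt
export ETCDCTL_KEY=/var/lib/rancher/k3s/server/tls/etcd/server-client.key

# Per-endpoint status across the cluster: leader flag, member ID, DB size, raft term
etcdctl endpoint status --cluster -w table

# Hand leadership to another member, using the hex member ID from the table above
etcdctl move-leader <target-member-id>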
I've tried removing the node from the cluster and rejoining it multiple times. I've re-imaged the node (but kept the same computer name). I've tried compacting and defragging the database multiple times. I've removed two of the three masters and re-joined them. I've tried backing up and restoring the etcd database while initiating a cluster-reset. But no matter what I do, as soon as I bring node1 back online, the errors start spamming on that node. The only thing that changes in the error message is the local-member-id.
The interesting thing is that the IDs for remote-peer-cluster-id, remote-peer-server-name, and local-member-cluster-id never change, even though those nodes have been removed/rejoined several times and have new IDs. It seems as though there is some stale info in the database that I have no idea how to get rid of. Again, everything else seems fine, except for the log spam on node1.
How can I clean up the etcd database to get rid of these old entries (assuming that's the issue here)?
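For context on what might be stale: the current membership can be listed, and a member that no longer corresponds to a live server can be dropped, with etcdctl (same certificate-path assumptions as above). Note this only removes members that are still listed; if the stale cluster ID is baked into node1's local store, the remove-and-rejoin cycle from the reply above is probably still needed.

# List member IDs, names, and peer/client URLs currently known to the cluster
etcdctl member list -w table

# Remove a member that no longer maps to a live server (hex ID from the list above)
etcdctl member remove <stale-member-id>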
Node endpoint status:
Steps To Reproduce:
curl -sfL https://get.k3s.io | K3S_TOKEN=<TOKEN_REDACTED> sh -s - server \
  --server https://<IP_REDACTED>:6443 \
  --disable servicelb,traefik,local-storage \
  --write-kubeconfig-mode 644 \
  --disable-cloud-controller \
  --flannel-backend=wireguard-native \
  --prefer-bundled-bin
Expected behavior:
Stop the "request cluster ID mismatch" error from spamming my journalctl logs