This error is coming from etcd, not k3s itself. It appears pretty clear that the cluster ID across your nodes does not match. I'm not really sure how you might get into this state, where replication is working but the cluster ID is out of sync; you might ask at https://github.com/etcd-io/etcd/discussions. In order to fix this, you will probably need to remove the servers from the cluster one by one until there is only one etcd cluster member, and then re-add them, ensuring that they all join against the same initial node with the same cluster ID.
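For reference, a rough sketch of that remove-and-rejoin cycle, assuming the servers were installed with the get.k3s.io script as shown in the Steps To Reproduce below; node names, the token, and the IP are placeholders:

# From a healthy server: remove one of the other servers from the cluster.
# k3s's etcd controller should also drop the matching etcd member.
kubectl delete node <server-to-remove>

# On the removed host: wipe k3s and its data dir (including local etcd state)
# so it cannot rejoin with the old member/cluster ID.
/usr/local/bin/k3s-uninstall.sh

# Repeat one server at a time until only one etcd member remains, then re-join
# each host against that surviving server so they all adopt its cluster ID:
curl -sfL https://get.k3s.io | K3S_TOKEN=<TOKEN_REDACTED> sh -s - server \
  --server https://<surviving-server-ip>:6443 \
  --disable servicelb,traefik,local-storage \
  --write-kubeconfig-mode 644 \
  --disable-cloud-controller \
  --flannel-backend=wireguard-native \
  --prefer-bundled-bin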
-
Environmental Info:
k3s version v1.27.5+k3s1 (8d074ec)
go version go1.20.7
Node(s) CPU architecture, OS, and Version:
Linux 5.15.0-84-generic #93-Ubuntu SMP Tue Sep 5 17:16:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
3 servers
3 agents
Bare-metal x86
Describe the bug:
I've got a K3S cluster with 3 masters and 3 workers that's been running just fine for the past several months. However, my first etcd master node (node1) is spamming my journalctl log multiple times a second with this pair of similar entries:
The same error does not occur on the other two masters. Other than the spamming, everything appears to be fine. I can use etcdctl to change leader to/from any of the 3 nodes. Bringing down any one node has no ill effect on the overall cluster.
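For reference, this is roughly the etcdctl invocation used to check status and move the leader; a sketch assuming etcdctl is installed on a server node and the default embedded-etcd certificate paths under /var/lib/rancher/k3s/server/tls/etcd:

export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
export ETCDCTL_CERT=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt
export ETCDCTL_KEY=/var/lib/rancher/k3s/server/tls/etcd/server-client.key

# Per-endpoint status across the cluster: leader flag, member ID, DB size, raft term
etcdctl endpoint status --cluster -w table

# Hand leadership to another member, using the hex member ID from the table above
etcdctl move-leader <target-member-id>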
I've tried removing the node from the cluster and rejoining it multiple times. I've re-imaged the node (but kept the same computer name). I've tried compacting and defragging the database multiple times. I've removed two of the three masters and re-joined them. I've tried backing up and restoring the etcd database while initiating a cluster-reset. But no matter what I do, as soon as I bring node1 back online, the errors start spamming on that node. The only thing that changes in the error message is the local-member-id.
The interesting thing is that the IDs for remote-peer-cluster-id, remote-peer-server-name, and local-member-cluster-id never change, even though those nodes have been removed/rejoined several times and have new IDs. It seems as though there is some stale info in the database that I have no idea how to get rid of. Again, everything else seems fine, except for the log spam on node1.
How can I clean up the etcd database to get rid of these old entries (assuming that's the issue here)?
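For context on what might be stale: the current membership can be listed, and a member that no longer corresponds to a live server can be dropped, with etcdctl (same certificate-path assumptions as above). Note this only removes members that are still listed; if the stale cluster ID is baked into node1's local store, the remove-and-rejoin cycle from the reply above is probably still needed.

# List member IDs, names, and peer/client URLs currently known to the cluster
etcdctl member list -w table

# Remove a member that no longer maps to a live server (hex ID from the list above)
etcdctl member remove <stale-member-id>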
Node endpoint status:
Steps To Reproduce:
curl -sfL https://get.k3s.io | K3S_TOKEN=<TOKEN_REDACTED> sh -s - server \
  --server https://<IP_REDACTED>:6443 \
  --disable servicelb,traefik,local-storage \
  --write-kubeconfig-mode 644 \
  --disable-cloud-controller \
  --flannel-backend=wireguard-native \
  --prefer-bundled-bin
Expected behavior:
Stop the "request cluster ID mismatch" error from spamming my journalctl logs