Buggy cluster initialization #233

Open · ZaychukAleksey opened this issue Jul 19, 2021 · 4 comments
@ZaychukAleksey

Hello. I've found two examples of what I believe is buggy behavior.

Run the echo_server example as three nodes in separate terminals:

./echo_server 1 localhost:10001
./echo_server 2 localhost:10002
./echo_server 3 localhost:10003
Example 1

Then add node 2 to the first server and to the third:

# first terminal
calc 1> add 2 localhost:10002
# third terminal
calc 3> add 2 localhost:10002

Ok. Let's see the list of nodes on each server:

calc 1> list
server id 1: localhost:10001 (LEADER)
server id 2: localhost:10002

#--
calc 2> list
server id 1: localhost:10001 (LEADER)
server id 2: localhost:10002

#--
calc 3> list
server id 3: localhost:10003 (LEADER)
server id 2: localhost:10002

How can node 2 now be listed in two clusters with different leaders?

Note that node 2 actually follows node 1; it ignores logs from node 3.

If I shut down node 1, node 2 starts following node 3 some time later, although its list doesn't even show node 3 or the current leader:

calc 2> list
server id 1: localhost:10001
server id 2: localhost:10002

Another odd thing happens when I restart node 1 and add node 2 to it again (after the restart, node 1 thinks it's the only node in the cluster and the leader). Node 2 still receives logs only from node 3, yet it reports node 1 as the leader:

calc 2> list
server id 1: localhost:10001 (LEADER)
server id 2: localhost:10002
Example 2

Add node 2 to node 1, then add node 1 to node 3.

# first terminal
calc 1> add 2 localhost:10002
# third terminal
calc 3> add 1 localhost:10001

Now the picture is as follows:

calc 1> list
server id 1: localhost:10001
server id 2: localhost:10002

calc 2> list
server id 1: localhost:10001
server id 2: localhost:10002

calc 3> list
server id 3: localhost:10003 (LEADER)
server id 1: localhost:10001

Node 1 accepts logs from node 3 even though it doesn't report that node 3 exists. Node 2 accepts nothing; it just sits there idle.

How I got into this situation: my task is a distributed application in which each node knows only the list of the other nodes. If a cluster hasn't been formed yet (e.g. it's the first run, or the previous cluster has fallen apart), it should be formed automatically once two or more nodes are alive, without a manual "start all applications, then pick a leader and register all other nodes on it" step. So I tried the following naive approach: when a node starts, it adds (add_srv) the other nodes. But I quickly ran into two problems:

  • The problem described above, when one node is "simultaneously" added to two nodes that are still independent of each other.
  • I can't add nodes that aren't alive yet.

What's going on in these examples? How can I assemble the cluster automatically based only on the list of nodes (some of which may be offline at the moment)?

@greensky00 (Contributor) commented Jul 19, 2021

Hi @ZaychukAleksey

You can refer to the comment here:
#224 (comment)
What you did is essentially a form of sabotage: the behavior is undefined under such incorrect usage. Raft uses the term to decide which messages to follow, based on the assumption that there is no split-brain. You created a split-brain yourself.

If you want the cluster to contain multiple members at the time each member is initialized, you can let the state manager return a pre-defined cluster config that contains the member list you want. Please refer to the comment here:
#196 (comment)
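
Below is a minimal sketch of that approach, assuming NuRaft's state_mgr interface and the in-memory log store shipped with the examples; the class name predefined_state_mgr and its member-list constructor are hypothetical, not part of the library:

// A minimal sketch: every node returns the same pre-defined cluster config
// from load_config(), so no add_srv calls are needed for the initial members.
#include "libnuraft/nuraft.hxx"
#include "in_memory_log_store.hxx"   // from NuRaft's examples directory

#include <string>
#include <utility>
#include <vector>

using namespace nuraft;

class predefined_state_mgr : public state_mgr {
public:
    predefined_state_mgr(int32 my_id,
                         const std::vector<std::pair<int32, std::string>>& members)
        : my_id_(my_id)
        , log_store_(cs_new<inmem_log_store>())
        , saved_config_(cs_new<cluster_config>())
    {
        // Pre-populate the member list so every node boots with an identical
        // view of the cluster membership.
        for (auto& m : members) {
            saved_config_->get_servers().push_back(
                cs_new<srv_config>(m.first, m.second));
        }
    }

    ptr<cluster_config> load_config() override { return saved_config_; }

    void save_config(const cluster_config& config) override {
        // Keep the latest committed config (a real implementation should
        // persist it to disk so it survives restarts).
        ptr<buffer> buf = config.serialize();
        saved_config_ = cluster_config::deserialize(*buf);
    }

    void save_state(const srv_state& state) override {
        ptr<buffer> buf = state.serialize();
        saved_state_ = srv_state::deserialize(*buf);
    }

    ptr<srv_state> read_state() override { return saved_state_; }
    ptr<log_store> load_log_store() override { return log_store_; }
    int32 server_id() override { return my_id_; }
    void system_exit(const int exit_code) override {}

private:
    int32 my_id_;
    ptr<log_store> log_store_;
    ptr<cluster_config> saved_config_;
    ptr<srv_state> saved_state_;
};

With a state manager like this, every node starts with the same pre-defined member list, so there is no window in which two nodes each believe they lead their own single-member cluster, and members that are offline simply start catching up once they come online.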

@ZaychukAleksey (Author)

Oh... thanks for the clarification, @greensky00.
It seems worth mentioning load_config in this wiki section, to prevent issues like this one from being opened.

@greensky00 (Contributor)

Agreed. I will update the page as well as the examples.

@fwhdzh (Contributor) commented Apr 25, 2024

Hello, I believe the behavior described in this issue is easily misunderstood and should be addressed. The inconsistency in cluster listings across different nodes can cause confusion. Additionally, in real-world environments, it's useful for the system to tolerate certain potential misoperations.

I've submitted a fix pull request #504 for this issue. I would appreciate it if you could take a look at it when convenient.
