Buggy cluster initialization #233

Open · ZaychukAleksey opened this issue Jul 19, 2021 · 4 comments
@ZaychukAleksey

Hello. I've found two examples of what I believe is buggy behavior.

Run the echo_server example as three nodes in separate terminals:

./echo_server 1 localhost:10001
./echo_server 2 localhost:10002
./echo_server 3 localhost:10003
Example 1

Then add node 2 to the first server and to the third:

# first terminal
calc 1> add 2 localhost:10002
# third terminal
calc 3> add 2 localhost:10002

Ok. Let's see the list of nodes on each server:

calc 1> list
server id 1: localhost:10001 (LEADER)
server id 2: localhost:10002

#--
calc 2> list
server id 1: localhost:10001 (LEADER)
server id 2: localhost:10002

#--
calc 3> list
server id 3: localhost:10003 (LEADER)
server id 2: localhost:10002

How can node 2 now be listed in two clusters with different leaders?

Note that node 2 actually follows node 1; it ignores logs from node 3.

If I shut down node 1, node 2 starts following node 3 some time later, although its list doesn't even show node 3 or the current leader:

calc 2> list
server id 1: localhost:10001
server id 2: localhost:10002

Another odd thing happens when I restart node 1 and add node 2 to it again (after the restart, node 1 thinks it's the only node in the cluster and the leader). Node 2 still receives logs only from node 3, yet it reports node 1 as the leader:

calc 2> list
server id 1: localhost:10001 (LEADER)
server id 2: localhost:10002
Example 2

Add node 2 to node 1, then add node 1 to node 3.

# first terminal
calc 1> add 2 localhost:10002
# third terminal
calc 3> add 1 localhost:10001

Now the picture is as follows:

calc 1> list
server id 1: localhost:10001
server id 2: localhost:10002

calc 2> list
server id 1: localhost:10001
server id 2: localhost:10002

calc 3> list
server id 3: localhost:10003 (LEADER)
server id 1: localhost:10001

Node 1 accepts logs from node 3 even though it doesn't report that node 3 exists. Node 2 accepts nothing; it just sits there idle.

How I got into this situation: my task is a distributed application in which each node knows only the list of the other nodes. If a cluster hasn't been formed yet (e.g. it's the first run, or the previous cluster has fallen apart), it should be formed automatically once two or more nodes are alive, without a manual "start all applications, then pick a leader and register all other nodes on it" step. So I tried the following naive approach: when a node starts, it adds (add_srv) the other nodes. But I quickly ran into two problems:

  • The problem described above, when one node is "simultaneously" added to two nodes that are still independent of each other.
  • I can't add nodes that aren't alive yet.

What's going on in these examples? How can I assemble the cluster automatically based only on the list of nodes (some of which may be offline at the moment)?

@greensky00 (Contributor) commented Jul 19, 2021

Hi @ZaychukAleksey

You can refer to the comment here:
#224 (comment)
What you did is essentially a form of sabotage: the behavior is undefined under such incorrect usage. Raft uses the term to decide which messages to follow, based on the assumption that there is no split-brain. You created a split-brain yourself.

If you want the cluster to contain multiple members at the time each member is initialized, you can let the state manager return a pre-defined cluster config that contains the member list you want. Please refer to the comment here:
#196 (comment)
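
Below is a minimal sketch of that approach, assuming NuRaft's state_mgr interface and the in-memory log store shipped with the examples; the class name predefined_state_mgr and its member-list constructor are hypothetical, not part of the library:

// A minimal sketch: every node returns the same pre-defined cluster config
// from load_config(), so no add_srv calls are needed for the initial members.
#include "libnuraft/nuraft.hxx"
#include "in_memory_log_store.hxx"   // from NuRaft's examples directory

#include <string>
#include <utility>
#include <vector>

using namespace nuraft;

class predefined_state_mgr : public state_mgr {
public:
    predefined_state_mgr(int32 my_id,
                         const std::vector<std::pair<int32, std::string>>& members)
        : my_id_(my_id)
        , log_store_(cs_new<inmem_log_store>())
        , saved_config_(cs_new<cluster_config>())
    {
        // Pre-populate the member list so every node boots with an identical
        // view of the cluster membership.
        for (auto& m : members) {
            saved_config_->get_servers().push_back(
                cs_new<srv_config>(m.first, m.second));
        }
    }

    ptr<cluster_config> load_config() override { return saved_config_; }

    void save_config(const cluster_config& config) override {
        // Keep the latest committed config (a real implementation should
        // persist it to disk so it survives restarts).
        ptr<buffer> buf = config.serialize();
        saved_config_ = cluster_config::deserialize(*buf);
    }

    void save_state(const srv_state& state) override {
        ptr<buffer> buf = state.serialize();
        saved_state_ = srv_state::deserialize(*buf);
    }

    ptr<srv_state> read_state() override { return saved_state_; }
    ptr<log_store> load_log_store() override { return log_store_; }
    int32 server_id() override { return my_id_; }
    void system_exit(const int exit_code) override {}

private:
    int32 my_id_;
    ptr<log_store> log_store_;
    ptr<cluster_config> saved_config_;
    ptr<srv_state> saved_state_;
};

With a state manager like this, every node starts with the same pre-defined member list, so there is no window in which two nodes each believe they lead their own single-member cluster, and members that are offline simply start catching up once they come online.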

@ZaychukAleksey (Author)

Oh... thanks for the clarification, @greensky00.
It seems worth mentioning load_config in this wiki section, to prevent issues like this one from being opened.

@greensky00 (Contributor)

Agreed. I will update the page as well as the examples.

@fwhdzh (Contributor) commented Apr 25, 2024

Hello, I believe the behavior described in this issue is easily misunderstood and should be addressed. The inconsistency in cluster listings across different nodes can cause confusion. Additionally, in real-world environments, it's useful for the system to tolerate certain potential misoperations.

I've submitted a fix pull request #504 for this issue. I would appreciate it if you could take a look at it when convenient.
