Should Ra return an error while multiple nodes try to leave the cluster? #211

RoadRunnr · 2021-02-02T09:23:21Z

tested on 1.1.8

When multiple nodes try to leave the cluster at once (e.g. when trying to stop all nodes in a unit test), it can happen that the master leaves and another node also tries to leave at almost the same time. When the old master process has already terminated, but no new master has been elected yet, a call to ra:leave_and_delete_server on the local node can return {error, noproc}.

IMHO it should be {error,cluster_change_not_permitted} as long as there is at least one node (e.g. the local node) left in the cluster.

The text was updated successfully, but these errors were encountered:

michaelklishin · 2021-02-02T11:07:54Z

Sounds good. Ra intentionally limits membership changes to one at a time due to how complex reasoning about multi-member changes are. Some other Raft implementations have chosen to adopt the same limitations.

This is open source software, so please submit a PR.

RoadRunnr · 2021-02-02T11:30:47Z

I think this issue should not be a question, maybe I didn't describe the problem sufficiently enough.

When multiple ra nodes try to leave the cluster at about the same time, the result of calling ra:leave_and_delete_server can have the following values:

ok
{error,cluster_change_not_permitted}
{error,noproc}
a crash {'EXIT',{normal,{gen_statem,call,[Leader,{leader_call,{command,normal,{'$ra_leave',Node,await_consensus}}},5000]}}}

The {error,noproc} and the crash a clearly abnormal returns.

I would like to submit a PR to fix this bug. But the time required to understand the internals of ra, then fix the bug and possibly write a test case, means that this will happen sometime far, far in the future.

kjnilsson · 2021-02-02T11:57:24Z

Some of these come from natural race conditions where a leader is removed and a request is issued to the old leader pid. I suspect retries is the only reasonable way forward here

RoadRunnr · 2021-02-02T12:07:00Z

Some of these come from natural race conditions where a leader is removed and a request is issued to the old leader pid. I suspect retries is the only reasonable way forward here

That's what I'm doing in my test suite for all the error cases.
However, I would expect that in the noproc and in the crash case, the right thing for the API would be to return {error,cluster_change_not_permitted}. noproc would only make sense in the case where the targeted ServerRef is not running.

kjnilsson · 2021-02-02T12:47:18Z

noproc is probably returned when the leader has gone away and the call is redirected to the now gone leader. If this is the case (I haven't validated) then I think noproc is appropriate

michaelklishin changed the title ~~error while multiple nodes try to leave the cluster~~ Should Ra return an error while multiple nodes try to leave the cluster? Feb 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should Ra return an error while multiple nodes try to leave the cluster? #211

Should Ra return an error while multiple nodes try to leave the cluster? #211

RoadRunnr commented Feb 2, 2021

michaelklishin commented Feb 2, 2021

RoadRunnr commented Feb 2, 2021

kjnilsson commented Feb 2, 2021

RoadRunnr commented Feb 2, 2021

kjnilsson commented Feb 2, 2021

Should Ra return an error while multiple nodes try to leave the cluster? #211

Should Ra return an error while multiple nodes try to leave the cluster? #211

Comments

RoadRunnr commented Feb 2, 2021

michaelklishin commented Feb 2, 2021

RoadRunnr commented Feb 2, 2021

kjnilsson commented Feb 2, 2021

RoadRunnr commented Feb 2, 2021

kjnilsson commented Feb 2, 2021