When a node leaves a cluster, before it shuts down it changes the ring to a fresh ring (one which knows only of the local node). This is necessary to ensure that restarting the node does not lead to any attempt to rejoin.
There is an issue, though: the shutdown of riak is not immediate. It is, as one would expect, an orderly shutdown of the dependencies, with riak_core the final application to close.
During this closing process, there are multiple processes which are monitoring the ring and may react to the change in the ring. At this stage, as the shutdown is not complete, these processes may take unnecessary action, and as the node is still connected to the cluster at an Erlang level, this could leak into the rest of the cluster.
This was causing significant problems for riak_ensemble, where ensembles would be changed through the cluster incorrectly. In some cases there would then be a version mismatch that meant that this was never corrected back.
This PR introduces 'lastgasp' metadata to the ring. If the ring is being changed to reflect shutdown on departure, this metadata is set after the ring has been persisted (so it is not seen on restart).
There are related PRs then due in riak_kv and riak_repl, with ring_handlers amended to check this lastgasp status before reacting to a ring change.
These changes have stabilised ensemble riak_test tests (without actually changing riak_ensemble), and also reduce unexpected crashes during the shutdown processes caused by the false starting of elections/vnodes.
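The check described above can be pictured with a minimal sketch. This is not the riak_core implementation: the ring here is a plain map standing in for the real ring record, and the module name and every function name (fresh_ring/1, set_lastgasp/1, is_lastgasp/1, handle_ring_update/2) are hypothetical, chosen only to illustrate a handler that ignores a ring marked as lastgasp.

```erlang
-module(lastgasp_sketch).
-export([fresh_ring/1, set_lastgasp/1, is_lastgasp/1, handle_ring_update/2]).

%% Hypothetical stand-in for a ring: a map of members plus metadata.
%% The real riak_core ring is a record; no name here is taken from riak_core.
fresh_ring(Node) ->
    #{members => [Node], meta => #{}}.

%% Mark the ring as the final ring produced by a departing node.
%% In the PR this is done only after the ring has been persisted,
%% so the marker is not seen again on restart.
set_lastgasp(Ring = #{meta := Meta}) ->
    Ring#{meta := Meta#{lastgasp => true}}.

%% A ring with no marker is treated as a normal ring.
is_lastgasp(#{meta := Meta}) ->
    maps:get(lastgasp, Meta, false).

%% Assumed shape of a ring handler: skip the reaction entirely
%% when the ring change only reflects shutdown on departure.
handle_ring_update(Ring, ReactFun) ->
    case is_lastgasp(Ring) of
        true -> ignored;
        false -> {reacted, ReactFun(Ring)}
    end.
```

With this shape, a handler passed the fresh ring of a departing node returns ignored without calling its reaction function, while any ring without the marker is handled as before.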