Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Develop 3.0 lastgaspring #945

Merged
merged 4 commits into from
Nov 5, 2019
Merged

Conversation

martinsumner
Copy link
Contributor

When a node leaves a cluster before it closes it will change the ring to a fresh ring (which knows only of the local node). This is necessary to ensure that restarting the node does not lead to any attempt to rejoin.

There is an issue though, that the close of riak is not immediate. It is, as one would expect, an orderly shutdown of the dependencies, before riak_core is the final application to close.

During this closing process, there are multiple processes which are monitoring the ring and may react to the change in the ring. At this stage, as the shutdown is not complete these processes may take unnecessary action, and as the node is still connected to he cluster at an erlang level - this could leak into the rest of the cluster.

This was causing significant problems for riak_ensemble, where ensembles would be changed through the cluster incorrectly. In some cases there would then be a version mismatch that meant that this was never corrected back.

This PR introduces 'lastgasp' metadata to the ring. If the ring is being changed to reflect shutdown on departure this metadata is set after the ring has been persisted (so it is not seen on restart).

There are related PRs then due in riak_kv and riak_repl with ring_handlers amended to check this lastgasp status before reacting to a ring change.

These changes have stabilised ensemble riak_test tests (without actually changing riak_ensemble), and also reduce unexpected crashes during the shutdown processes caused by the false starting of elections/vnodes.

Allow ring to have metadata indicating it is lastgasp - so that subscribers to ring changes cna ignore the last gasp ring defined just before the close.

Prompt for a tick before the close of the riak_core_claimant - to make sure there is always a tick before the last gasp ring is created.
The riak_core_vnode_manager has a mangement tick that may occur during shutdown - that is whilst dependencies are closing, and cor eis not yet closed, the riak_core_vnode_manager could receive and act on a management tick.

When a node is exiting, this cna cause the vnode_manager to start the vnodes required in the fresh ring that was generated as the "last gasp" before closing (so that at on restarting it will be an empty and detatched vnode).

This change prevents that restart - and avoids the associated log noise when those started vnodes fail to start due to the shutdown process.
to make it easier should it ever need to be unset via remote_console should something go wrong.  Also provides function to unset tainted ring re #744
@martinsumner
Copy link
Contributor Author

The set_ring/2 function will update the glocal ring - so in the previous version there was a short window where a process could see the lastgaps ring without the lastgasp status.  Processes can check the ring before the ring_manager has returned.

This should close that window, but still persist without the lastgasp status
Copy link
Contributor

@ThomasArts ThomasArts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QuickCheck tests pass

@martinsumner martinsumner merged commit 868283c into develop-3.0 Nov 5, 2019
albsch pushed a commit to riak-core-lite/riak_core_lite that referenced this pull request Jul 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants