Stuck in probe state after force leave #129
Comments
It is not obvious how the It does seem though, that the It might be that we can fix this for |
Trying to chase what happens with an update from riak_core_claimant to ensemble: either a join or a leave is sent as a separate request. There is a check that a node does not attempt to remove or add itself.
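As a rough illustration of that guard, here is a minimal Erlang sketch (hypothetical module and function names, not the riak_ensemble_manager code referenced below) in which a request to add or remove the local node itself is rejected:

```erlang
%% Illustrative sketch only - not the riak_ensemble implementation.
%% A join/leave request targeting the local node itself is rejected.
-module(membership_guard).
-export([check_request/2]).

-spec check_request(join | leave, node()) -> ok | {error, self_change}.
check_request(_Op, Node) when Node =:= node() ->
    {error, self_change};
check_request(_Op, _Node) ->
    ok.
```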
This is then converted via riak_ensemble/src/riak_ensemble_manager.erl Line 275 in 0303d7e
On receiving a join call, the node to be added must request the remote cluster state of the claimant node (the node already in the cluster) to check whether the join is allowed: riak_ensemble/src/riak_ensemble_manager.erl Lines 315 to 317 in 0303d7e
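A minimal sketch of that shape of interaction, with assumed names (cluster_state_server, join_allowed/1) standing in for the real riak_ensemble_manager calls - the joining node asks the claimant for its cluster state before deciding whether the join can proceed:

```erlang
%% Illustrative sketch: fetch the claimant's cluster state remotely and
%% apply a placeholder policy check before joining.
-module(join_check).
-export([maybe_join/1]).

maybe_join(ClaimantNode) ->
    case rpc:call(ClaimantNode, cluster_state_server, get_state, []) of
        {badrpc, Reason} ->
            {error, Reason};
        RemoteState ->
            case join_allowed(RemoteState) of
                true  -> {ok, RemoteState};
                false -> {error, not_allowed}
            end
    end.

%% Placeholder policy check; the real code inspects the remote state.
join_allowed(_RemoteState) ->
    true.
```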
The riak_ensemble/src/riak_ensemble_manager.erl Lines 521 to 532 in 0303d7e
The joining node (manager) is then given the same cluster_state as the claimant node it is joining: riak_ensemble/src/riak_ensemble_manager.erl Line 391 in 0303d7e
This state is saved, and we call riak_ensemble/src/riak_ensemble_manager.erl Line 322 in 0303d7e
Once there is a reply from the join, this node will prompt all the state changes required in the local ETS table and the local peers (adding and deleting as required): riak_ensemble/src/riak_ensemble_manager.erl Lines 610 to 641 in 0303d7e
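To illustrate the ETS side of this, a minimal sketch (the table and key names are assumptions, not those used by riak_ensemble_manager) of caching the cluster state in a public ETS table so that other local processes can read it without calling the manager:

```erlang
%% Illustrative sketch: keep a cached copy of the cluster state in a
%% public named ETS table.
-module(state_cache).
-export([init/0, save/1, read/0]).

-define(TAB, cluster_state_cache).

init() ->
    _ = ets:new(?TAB, [named_table, public, set]),
    ok.

save(ClusterState) ->
    true = ets:insert(?TAB, {cluster_state, ClusterState}),
    ok.

read() ->
    case ets:lookup(?TAB, cluster_state) of
        [{cluster_state, CS}] -> {ok, CS};
        []                    -> undefined
    end.
```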
This is based on the remote state of the claimant node (State2) i.e. it does not currently reflect the join. I'm not sure what peers can therefore be created at this stage, as none of the ensembles in the remote state will have changed to reflect the join. The remove is relatively simple: riak_ensemble/src/riak_ensemble_manager.erl Lines 335 to 338 in 0303d7e
The call to 'riak_ensemble_root:join/1' will result in a call to the passed-in claimant Node (the one in the cluster), with the message of the call containing the local node. In other words, both the join and the leave requests lead to calls to a node in the cluster with a message containing a reference to the node joining or leaving. riak_ensemble/src/riak_ensemble_root.erl Lines 47 to 65 in 0303d7e
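The shape of these requests might be sketched as follows; the registered name cluster_root and the use of gen_server:call are illustrative assumptions, not how riak_ensemble_root actually dispatches:

```erlang
%% Illustrative sketch: both join and remove end up as a call to a node
%% already in the cluster, carrying the local (joining/leaving) node in
%% the message.
-module(root_requests).
-export([join/1, remove/1]).

join(ClusterNode) ->
    call_remote(ClusterNode, {join, node()}).

remove(ClusterNode) ->
    call_remote(ClusterNode, {remove, node()}).

call_remote(Node, Msg) ->
    gen_server:call({cluster_root, Node}, Msg, 30000).
```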
The call will result in: riak_ensemble/src/riak_ensemble_root.erl Lines 78 to 84 in 0303d7e
Which will send an event: riak_ensemble/src/riak_ensemble_peer.erl Line 297 in 0303d7e
The target here is the atom root and not a pid(), so this will lead to:
Which we can unpack as:
the Target is still root and not a pid, so this results in a call to
This is going to cast to Node: riak_ensemble/src/riak_ensemble_router.erl Line 131 in 0303d7e
Where the Node is the Node which is joining or leaving. The ensemble is root. There appear to be 7 routers which have been generated, so the code next picks one of these routers in the pool at random to use as a destination: riak_ensemble/src/riak_ensemble_router.erl Line 139 in 0303d7e
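A minimal sketch of that random pool selection, with an assumed router naming scheme and an assumed gen_server-based cast; the real pool size and registered names may differ:

```erlang
%% Illustrative sketch: pick one of N registered router processes at
%% random and cast the message to it on the target node.
-module(router_pool).
-export([cast/2]).

-define(POOL_SIZE, 7).

cast(Node, Msg) ->
    N = rand:uniform(?POOL_SIZE),
    Router = list_to_atom("router_" ++ integer_to_list(N)),
    gen_server:cast({Router, Node}, Msg).
```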
This appears to be just and The message being sent was
riak_ensemble/src/riak_ensemble_router.erl Lines 219 to 235 in 0303d7e
The purpose of this appears to be to find the leader of the root ensemble - be it on this node, or any other node - and forward the message to that leader. If we assume that peer is in the leading state: riak_ensemble/src/riak_ensemble_peer.erl Lines 715 to 721 in 0303d7e
This will then select a worker from a pool: riak_ensemble/src/riak_ensemble_peer.erl Lines 1286 to 1289 in 0303d7e
The riak_ensemble/src/riak_ensemble_peer.erl Lines 1369 to 1416 in 0303d7e
At this stage: Key is The PUT starts with an riak_ensemble/src/riak_ensemble_peer.erl Lines 340 to 341 in 0303d7e
The hash is in fact a binary containing the epoch and sqn for the object. Note that this must equal the equivalent information in the store. Note the comments here - the AAE store (synctree) is different in purpose from the usual KV AAE store. There is then a check that the epoch of the object is the same as the epoch in the fact in the state of the peer. If riak_ensemble/src/riak_ensemble_peer.erl Lines 1566 to 1567 in 0303d7e
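A sketch of the idea that the synctree "hash" is really just the object's epoch and sequence number encoded as a binary; the record shape and the 64-bit encoding here are assumptions, not riak_ensemble's actual format:

```erlang
%% Illustrative sketch: the tree hash encodes {epoch, seq}, so checking
%% the tree amounts to comparing that pair with what the store holds.
-module(obj_hash).
-export([hash/1, matches_store/2]).

-record(obj, {epoch :: non_neg_integer(),
              seq   :: non_neg_integer(),
              value :: term()}).

hash(#obj{epoch = Epoch, seq = Seq}) ->
    <<Epoch:64/integer, Seq:64/integer>>.

matches_store(Obj, StoredHash) ->
    hash(Obj) =:= StoredHash.
```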
Assuming that the store is up to date, the key can be modified using do_modify_fsm/6, which is a function that sends a reply after obtaining the result of the modification. Included in the Args of the call: riak_ensemble/src/riak_ensemble_peer.erl Lines 1602 to 1608 in 0303d7e
In this case being:
The next object sequence number is the one returned from this function: riak_ensemble/src/riak_ensemble_peer.erl Lines 1776 to 1791 in 0303d7e
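A sketch of how a next sequence number might be assigned, under the assumption that a newer leader epoch resets the sequence while an unchanged epoch simply increments it; this is the general shape of the logic, not a copy of riak_ensemble_peer:

```erlang
%% Illustrative sketch of assigning the next {epoch, seq} for a modified
%% object.
-module(obj_seq).
-export([next_seq/3]).

%% If the leader's epoch has moved past the object's epoch, restart the
%% sequence under the new epoch; otherwise increment the sequence.
next_seq(ObjEpoch, _ObjSeq, LeaderEpoch) when LeaderEpoch > ObjEpoch ->
    {LeaderEpoch, 0};
next_seq(_ObjEpoch, ObjSeq, LeaderEpoch) ->
    {LeaderEpoch, ObjSeq + 1}.
```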
The do_kmodify function which was passed in is then going to, in turn, call the function within its arguments to get the updated object to be stored: riak_ensemble/src/riak_ensemble_peer.erl Lines 302 to 309 in 0303d7e
riak_ensemble/src/riak_ensemble_root.erl Lines 110 to 111 in 0303d7e
riak_ensemble/src/riak_ensemble_root.erl Lines 123 to 138 in 0303d7e
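The update functions referenced above amount to adding or removing a node from the member list held in the cluster-state object. A sketch with an assumed record shape (not riak_ensemble's actual cluster-state structure):

```erlang
%% Illustrative sketch: take the current cluster-state object and return
%% a new one with the node added to or removed from the member list.
-module(cluster_update_sketch).
-export([add_member/2, del_member/2]).

-record(cluster_state, {members = [] :: [node()]}).

add_member(Node, #cluster_state{members = Members} = CS) ->
    case lists:member(Node, Members) of
        true  -> CS;
        false -> CS#cluster_state{members = [Node | Members]}
    end.

del_member(Node, #cluster_state{members = Members} = CS) ->
    CS#cluster_state{members = lists:delete(Node, Members)}.
```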
With a modified object the actual put can now be performed: riak_ensemble/src/riak_ensemble_peer.erl Lines 1670 to 1698 in 0303d7e
There will then be put message events sent to all the peers in this ensemble, and a local_put message will be sent back to this peer (the leader). The put message will require the peers to be in the following state, and the local_put will require the peer to be in the leading state. Both local_put and put will result in a Mod:put. In riak_kv, the Mod is riak_kv_ensemble_backend: https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_ensemble_backend.erl#L147-L149
The backend itself is not doing the storage; it is forwarding the put on to the vnode. If the vnode is forwarding, it will forward the put, otherwise it will do an actual put: https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_vnode.erl#L2105-L2125 If a handoff target is specified it will do a raw_put to the handoff node: https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_vnode.erl#L2248-L2252
After completing the PUT, the root leader needs to send an updated hash to all sync'd trees to make sure that the AAE state is universally updated with the object state. A reply is then sent back with the updated object (the cluster state), although the object is just ignored: riak_ensemble/src/riak_ensemble_root.erl Lines 86 to 87 in 0303d7e
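The vnode-level decision described above (forward the put, raw_put to a handoff target, or write locally) can be sketched as follows; the record fields and return shapes are assumptions for illustration, not the riak_kv_vnode code:

```erlang
%% Illustrative sketch of the forwarding decision for an ensemble put
%% arriving at a vnode.
-module(vnode_put_sketch).
-export([handle_ensemble_put/2]).

-record(state, {forward        :: node() | undefined,
                handoff_target :: node() | undefined}).

handle_ensemble_put(Obj, #state{forward = Fwd}) when Fwd =/= undefined ->
    %% The vnode is forwarding, so pass the put on.
    {forward_put, Fwd, Obj};
handle_ensemble_put(Obj, #state{handoff_target = HO}) when HO =/= undefined ->
    %% A handoff target is set, so send a raw_put to the handoff node.
    {raw_put, HO, Obj};
handle_ensemble_put(Obj, #state{}) ->
    %% Otherwise store locally.
    {local_put, Obj}.
```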
So this has updated the root ensemble's stored view of the cluster state. However, I don't believe at this stage it has actually done anything with regards to updating the operational state to reflect the change. This must be prompted via the state_changed function in Stage (3), which is called after all these ensemble updates have completed. This calculates any peer changes, and requests them via the peer supervisor: riak_ensemble/src/riak_ensemble_manager.erl Lines 697 to 737 in 0303d7e
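A sketch of what such a reconciliation step typically looks like: diff the peers this node should be running under the new cluster state against those currently running, then ask a supervisor to start or stop accordingly. The function names here are placeholders, not the riak_ensemble peer supervisor API:

```erlang
%% Illustrative sketch: compute peer changes and hand them to a
%% (placeholder) peer supervisor.
-module(state_changed_sketch).
-export([state_changed/2]).

state_changed(WantedPeers, RunningPeers) ->
    ToStart = [P || P <- WantedPeers, not lists:member(P, RunningPeers)],
    ToStop  = [P || P <- RunningPeers, not lists:member(P, WantedPeers)],
    lists:foreach(fun start_peer/1, ToStart),
    lists:foreach(fun stop_peer/1, ToStop),
    {ToStart, ToStop}.

%% Placeholders standing in for calls to the peer supervisor.
start_peer(_Peer) -> ok.
stop_peer(_Peer) -> ok.
```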
.... work in progress ... still more to come |
Looking at examples of the test failing, having added some logs, the JOIN and LEAVE requests are all seen, and all report success: test log:
dev1:
dev2:
dev1:
dev3:
test log:
dev1:
dev4:
dev5:
dev6:
test log:
dev7:
dev1:
dev3 (which now appears to be the claimant):
test log:
Note here is the fault - quorum is never ready. dev3:
dev8:
dev1:
It is odd that there are two remove messages for the final force replace. This appears to be consistent across different runs of the test |
The full update cycle in the previous comment appears to be completed. Some extra logging has been added in. Initial cluster formed - dev1:
dev2:
dev3:
Cluster expanded dev1:
dev4:
Replace dev4:
dev7:
Force Replace dev2:
So these final 2 peer changes are wrong. The dev2 node is being removed here, and it tries to add itself twice to the ensemble as it is closing down. This happens before the join and the leave messages associated with the force replace .. and these join and leave messages do not generate any peer changes associated with this ensemble. |
These additions occur, not prompted by a join/leave message, whilst riak is in the process of shutting down on this node:
|
This is tested with the extra logging in place. However, there is no join or leave message until 6s later (presumably at the next scheduled tick). So I don't believe these peer changes are a result of riak_core triggering them. |
The following ensemble_changes are logged on node2 at this point:
|
Then on dev3 (and other nodes) there are Ensemble changes after the initial join/remove messages:
|
The fact that we see ensemble changes but not peer changes after the join/remove from the force_replace perhaps explains the problem. The peer changes generated by the replaced node may have advanced the peers' {epoch, sqn} beyond that of the manager. |
There is an extra log in the leaving state on riak_ensemble_peer that generates this:
Something is calling update_members here that is not prompted by a join or leave. |
This is it, I think. There are two potential problems here:
|
The tick on the backend is very aggressive, so wait 20 ticks of a stable ring before reacting. It would be preferable if other processes reacted to significant ring changes, because of basho/riak_ensemble#129. Perhaps this shouldn't do anything at all?
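A sketch of the "wait for N stable ticks" idea, assuming the ring can be summarised by a hash that is compared tick to tick; the constant and return values are illustrative only:

```erlang
%% Illustrative sketch: only react once the ring hash has been unchanged
%% for the required number of consecutive ticks.
-module(stable_ring_sketch).
-export([tick/2]).

-define(REQUIRED_STABLE_TICKS, 20).

%% State is {LastRingHash, StableCount}.
tick(RingHash, {RingHash, Count}) when Count + 1 >= ?REQUIRED_STABLE_TICKS ->
    {react, {RingHash, 0}};
tick(RingHash, {RingHash, Count}) ->
    {wait, {RingHash, Count + 1}};
tick(RingHash, {_OldHash, _Count}) ->
    %% The ring changed; restart the stability count.
    {wait, {RingHash, 0}}.
```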
There is also another process in riak_kv monitoring the ring for ensemble: I don't know why there is a need for three separate ticks to monitor here. We appear to have stability if:
|
The riak_test ensemble_ring_changes attempts to test ensemble when subject to various cluster admin changes. Unfortunately the original test didn't wait for some of the cluster operations to complete before making its assertions, and so the test would pass without the scenario actually being proven. There is now an updated version of this test, which fails most of the time - https://github.com/basho/riak_test/pulls.
The failure occurs after the force-replace operation. After this some ensembles are stuck in the probe state. Generally the situation is as follows:
- There are peers probing, but those peers think that there is a quorum of nodes on a now unreachable node (the one which has left under the force-leave).
- The probe state constantly returns with timeout errors (unavailable nodes lead to immediate nacks - these are then interpreted as a quorum for timeout); the peer then does a probe delay and re-probes.
- The riak_ensemble_manager has a more correct view (i.e. one that doesn't show ensemble peers on the unavailable node), and this is checked after the probe has failed.
- The riak_ensemble_manager has a lower {epoch, sqn} than that of the peer view, so the peers persist in using their incorrect view and the cycle of probe failures continues.
- Despite there being no nodes down, multiple ensembles are unusable.
This is true on both the current develop-3.0 branch as well as the develop-3.0-lastgaspring branch, which corrects the issue in #128.
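To make the failure loop concrete, a minimal sketch under the assumptions that peers are identified as {Id, Node} pairs, that an unreachable node replies with an immediate nack, and that a simple majority of nacks is treated as a probe timeout followed by a delayed retry; none of these names come from riak_ensemble itself:

```erlang
%% Illustrative sketch of the probe loop described in this issue.
-module(probe_loop_sketch).
-export([probe/1]).

probe(Members) ->
    Replies = [probe_peer(M) || M <- Members],
    Nacks = length([R || R <- Replies, R =:= nack]),
    %% A majority of nacks counts as a failed probe in this sketch, so a
    %% stale member list pointing at an unreachable node fails every time.
    case Nacks * 2 >= length(Members) of
        true  -> {timeout, retry_after_delay};
        false -> ok
    end.

probe_peer({_Id, Node}) ->
    %% Unreachable nodes are treated as an immediate nack.
    case net_adm:ping(Node) of
        pong -> ack;
        pang -> nack
    end.
```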