Figure out why raft quorum bug wasn't detected #28
Comments
It's possible that this comes down to timing. For the bug to manifest, a leader needs to replicate a log entry from its term to a majority of nodes, but then crash or go offline before communicating the new commit index to the node that will become the next leader. Could it be that this window of time is just too short for Jepsen to have a good chance of hitting it?
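To put a rough number on the "window is too short" hypothesis, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model that is not taken from the test suite: each crash is injected at a uniformly random instant within a fixed period, there is exactly one vulnerable window per period, and crashes are independent. The function names and parameters are hypothetical.

```python
def hit_probability(window_ms: float, period_ms: float) -> float:
    """Chance that a single crash, injected at a uniformly random instant
    within one period, lands inside the vulnerable window.

    Assumption: exactly one vulnerable window of length window_ms per period.
    """
    if window_ms < 0 or period_ms <= 0:
        raise ValueError("window_ms must be >= 0 and period_ms must be > 0")
    # A window longer than the period is always hit, so cap at 1.0.
    return min(1.0, window_ms / period_ms)


def hit_probability_over_run(window_ms: float, period_ms: float, crashes: int) -> float:
    """Chance that at least one of `crashes` independent crashes hits the window."""
    if crashes < 0:
        raise ValueError("crashes must be >= 0")
    p = hit_probability(window_ms, period_ms)
    return 1.0 - (1.0 - p) ** crashes
```

Under this model the chance of hitting the bug grows with the ratio of window length to crash period, so a window that is tiny relative to the period can plausibly stay unhit across a whole run.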
That sounds plausible to me.
We should have a higher chance of hitting it if we increase the heartbeat interval. If I'm not mistaken, the heartbeat interval is determined by the network latency setting. We could randomize the network latency in the tests to try to hit more timing-sensitive bugs.
That sounds like a good idea, regardless of whether it helps trigger this specific bug. Rather than randomizing it, perhaps we could just set it very high (e.g. 10x the current value or more) and run all the tests with that high setting, as well as with the normal default setting of course.
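One way to picture the two proposals together is a small helper that lists the latency values to run the suite with: the default, a fixed high multiple, and optionally a few random values in between. This is only an illustrative sketch; `latency_settings` and its parameters are hypothetical and do not correspond to any existing option in the Jepsen tests.

```python
import random
from typing import List, Optional


def latency_settings(default_ms: int, multiplier: int = 10,
                     random_samples: int = 0,
                     seed: Optional[int] = None) -> List[int]:
    """Return the network latency values (in ms) to run the test suite with.

    Always includes the default and default * multiplier. Optionally adds
    `random_samples` values drawn uniformly between those two bounds.
    """
    if default_ms <= 0 or multiplier < 1 or random_samples < 0:
        raise ValueError("default_ms > 0, multiplier >= 1, random_samples >= 0 required")
    high_ms = default_ms * multiplier
    settings = [default_ms, high_ms]
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    for _ in range(random_samples):
        settings.append(rng.randint(default_ms, high_ms))
    return settings
```

For example, `latency_settings(5)` yields the default and the 10x value only, while passing `random_samples` and a `seed` adds reproducible randomized values.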
Apparently our Jepsen tests weren't able to detect the bug in our implementation of the raft quorum logic that's addressed by canonical/raft#302. We should figure out why not, and strengthen the tests so that they successfully detect the bug.