Figure out why raft quorum bug wasn't detected #28

cole-miller · 2022-09-15T15:35:11Z

Apparently our Jepsen tests weren't able to detect the bug in our implementation of the raft quorum logic that's addressed by canonical/raft#302. We should figure out why not, and strengthen the tests so that they successfully detect the bug.

cole-miller · 2022-09-16T21:32:50Z

It's possible that this comes down to timing. For the bug to manifest, a leader needs to replicate a log entry from its term to a majority of nodes, but then crash/go offline before communicating the new commit index to the node that will become the next leader. Could it be that that window of time is just too short for Jepsen to have a good chance of hitting it?
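
A rough way to put numbers on that intuition is sketched below in Python. It assumes, purely for illustration, that each leader crash injected by the test lands at a uniformly random point in a fixed period and that the vulnerable window has a known length. The function name and every number are hypothetical, not measurements from our setup.

```python
def hit_probability(window_ms: float, period_ms: float, faults: int) -> float:
    """Chance that at least one of `faults` independent, uniformly timed
    leader crashes lands inside a vulnerable window of `window_ms`,
    when each crash falls somewhere in a period of `period_ms`."""
    if not 0 <= window_ms <= period_ms:
        raise ValueError("window must be between 0 and the period")
    per_fault = window_ms / period_ms
    return 1.0 - (1.0 - per_fault) ** faults


# Illustrative numbers only: a 5 ms window, one crash per 10 s, 100 crashes.
narrow = hit_probability(5, 10_000, 100)
# Same run with the window assumed to be 10x longer.
wide = hit_probability(50, 10_000, 100)
print(f"narrow window: {narrow:.3f}, wide window: {wide:.3f}")
```

Under these assumptions the per-run chance of hitting the window grows quickly as the window widens, which is why anything that stretches the window should make the bug easier to catch.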

freeekanayaka · 2022-09-17T07:51:16Z

That sounds plausible to me.

MathieuBordere · 2022-09-22T10:07:18Z

> It's possible that this comes down to timing. For the bug to manifest, a leader needs to replicate a log entry from its term to a majority of nodes, but then crash/go offline before communicating the new commit index to the node that will become the next leader. Could it be that that window of time is just too short for Jepsen to have a good chance of hitting it?

We should have a higher chance of hitting it if we increase the heartbeat interval. If I'm not mistaken, the heartbeat interval is determined by the network latency setting. We could randomize the network latency in the tests to try to hit more timing-sensitive bugs.
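
As a sketch of what randomizing could look like, the Python snippet below draws one latency value per test run from a log-uniform range, seeded so that a failing run can be reproduced. The helper name, the base value, and the range are all assumptions for illustration; none of them come from the existing test code.

```python
import random


def sample_latencies_ms(base_ms: float, max_factor: float,
                        runs: int, seed: int) -> list[float]:
    """Draw one latency per test run, log-uniformly distributed between
    base_ms and base_ms * max_factor. A fixed seed makes the sequence
    reproducible."""
    rng = random.Random(seed)
    return [base_ms * max_factor ** rng.random() for _ in range(runs)]


# Hypothetical values: a 5 ms base latency, scaled by up to 20x, for 8 runs.
latencies = sample_latencies_ms(base_ms=5.0, max_factor=20.0, runs=8, seed=28)
print([round(x, 1) for x in latencies])
```

A log-uniform draw spends as many runs between 1x and 2x as between 10x and 20x, so short and long intervals both get coverage.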

freeekanayaka · 2022-09-22T12:13:55Z

> > It's possible that this comes down to timing. For the bug to manifest, a leader needs to replicate a log entry from its term to a majority of nodes, but then crash/go offline before communicating the new commit index to the node that will become the next leader. Could it be that that window of time is just too short for Jepsen to have a good chance of hitting it?
>
> We should have a higher chance of hitting it if we increase the heartbeat interval. If I'm not mistaken, the heartbeat interval is determined by the network latency setting. We could randomize the network latency in the tests to try to hit more timing-sensitive bugs.

That sounds like a good idea, regardless of whether it helps trigger this specific bug. Rather than randomizing it, perhaps we could just set it very high (e.g. 10x the current value or more) and run all the tests with that high setting, as well as with the normal default setting of course.
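
A minimal sketch of that fixed matrix, in Python: every existing combination is run once at the default latency and once at 10x. The workload and nemesis names and the 5 ms base value are placeholders, not the names or defaults used by our tests.

```python
from itertools import product


def build_matrix(workloads, nemeses, base_latency_ms, factors=(1, 10)):
    """Return one test config per (workload, nemesis, latency factor)
    combination, so each test runs at the default and the scaled latency."""
    return [
        {"workload": w, "nemesis": n, "latency_ms": base_latency_ms * f}
        for w, n, f in product(workloads, nemeses, factors)
    ]


# Placeholder names and base latency, for illustration only.
matrix = build_matrix(["append", "bank"], ["kill", "partition"],
                      base_latency_ms=5)
for cfg in matrix:
    print(cfg)
```

The cost is a doubling of the number of runs, which is easier to reason about than a randomized schedule when a failure needs to be reproduced.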

cole-miller self-assigned this Sep 21, 2022

MathieuBordere added the bug, enhancement, and question labels Jun 12, 2023

cole-miller removed their assignment Sep 2, 2024
