Use branch with extra tracing for scheduled runs #133

Closed

Conversation

cole-miller
Contributor

Trying to track down canonical/raft#470

Signed-off-by: Cole Miller [email protected]

@cole-miller
Contributor Author

Hey, I hit the jackpot even without merging! https://github.com/canonical/jepsen.dqlite/actions/runs/6112143030/job/16588982093

So just looking at the history of index 513 in the in-memory log, there are some oddities. An entry for term 1 is inserted and then acquired several times, bumping the refcount up to 6. Then there is a call to logRelease(513), but the refcount stays at 6 -- that seems like a bug. After that there are two more calls to logRelease(513), and one to logTruncate(513), which reduce the refcount to 3 -- where it stays until the problematic call to logAppend. So in addition to the logRelease call that does nothing, it seems like there are some logAcquire calls that aren't paired with a logRelease.
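
For reference, here's a tiny sketch of the invariant the trace seems to violate: every acquire of a slot must eventually be matched by exactly one release, so a release that leaves the count unchanged means some reference can never be dropped. (slot_t, slot_acquire and slot_release are made-up names for illustration, not the actual log.c API.)

```c
/* Illustration only: hypothetical names, not the raft library's API. */
#include <assert.h>

typedef struct
{
    unsigned long long term; /* Term of the entry stored at this index. */
    unsigned count;          /* Outstanding references to the entry.    */
} slot_t;

static void slot_acquire(slot_t *s)
{
    s->count += 1;
}

static void slot_release(slot_t *s)
{
    assert(s->count > 0);
    s->count -= 1; /* A release that leaves the count unchanged would be a bug. */
}

int main(void)
{
    slot_t index_513 = {1 /* term */, 0 /* count */};
    for (int i = 0; i < 6; i++) {
        slot_acquire(&index_513); /* Refcount climbs to 6, as in the trace. */
    }
    for (int i = 0; i < 6; i++) {
        slot_release(&index_513); /* Each acquire needs a matching release... */
    }
    assert(index_513.count == 0); /* ...so the slot can eventually be reused. */
    return 0;
}
```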

But the final logAppend is also weird, because it seems to use the wrong term number. It's in response to this incoming request, after n1 has stepped down as leader (due to failing to write entry 513 to disk) and n5 has become leader for term 2:

LIBRAFT   1694103739905718314 src/recv_append_entries.c:37 self:3297041220608546238 from:[email protected]:8081 leader_commit:512 n_entries:2 prev_log_index:512 prev_log_term:1, term:2

So the new log entry we try to insert at index 513 should definitely have term 2, but it seems we use term 1 instead!

refsTryInsert(term=1, index=513, count=1)
comparing next_slot->term=1 with term=1

So I would guess there are at least two bugs here.

@MathieuBordere
Contributor

ooph, sounds funky

@cole-miller
Contributor Author

cole-miller commented Sep 21, 2023

But the final logAppend is also weird, because it seems to use the wrong term number. It's in response to this incoming request, after n1 has stepped down as leader (due to failing to write entry 513 to disk) and n5 has become leader for term 2:

[snip]

So the new log entry we try to insert at index 513 should definitely have term 2, but it seems we use term 1 instead!

I was mistaken about this. The deal is that n5 has the original entry 513 because n1 successfully sent it before noticing its own disk write failure. So the term number for that entry in the AppendEntries request from n5 to n1 in the following term is correct; it's just that n1 still has a trace of that same entry in its in-memory log.

It's a little tricky to see exactly where we should be decrementing a refcount for the log but aren't; still working on that.

@cole-miller
Contributor Author

cole-miller commented Sep 21, 2023

I believe the solution to the "mystery of the missing decrements" is that we acquire the entry in question several times to send it to other nodes, but not all of those sends have completed by the time we notice the disk write failure and revert to follower, or even by the time we get the append request from the new leader. So there's a logRelease still waiting around to be called by sendAppendEntriesCb at that point.
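
To make the timing issue concrete, here's a rough sketch of the acquire-on-send / release-in-callback pattern described above. The names pending_send_t, send_entries and on_send_done are invented for illustration; the real code paths are logAcquire/logRelease and sendAppendEntriesCb.

```c
/* Sketch of acquire-on-send / release-in-callback; hypothetical names only. */
#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    unsigned long long index; /* Index of the entry referenced by this send. */
    bool done;                /* Set once the transport finishes the write.  */
} pending_send_t;

/* Stand-ins for logAcquire/logRelease: they just count references. */
static unsigned refcount = 0;
static void log_acquire(void) { refcount++; }
static void log_release(void) { refcount--; }

/* Completion callback: the matching release happens here, which may be long
 * after the leader has already reverted to follower. */
static void on_send_done(pending_send_t *req)
{
    req->done = true;
    log_release();
}

static void send_entries(pending_send_t *req, unsigned long long index)
{
    req->index = index;
    req->done = false;
    log_acquire(); /* Reference held until on_send_done fires. */
}

int main(void)
{
    pending_send_t sends[3];
    for (int i = 0; i < 3; i++) {
        send_entries(&sends[i], 513); /* Same entry sent to several followers. */
    }
    /* If the leader reverts to follower here, refcount is still 3: the
     * releases are waiting in callbacks that haven't run yet. */
    printf("refcount before callbacks: %u\n", refcount);
    for (int i = 0; i < 3; i++) {
        on_send_done(&sends[i]);
    }
    printf("refcount after callbacks: %u\n", refcount);
    return 0;
}
```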

I'm not sure yet whether it's practical to cancel all outstanding AppendEntries sends in the event that a leader's disk write fails -- this crosses the raft/raft_io boundary -- but that would be the obvious way to approach this.

@cole-miller
Contributor Author

Reflecting on this some more, I'm not sure the cancellation part is essential -- we could equally just let all the AppendEntries requests finish sending normally. The core of the fix is having the outgoing leader wait for all pending sends to complete before stepping down. From that perspective, this issue is similar to others we've been dealing with recently.
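
A very rough sketch of what "wait for pending sends before stepping down" could look like -- just a pending-send counter gating the leader->follower transition. All names here are hypothetical; this is not a proposed patch.

```c
/* Hypothetical sketch: defer the leader->follower transition until the last
 * in-flight send completes. */
#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    unsigned pending_sends;   /* Outstanding AppendEntries writes.        */
    bool step_down_requested; /* Disk write failed; we want to step down. */
    bool is_leader;
} leader_state_t;

/* Called when the local disk write fails: defer the transition if sends are
 * still in flight, so their releases can run first. */
static void request_step_down(leader_state_t *s)
{
    s->step_down_requested = true;
    if (s->pending_sends == 0) {
        s->is_leader = false;
    }
}

/* Send-completion callback: the last completion performs the deferred
 * transition. */
static void on_send_complete(leader_state_t *s)
{
    s->pending_sends--;
    if (s->step_down_requested && s->pending_sends == 0) {
        s->is_leader = false;
    }
}

int main(void)
{
    leader_state_t s = {2 /* pending sends */, false, true};
    request_step_down(&s); /* Still leader: 2 sends in flight.    */
    on_send_complete(&s);  /* Still leader: 1 send in flight.     */
    on_send_complete(&s);  /* Last completion triggers step-down. */
    printf("is_leader=%d\n", s.is_leader);
    return 0;
}
```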

I'm wondering if there are other workarounds that don't involve delaying the leader->follower transition. If we accepted that this kind of situation can happen and just removed the offending assertion, in this particular case I don't think any further errors would result -- after all, the log entry that's referenced by the in-memory log and the one that's being received over the wire are identical. But that assertion does serve a purpose in case we somehow get into the situation where the two entries don't match (which would indicate a serious breakdown of our consensus implementation). I think replication.c and log.c could be smarter about distinguishing these two situations, though.
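
Something like the following is what I have in mind for the "smarter" check: tolerate the collision when the leftover in-memory entry and the incoming entry are identical, and keep asserting only on a genuine term mismatch. Hypothetical names, not the actual replication.c logic.

```c
/* Sketch only: hypothetical names, not the real replication.c code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
    unsigned long long index;
    unsigned long long term;
} entry_t;

/* Returns true if the incoming entry collides with a leftover in-memory entry
 * at the same index and the two are identical (safe to tolerate); returns
 * false if there is no collision at all. A term mismatch would indicate a
 * consensus breakdown and still asserts. */
static bool check_collision(const entry_t *in_memory, const entry_t *incoming)
{
    if (in_memory == NULL || in_memory->index != incoming->index) {
        return false; /* No leftover at this index: nothing to reconcile. */
    }
    /* Same index, same term: the leftover is the very entry the new leader
     * is re-sending, so the collision is harmless. */
    assert(in_memory->term == incoming->term);
    return true;
}

int main(void)
{
    entry_t leftover = {513, 1}; /* Entry n1 failed to persist but still references. */
    entry_t incoming = {513, 1}; /* Same entry, re-sent by the new leader n5.         */
    assert(check_collision(&leftover, &incoming));
    return 0;
}
```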

@cole-miller
Contributor Author

I think there is a fundamental issue: the in-memory log data structure can't store more than one log entry with a given index. That means we can't straightforwardly resolve the collision between a log entry that we failed to write to disk (but that's sticking around in the in-memory log while we try to send it to other nodes) and an incoming log entry with the same index sent by a new leader: both want to occupy the same slot in memory. We might just need to upgrade that data structure to support parallel storage for entries that are never going to make it to disk but are still referenced by outgoing AppendEntries requests.
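
As a sketch of what "parallel storage" might look like: entries that have been evicted from their slot but are still referenced could move to a side list keyed by index, freeing the main slot for the new leader's entry. Everything below uses hypothetical names; it's a data-structure sketch, not a patch against log.c.

```c
/* Hypothetical "orphan list" for evicted-but-still-referenced entries. */
#include <stdlib.h>

typedef struct orphan
{
    unsigned long long index; /* Index the entry used to occupy.    */
    unsigned long long term;
    unsigned refs;            /* Outstanding references from sends. */
    struct orphan *next;
} orphan_t;

typedef struct
{
    orphan_t *orphans; /* Entries evicted from their slot but still referenced. */
} log_t;

/* Move a still-referenced entry out of its slot into the orphan list, freeing
 * the slot for an incoming entry with the same index. */
static int log_orphan(log_t *l, unsigned long long index,
                      unsigned long long term, unsigned refs)
{
    orphan_t *o = malloc(sizeof *o);
    if (o == NULL) {
        return -1;
    }
    o->index = index;
    o->term = term;
    o->refs = refs;
    o->next = l->orphans;
    l->orphans = o;
    return 0;
}

/* Called from the send callbacks: drop a reference and free the orphan once
 * the last outstanding send has completed. */
static void log_orphan_release(log_t *l, unsigned long long index)
{
    orphan_t **p = &l->orphans;
    while (*p != NULL && (*p)->index != index) {
        p = &(*p)->next;
    }
    if (*p != NULL && --(*p)->refs == 0) {
        orphan_t *dead = *p;
        *p = dead->next;
        free(dead);
    }
}

int main(void)
{
    log_t l = {NULL};
    log_orphan(&l, 513, 1, 3);   /* Entry 513 evicted with 3 refs still held. */
    log_orphan_release(&l, 513);
    log_orphan_release(&l, 513);
    log_orphan_release(&l, 513); /* Last release frees the orphan.            */
    return l.orphans != NULL;
}
```

The send callbacks would then call something like log_orphan_release instead of logRelease for entries that have been evicted from their slot.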

Any kind of cancellation seems pretty hard to implement, because uv_send on a TCP socket can't be cancelled, and we try to avoid copying the buffers with the entry data.

I'm also not convinced that waiting to step down as leader is a complete solution for this general problem. It seems like basically the same situation could arise without manually stepping down at all, if we're partitioned away from the rest of the cluster. Upgrading our in-memory log storage seems like the most robust way to go here.

@cole-miller
Contributor Author

I've also been thinking about how this ties into a feature request we received to handle disk write failures on the leader more gracefully, by handing over voting rights and sending TimeoutNow instead of moving straight to follower. If we did that, we could view the situation on the leader after a disk write fails as similar or identical to the situation after starting a handover (in the sense of server-side role management).

@cole-miller
Contributor Author

cole-miller commented Sep 26, 2023

Possible fix here: canonical/raft#483
