Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix sequencer drain in challenging situations #2673

Merged
merged 1 commit into from
Feb 10, 2025
Merged

Fix sequencer drain in challenging situations #2673

merged 1 commit into from
Feb 10, 2025

Conversation

Copy link

github-actions bot commented Feb 7, 2025

Test Results

  7 files  ±0    7 suites  ±0   3m 25s ⏱️ -52s
 45 tests  - 2   44 ✅  - 2  1 💤 ±0  0 ❌ ±0 
174 runs   - 8  171 ✅  - 8  3 💤 ±0  0 ❌ ±0 

Results for commit 92a6966. ± Comparison against base commit bb09c3d.

This pull request removes 2 tests.
dev.restate.sdktesting.tests.AwaitTimeout ‑ timeout(Client)
dev.restate.sdktesting.tests.RawHandler ‑ rawHandler(Client)

♻️ This comment has been updated with latest results.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for fixing this problem so quickly @AhmedSoliman. I would have cost me quite some sleep to not know what is causing the stuck situation. The changes look good to me. +1 for merging after fixing the failing bifrost_append_and_seal_concurrent test. From quickly skimming over it, it seems that the test assumes that Bifrost::append calls cannot fail (they should probably tolerate Error::Shutdown now).

@@ -279,6 +284,8 @@ impl<T: TransportConnect> SequencerAppender<T> {
});
}

// NOTE: It's very important to keep this look cancellation safe. If the appender future
// was cancelled, we don't want to move the global commit offset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR: When seeing a StoreTaskStatus::Sealed could we exit the store_tasks.join_next() early if we have reached fmajority?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only saves us, at best ~2s (the store timeout) but adds a f-majority check on every append which I'm not sure if it's worth the extra cost tbh. In the next PRs there will be some changes to this anyway to fix a few other issues.

@@ -149,7 +147,14 @@ impl<T: TransportConnect> SequencerAppender<T> {
State::Done | State::Cancelled | State::Sealed => break state,
State::Wave { graylist } => {
self.current_wave += 1;
let Some(next_state) = cancellation
// # Why is this cancellation safe?
// Because we don't await any futures inside the join_next() loop, so we are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would it be a problem if we awaited futures inside the join_next() loop? If I understand things correctly, then the invariant that must hold is that we only acknowledge the write after we have replicated and updated the tail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try and explain it more in the comment. The invariant you mentioned if correct, but theoretically, if we have an await after updating the global tail, and still marked this append as cancelled, then other appenders after this one might be unblocked and finish their quorum write and therefore reporting success. This means that the writer has a hole in the log. This will not happen with the current one-by-one design, hence the "theoretical" bit.

@AhmedSoliman
Copy link
Contributor Author

@tillrohrmann thanks for the review. Yes, your conclusion on the failing test is spot-on. Next PR will fix the test issue(s)

@AhmedSoliman AhmedSoliman merged commit 289249e into main Feb 10, 2025
4 of 5 checks passed
@AhmedSoliman AhmedSoliman deleted the pr2673 branch February 10, 2025 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants