Fix sequencer drain in challenging situations #2673
Conversation
Test Results: 7 files ±0, 7 suites ±0, 3m 25s ⏱️ (-52s). Results for commit 92a6966, compared against base commit bb09c3d. This pull request removes 2 tests.
♻️ This comment has been updated with latest results.
Thanks a lot for fixing this problem so quickly @AhmedSoliman. It would have cost me quite some sleep not knowing what was causing the stuck situation. The changes look good to me. +1 for merging after fixing the failing bifrost_append_and_seal_concurrent test. From quickly skimming over it, it seems that the test assumes that Bifrost::append calls cannot fail (they should probably tolerate Error::Shutdown now).
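To make the point concrete, here is a minimal, self-contained sketch of how an append site in such a test could tolerate a shutdown error instead of unwrapping it. The types and names are hypothetical, not the real Bifrost API.

```rust
// Hypothetical sketch: AppendError and handle_append_result are illustrative only.
#[derive(Debug)]
enum AppendError {
    // Returned when the node/sequencer is shutting down or draining.
    Shutdown,
    Other(String),
}

// Treat a shutdown racing with the seal as a benign outcome rather than a test failure.
fn handle_append_result(result: Result<u64, AppendError>) -> Result<Option<u64>, AppendError> {
    match result {
        Ok(lsn) => Ok(Some(lsn)),
        Err(AppendError::Shutdown) => Ok(None),
        Err(other) => Err(other),
    }
}
```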
crates/bifrost/src/providers/replicated_loglet/sequencer/appender.rs
```diff
@@ -279,6 +284,8 @@ impl<T: TransportConnect> SequencerAppender<T> {
        });
    }

    // NOTE: It's very important to keep this loop cancellation safe. If the appender future
    // was cancelled, we don't want to move the global commit offset.
```
Unrelated to this PR: when seeing a StoreTaskStatus::Sealed, could we exit the store_tasks.join_next() loop early if we have reached an f-majority?
It only saves us, at best, ~2s (the store timeout), but it adds an f-majority check on every append, which I'm not sure is worth the extra cost, tbh. In the next PRs there will be some changes to this anyway to fix a few other issues.
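For illustration, here is a hedged sketch of the early-exit idea being discussed: break out of the join_next() loop as soon as sealed responses reach an f-majority instead of waiting up to the store timeout for the remaining tasks. The types are hypothetical, not the actual appender code.

```rust
use tokio::task::JoinSet;

// Illustrative status type; the real appender tracks more than this.
enum StoreTaskStatus {
    Stored,
    Sealed,
}

// Returns true as soon as sealed responses form an f-majority, without waiting
// for the remaining store tasks to complete or time out.
async fn sealed_f_majority_reached(
    mut store_tasks: JoinSet<StoreTaskStatus>,
    f_majority: usize,
) -> bool {
    let mut sealed = 0;
    while let Some(result) = store_tasks.join_next().await {
        if let Ok(StoreTaskStatus::Sealed) = result {
            sealed += 1;
            if sealed >= f_majority {
                return true; // early exit: the outcome can no longer change
            }
        }
    }
    false
}
```

The trade-off mentioned above is exactly this: the early return shaves off the store-timeout wait in the sealed case, at the cost of an extra check on every append.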
crates/bifrost/src/providers/replicated_loglet/sequencer/appender.rs
```diff
@@ -149,7 +147,14 @@ impl<T: TransportConnect> SequencerAppender<T> {
            State::Done | State::Cancelled | State::Sealed => break state,
            State::Wave { graylist } => {
                self.current_wave += 1;
                let Some(next_state) = cancellation
                // # Why is this cancellation safe?
                // Because we don't await any futures inside the join_next() loop, so we are
```
Why would it be a problem if we awaited futures inside the join_next() loop? If I understand things correctly, then the invariant that must hold is that we only acknowledge the write after we have replicated and updated the tail.
I'll try to explain it more in the comment. The invariant you mentioned is correct, but theoretically, if we had an await after updating the global tail and this append were still marked as cancelled, then other appenders after this one might be unblocked, finish their quorum write, and therefore report success. This would mean that the writer has a hole in the log. This will not happen with the current one-by-one design, hence the "theoretical" bit.
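A small sketch of the hazard being described, with hypothetical types rather than the actual sequencer code: a future can only be dropped at an .await point, so as long as there is no await between moving the global tail and acknowledging the append, cancellation cannot separate the two; an extra await in between would reopen that window.

```rust
use tokio::sync::{oneshot, watch};

// Cancellation-safe commit: no .await between advancing the global tail and
// sending the acknowledgement, so a cancelled appender can never have moved
// the tail (unblocking later appenders) while still reporting itself cancelled.
async fn commit_and_ack(tail: &watch::Sender<u64>, new_tail: u64, ack: oneshot::Sender<u64>) {
    tail.send_replace(new_tail);
    let _ = ack.send(new_tail);

    // If an .await were inserted between send_replace and the ack above, the
    // future could be dropped at that point: the tail has advanced and later
    // appenders may commit on top of it, yet this append is reported as
    // cancelled, which is the "hole in the log" scenario described above.
}
```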
@tillrohrmann thanks for the review. Yes, your conclusion on the failing test is spot-on. The next PR will fix the test issue(s).
Stack created with Sapling. Best reviewed with ReviewStack.