wen_restart: correctly handle HeaviestFork received from the coordinator. #2923

wen-coding · 2024-09-13T17:10:35Z

After receiving HeaviestFork from the coordinator, repair missing slots and verify the coordinator's choice.
Send HeaviestFork no matter what the non-coordinator decides.
Make the coordinator print current stats every 10 seconds while aggregating HeaviestFork from others.
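For the third item, a minimal sketch of how a 10-second reporting cadence could be driven from inside an aggregation loop. `PeriodicReporter` and `should_report` are hypothetical names used only for illustration; they are not taken from this PR.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not from the PR): tracks when the next stats
/// report is due while the coordinator keeps aggregating.
struct PeriodicReporter {
    interval: Duration,
    last_report: Instant,
}

impl PeriodicReporter {
    fn new(interval: Duration, now: Instant) -> Self {
        Self {
            interval,
            last_report: now,
        }
    }

    /// Returns true, and restarts the timer, once at least `interval`
    /// has elapsed since the previous report.
    fn should_report(&mut self, now: Instant) -> bool {
        if now.duration_since(self.last_report) >= self.interval {
            self.last_report = now;
            true
        } else {
            false
        }
    }
}
```

In an aggregation loop, the caller would check `should_report(Instant::now())` on every iteration and log the current stats only when it returns true, so the loop itself never blocks on the reporting interval.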



carllin · 2024-10-19T05:05:20Z

wen-restart/src/wen_restart.rs

+                // Wait for 10 seconds so the heaviest fork gets out.
+                sleep(Duration::from_secs(10));


This seems hacky. We should probably either wait for the actual gossip push interval or, better, directly call something like cluster_info.flush_push_queue() to guarantee a push.

Changed to one gossip push interval.

Why not use flush_push_queue directly? A timing-based wait is less dependable.

Chatted with Greg: flush_push_queue doesn't guarantee that messages are sent out; it only moves messages from the local queue to the crds table. It might be possible to expose a new interface to check that things have actually been sent out, but let's do that in another PR, so we can start invalidator testing for this one earlier.
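A rough sketch of where this thread ended up: flush the local push queue first, then wait one push interval so the gossip loop has a chance to send. The `GossipHandle` trait and the `push_interval` parameter are stand-ins invented for this example; only the name `flush_push_queue` comes from the discussion, and the real `ClusterInfo` type is not used here.

```rust
use std::{thread::sleep, time::Duration};

/// Hypothetical stand-in for the gossip handle (not the real ClusterInfo).
trait GossipHandle {
    /// Moves locally queued messages into the gossip table. As noted in
    /// the thread, this alone does not prove anything was sent.
    fn flush_push_queue(&self);
}

/// Flush first so the message is not stuck in the local queue, then wait
/// one push interval to give the gossip loop a chance to push it out.
/// `push_interval` is an assumed parameter supplied by the caller.
fn flush_and_wait<G: GossipHandle>(gossip: &G, push_interval: Duration) {
    gossip.flush_push_queue();
    sleep(push_interval);
}
```

This is still timing based, so it only makes delivery likely. Confirming that a message actually left the node would need the new interface mentioned above.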

carllin · 2024-10-19T05:18:43Z

wen-restart/src/wen_restart.rs

+                    },
+                )?;
+                WenRestartProgressInternalState::HeaviestFork {
+                    new_root_slot: slot,


Side note: I just realized it's kind of confusing that these are called root slots when they're not really root slots.

Renamed to final_restart_slot. Does this sound better?

Let's keep it consistent and rename it to my_heaviest_fork_slot.


carllin · 2024-10-19T05:29:50Z

wen-restart/src/wen_restart.rs

+        .into());
+    }
+    let my_bankhash = if !slots.is_empty() {
+        find_bankhash_of_heaviest_fork(


What happens here if coordinator_heaviest_slot is accurate and our heaviest slot is accurate, but we've repaired blocks with the wrong ancestors? Will find_bankhash_of_heaviest_fork just error out on replay with a dead slot?

find_bankhash_of_heaviest_fork blindly follows the given slots to find the bankhash. If the given slots form a valid ancestor chain (checking that it's frozen), a bankhash is returned, and we can then find out that the bankhash is actually different.

ok yeah looks like it will just error out in process_single_slot with a replay error if the ancestor chain is wrong
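To make the two failure modes in this thread concrete, here is a simplified model, not the PR's implementation. Every type and function below (`ReplayedSlot`, `VerifyError`, `bankhash_of_chain`, `verify_coordinator_choice`) is hypothetical. The sketch checks parent links explicitly, whereas the real code surfaces a wrong ancestor chain as a replay error in process_single_slot. It also treats an empty slot list as an error, which is an assumption made for this example only.

```rust
use std::collections::HashMap;

type Slot = u64;
type Hash = [u8; 32]; // simplified stand-in for a bank hash

/// Hypothetical record of a locally replayed slot.
struct ReplayedSlot {
    parent: Slot,
    frozen: bool,
    bankhash: Hash,
}

#[derive(Debug, PartialEq)]
enum VerifyError {
    EmptyChain,
    MissingSlot(Slot),
    BrokenChain { slot: Slot, expected_parent: Slot },
    NotFrozen(Slot),
    BankHashMismatch { mine: Hash, coordinator: Hash },
}

/// Follows `slots` in order starting from `root` and returns the bankhash
/// of the last slot, failing if a slot is missing, has an unexpected
/// parent, or is not frozen.
fn bankhash_of_chain(
    root: Slot,
    slots: &[Slot],
    replayed: &HashMap<Slot, ReplayedSlot>,
) -> Result<Hash, VerifyError> {
    let mut parent = root;
    let mut last_hash = None;
    for &slot in slots {
        let entry = replayed.get(&slot).ok_or(VerifyError::MissingSlot(slot))?;
        if entry.parent != parent {
            return Err(VerifyError::BrokenChain { slot, expected_parent: parent });
        }
        if !entry.frozen {
            return Err(VerifyError::NotFrozen(slot));
        }
        last_hash = Some(entry.bankhash);
        parent = slot;
    }
    last_hash.ok_or(VerifyError::EmptyChain)
}

/// A chain that replays cleanly can still disagree with the coordinator:
/// compare our bankhash against the one the coordinator announced.
fn verify_coordinator_choice(
    root: Slot,
    slots: &[Slot],
    replayed: &HashMap<Slot, ReplayedSlot>,
    coordinator_hash: Hash,
) -> Result<(), VerifyError> {
    let mine = bankhash_of_chain(root, slots, replayed)?;
    if mine != coordinator_hash {
        return Err(VerifyError::BankHashMismatch { mine, coordinator: coordinator_hash });
    }
    Ok(())
}
```

The point of separating the two functions is that a bad ancestor chain fails while walking the chain, while a valid chain with a different result only shows up in the final hash comparison.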

…tor. (anza-xyz#2923)

* wen_restart: Change HeaviestFork stage to a leader based approach.
* Fix logic to calculate slots to repair, add more logs.
* Let the leader generate snapshot and print error log before continuing to send out heaviest fork.
* Filter ancestors older than root.
* Reduce lock scope.
* Rename variables and functions.
* Make leader aggregate the heaviest fork of everyone.
* Move heaviest fork aggregation to DONE stage.
* No warning when receiving HeaviestFork from non-leader.
* Add test for receive_restart_heaviest_fork.
* Add test for repair_heaviest_fork.
* Add test for verify_leader_heaviest_fork.
* Rename wen_restart_leader to wen_restart_coordinator.
* Fix a bad merge
* Non-coordinator should use coodinator's (slot, hash) after verification.
* Remove trailing whitespace.
* Fix a bad merge
* Fix a bad merge.
* Remove unnecessary changes.
* Make coordinator select different slot in test, wait before exiting with error.
* Add send_and_receive_heaviest_fork and test.
* Make the coordinator print stats every 10 seconds.
* Rename variables and add comments.
* Rename final_restart_slot/hash with my_heaviest_fork_slot/hash
* flush_push_queue before waiting to speed things up.

wen_restart: Change HeaviestFork stage to a leader based approach.

bcc98af

wen-coding marked this pull request as draft September 13, 2024 17:10

wen-coding added 11 commits September 13, 2024 14:52

Fix logic to calculate slots to repair, add more logs.

0f929b3

Let the leader generate snapshot and print error log before continuing to send out heaviest fork.

63f2bb1

Filter ancestors older than root.

0590c8e

Reduce lock scope.

ac5c6b1

Rename variables and functions.

0a5efe8

Make leader aggregate the heaviest fork of everyone.

0545333

Move heaviest fork aggregation to DONE stage.

f719c5b

No warning when receiving HeaviestFork from non-leader.

6f82a36

Add test for receive_restart_heaviest_fork.

4fdc228

Add test for repair_heaviest_fork.

f915803

Add test for verify_leader_heaviest_fork.

31395a7

wen-coding marked this pull request as ready for review September 17, 2024 07:38

wen-coding self-assigned this Sep 17, 2024

wen-coding requested review from carllin and AshwinSekar September 17, 2024 07:38

Rename wen_restart_leader to wen_restart_coordinator.

9147dc9

wen-coding marked this pull request as draft September 24, 2024 20:47

wen-coding added 11 commits October 7, 2024 20:03

Merge branch 'master' into wen_restart_leader_for_heaviest_fork

05928aa

Fix a bad merge

3142433

Non-coordinator should use coodinator's (slot, hash) after verification.

8dfa0a3

Remove trailing whitespace.

85ed8d1

Merge branch 'master' into wen_restart_leader_for_heaviest_fork

208b2c7

Merge branch 'master' into wen_restart_leader_for_heaviest_fork

8ea01f7

Fix a bad merge

ba540c0

Merge branch 'master' into wen_restart_leader_for_heaviest_fork

2132509

Fix a bad merge.

21f77b6

Remove unnecessary changes.

cda571e

Make coordinator select different slot in test, wait before exiting with error.

0ec3fc6

wen-coding added 2 commits October 17, 2024 22:10

Add send_and_receive_heaviest_fork and test.

e18e8f7

Make the coordinator print stats every 10 seconds.

6772287

wen-coding changed the title ~~wen_restart: Change HeaviestFork stage to a leader based approach.~~ wen_restart: correctly handle HeaviestFork received from the coordinator. Oct 18, 2024

wen-coding marked this pull request as ready for review October 18, 2024 06:33

carllin reviewed Oct 19, 2024


Rename variables and add comments.

a34c297

wen-coding force-pushed the wen_restart_leader_for_heaviest_fork branch from 79159b2 to a34c297 Compare October 29, 2024 17:36

wen-coding added 3 commits October 29, 2024 10:50

Rename final_restart_slot/hash with my_heaviest_fork_slot/hash

be7c1ed

flush_push_queue before waiting to speed things up.

e7efe5c

Merge branch 'master' into wen_restart_leader_for_heaviest_fork

3fc5a4b

carllin approved these changes Nov 1, 2024


wen-coding merged commit d27761f into anza-xyz:master Nov 1, 2024
40 checks passed

wen-coding deleted the wen_restart_leader_for_heaviest_fork branch November 1, 2024 17:38
