Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wen_restart: correctly handle HeaviestFork received from the coordinator. #2923

Merged

Conversation

wen-coding
Copy link

@wen-coding wen-coding commented Sep 13, 2024

  • After receiving HeaviestFork from the coordinator, repair missing slots and verify coordinator's choice
  • Send HeaviestFork no matter what the non-coordinator decides.
  • Make the coordinator print current stats every 10 seconds while aggregating HeaviestFork from others

@wen-coding wen-coding marked this pull request as draft September 13, 2024 17:10
@wen-coding wen-coding marked this pull request as ready for review September 17, 2024 07:38
@wen-coding wen-coding self-assigned this Sep 17, 2024
@wen-coding wen-coding marked this pull request as draft September 24, 2024 20:47
@wen-coding wen-coding changed the title wen_restart: Change HeaviestFork stage to a leader based approach. wen_restart: correctly handle HeaviestFork received from the coordinator. Oct 18, 2024
@wen-coding wen-coding marked this pull request as ready for review October 18, 2024 06:33
wen-restart/src/wen_restart.rs Show resolved Hide resolved
Comment on lines 940 to 941
// Wait for 10 seconds so the heaviest fork gets out.
sleep(Duration::from_secs(10));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems hacky, should probably either wait the actual gossip push interval or better, directly call something like cluster_info.flush_push_queue() to guarantee a push

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to one gossip push interval.

Copy link

@carllin carllin Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use flush_push_queue directly, timing based is less dependable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chatted with Greg, flush_push_queue doesn't guarantee messages being sent out, it only moves messages from local queue to the crds table. It might be possible to expose a new interface to check things have actually be sent out, but let's do that in another PR maybe, so we can start invalidator testing for this one earlier.

},
)?;
WenRestartProgressInternalState::HeaviestFork {
new_root_slot: slot,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side note: just realized it's kind of confusing these are called root slots when they're not really root slots

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to final_restart_slot, does this sound better?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep it consistent and rename it to my_heaviest_fork_slot

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

wen-restart/src/wen_restart.rs Outdated Show resolved Hide resolved
.into());
}
let my_bankhash = if !slots.is_empty() {
find_bankhash_of_heaviest_fork(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens here if the coordinator_heaviest_slot is accurate, our heaviest slot is accurate, but we've repaired blocks with the wrong ancestors, will find_bankhash_of_heaviest_fork just error out on replay with a dead slot?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find_bankhash_of_heaviest_fork blindly follows the given slots to find the bankhash. If the given slots form a valid ancestor chain (checking it's frozen), a bankhash will be returned, then we can find out the bankhash is actually different.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok yeah looks like it will just error out in process_single_slot with a replay error if the ancestor chain is wrong

@wen-coding wen-coding force-pushed the wen_restart_leader_for_heaviest_fork branch from 79159b2 to a34c297 Compare October 29, 2024 17:36
@wen-coding wen-coding merged commit d27761f into anza-xyz:master Nov 1, 2024
40 checks passed
@wen-coding wen-coding deleted the wen_restart_leader_for_heaviest_fork branch November 1, 2024 17:38
kevinheavey pushed a commit to kevinheavey/agave that referenced this pull request Nov 2, 2024
…tor. (anza-xyz#2923)

* wen_restart: Change HeaviestFork stage to a leader based approach.

* Fix logic to calculate slots to repair, add more logs.

* Let the leader generate snapshot and print error log before continuing to send out heaviest fork.

* Filter ancestors older than root.

* Reduce lock scope.

* Rename variables and functions.

* Make leader aggregate the heaviest fork of everyone.

* Move heaviest fork aggregation to DONE stage.

* No warning when receiving HeaviestFork from non-leader.

* Add test for receive_restart_heaviest_fork.

* Add test for repair_heaviest_fork.

* Add test for verify_leader_heaviest_fork.

* Rename wen_restart_leader to wen_restart_coordinator.

* Fix a bad merge

* Non-coordinator should use coodinator's (slot, hash) after verification.

* Remove trailing whitespace.

* Fix a bad merge

* Fix a bad merge.

* Remove unnecessary changes.

* Make coordinator select different slot in test, wait before exiting with error.

* Add send_and_receive_heaviest_fork and test.

* Make the coordinator print stats every 10 seconds.

* Rename variables and add comments.

* Rename final_restart_slot/hash with my_heaviest_fork_slot/hash

* flush_push_queue before waiting to speed things up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants