-
Notifications
You must be signed in to change notification settings - Fork 262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wen_restart: correctly handle HeaviestFork received from the coordinator. #2923
wen_restart: correctly handle HeaviestFork received from the coordinator. #2923
Conversation
wen-coding
commented
Sep 13, 2024
•
edited
Loading
edited
- After receiving HeaviestFork from the coordinator, repair missing slots and verify coordinator's choice
- Send HeaviestFork no matter what the non-coordinator decides.
- Make the coordinator print current stats every 10 seconds while aggregating HeaviestFork from others
…g to send out heaviest fork.
wen-restart/src/wen_restart.rs
Outdated
// Wait for 10 seconds so the heaviest fork gets out. | ||
sleep(Duration::from_secs(10)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems hacky, should probably either wait the actual gossip push interval or better, directly call something like cluster_info.flush_push_queue()
to guarantee a push
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to one gossip push interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use flush_push_queue directly, timing based is less dependable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chatted with Greg, flush_push_queue doesn't guarantee messages being sent out, it only moves messages from local queue to the crds table. It might be possible to expose a new interface to check things have actually be sent out, but let's do that in another PR maybe, so we can start invalidator testing for this one earlier.
wen-restart/src/wen_restart.rs
Outdated
}, | ||
)?; | ||
WenRestartProgressInternalState::HeaviestFork { | ||
new_root_slot: slot, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side note: just realized it's kind of confusing these are called root slots when they're not really root slots
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to final_restart_slot, does this sound better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's keep it consistent and rename it to my_heaviest_fork_slot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
.into()); | ||
} | ||
let my_bankhash = if !slots.is_empty() { | ||
find_bankhash_of_heaviest_fork( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens here if the coordinator_heaviest_slot
is accurate, our heaviest slot is accurate, but we've repaired blocks with the wrong ancestors, will find_bankhash_of_heaviest_fork
just error out on replay with a dead slot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
find_bankhash_of_heaviest_fork
blindly follows the given slots to find the bankhash. If the given slots form a valid ancestor chain (checking it's frozen), a bankhash will be returned, then we can find out the bankhash is actually different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok yeah looks like it will just error out in process_single_slot
with a replay error if the ancestor chain is wrong
79159b2
to
a34c297
Compare
…tor. (anza-xyz#2923) * wen_restart: Change HeaviestFork stage to a leader based approach. * Fix logic to calculate slots to repair, add more logs. * Let the leader generate snapshot and print error log before continuing to send out heaviest fork. * Filter ancestors older than root. * Reduce lock scope. * Rename variables and functions. * Make leader aggregate the heaviest fork of everyone. * Move heaviest fork aggregation to DONE stage. * No warning when receiving HeaviestFork from non-leader. * Add test for receive_restart_heaviest_fork. * Add test for repair_heaviest_fork. * Add test for verify_leader_heaviest_fork. * Rename wen_restart_leader to wen_restart_coordinator. * Fix a bad merge * Non-coordinator should use coodinator's (slot, hash) after verification. * Remove trailing whitespace. * Fix a bad merge * Fix a bad merge. * Remove unnecessary changes. * Make coordinator select different slot in test, wait before exiting with error. * Add send_and_receive_heaviest_fork and test. * Make the coordinator print stats every 10 seconds. * Rename variables and add comments. * Rename final_restart_slot/hash with my_heaviest_fork_slot/hash * flush_push_queue before waiting to speed things up.