-
Notifications
You must be signed in to change notification settings - Fork 302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wen_restart: Make validator restart and wait for supermajority after … #1335
wen_restart: Make validator restart and wait for supermajority after … #1335
Conversation
…Phase One completes.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1335 +/- ##
========================================
Coverage 82.8% 82.8%
========================================
Files 867 867
Lines 368897 369158 +261
========================================
+ Hits 305591 305910 +319
+ Misses 63306 63248 -58 |
wen-restart/src/wen_restart.rs
Outdated
|
||
pub fn wen_restart_wait_for_supermajority_bank_id( | ||
records_path: &Option<PathBuf>, | ||
) -> Result<(Option<Slot>, Option<Hash>)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be simplified to just Result<Option<(Slot, Hash)>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
wen-restart/src/wen_restart.rs
Outdated
} | ||
} | ||
|
||
pub fn wen_restart_wait_for_supermajority_bank_id( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think bank_id
has a different connotation, because there's an actually an id field in Bank
.
wen_restart_wait_for_supermajority_slot_hash()
seems more apt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
core/src/validator.rs
Outdated
@@ -2383,8 +2401,16 @@ fn wait_for_supermajority( | |||
sleep(Duration::new(1, 0)); | |||
} | |||
rpc_override_health_check.store(false, Ordering::Relaxed); | |||
if config.wait_for_supermajority.is_none() { | |||
if let Some(proto_path) = &config.wen_restart_proto_path { | |||
if let Err(e) = wen_restart_mark_done(proto_path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so in order to trigger another wen restart in the future, operators will have to manually delete/move this path?
Seems like we should make this smoother to automatically handle the restart process for another wen restart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, if we need two back to back wen_restart in the future, I'd be really worried.
This is similar to --wait_for_supermajority, you have to specify --wen_restart on command line with your desired proto path. So we can suggest that you put date in the proto path, so you never have to delete the file. After wen_restart finishes correctly you should remove --wen_restart from your command line, as you do in --wait_for_supermajority. You can keep it there forever, the proto path will have state DONE so it will do nothing. During next --wen_restart we just use a new path with different dates so we don't mix data from two wen_restarts together.
core/src/validator.rs
Outdated
return Err( | ||
"wen_restart phase one completed, will restart to wait for supermajority" | ||
.to_string(), | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the idea that the validator process will catch this error and auto-restart? Having to match on an exact string seems very brittle, an error type seems more robust
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, this is just FYI for worrying operators, telling them this restart is normal, you shouldn't worry, the rest will still happen automatically.
core/src/validator.rs
Outdated
let in_wen_restart = config.wen_restart_proto_path.is_some() && !waited_for_supermajority; | ||
let wen_restart_repair_slots = if in_wen_restart { | ||
let in_wen_restart_phase_one = !waited_for_supermajority | ||
&& wen_restart_is_in_phase_one(&config.wen_restart_proto_path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean after wen restart is marked done by wait_for_supermajority
the operator will manually have to delete/move the wen_restart_proto_path
in order to trigger another wen restart?
Might be good at the start of the validator to immediately exit the validator if the progress path is marked as Done
already, then i think this check is unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, the rest is automatic, we are just sharing code path with --wait_for_supermajority so we don't need to write the code again.
We proceed when the progress path is marked as DONE so if some operators are too lazy and don't remove --wen_restart from command line, their validators will still behave normally, so we don't trigger a cluster-wide validator deaths when enough people don't remove --wen_restart after wen restart is done.
core/src/validator.rs
Outdated
None => Ok(false), | ||
// wait_for_supermajority can be triggered either through commandline | ||
// or when wen_restart is in WAIT_FOR_SUPERMAJORITY mode. | ||
let (wait_for_supermajority, expected_bank_hash) = match config.wait_for_supermajority { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're expecting the validator process to auto restart after phase 1, seems reasonable to just have that top level logic pass in wait_for_supermajority
instead of having to detect it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean, there is not other top level logic, it's just --wen_restart wen_restart_proto_path controlling the two phases. We don't want operators to wake up at night to change command line args.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought about it more and I think your suggestion makes sense, changed.
I think we need to think through how we imagine the operator will handle the signals from/during the wen restart, and which portion will be automated/require manual intervention. This PR right now I think implies
Might be good to take a call with @t-nelson to hammer out these details and then update the docs |
We talked about this before. The problem is that I changed repair and stopped vote and some other logic already at this point. You and I chatted with Trent about the possibility of continuing without a hard restart in the middle, the conclusion is that it's too brittle, and we rather want a hard restart to reset everything.
Well, if you want another wen restart, you should manually change the proto path because now we said starting wen_restart is an intentional choice by operators. We can put the date in proto path, I would be really worried if we need two wen_restarts in the same day.
|
…fter wen_restart phase one.
The dependencies of this PR have been merged, this one is now ready for another look, thanks. |
anza-xyz#1335) * wen_restart: Make validator restart and wait for supermajority after Phase One completes. * Fix cargo sort error. * Adding two unittests. * Change function name and return type. * Don't change wait_for_supermajority, try to re-initialzie validator after wen_restart phase one. * Fix a bad merge and fix test. * Make validator exit if wen_restart is finished. * Remove unnecessary dependencies from Cargo.toml * Remove unused function. * Fix tests to check OK() is returned when in DONE state. * Add wen_restart_proto_path if specified in command line. * Fix command line name for wen_restart. * Log how much stake is required to exit. * Don't double count my own RestartLastVotedForkSlots. * Ignore HeaviestFork from myself. * Also send out HeaviestForkAggregate if I saw supermajority for the first time. * Forbid replay if wen_restart is in progress. * We still need the new bank part. * Try not grabbing a big write lock on bankforks. * Should read GenerateSnapshot in initialize as well. * Skip all replay stages while in wen_restart. * Root banks if necessary to send EAH request. * Root banks every 100 slot. * Do not start replay thread if in wen_restart. * Do not need to check in_wen_restart inside replay_stage any more. * Fix failed merge. * Make linter happy. * Fix the bad merge after switching to anyhow. * Remove unused code. * Fix bad merge. * Remove unnecessary clone. * Remove unused map_error. * No need to specify snapshot_path in restart commandline. * Small fixes. * Split the enabling of wen_restart into another PR.
After wen_restart completes: