Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wen_restart: Make validator restart and wait for supermajority after … #1335

Merged

Conversation

wen-coding
Copy link

@wen-coding wen-coding commented May 14, 2024

After wen_restart completes:

  • Change state to Done and write to the snapshot file on disk
  • Prompt operators to change the argument from --wen_restart to --wait_for_supermajority with correct slot/hash/shred_version
  • Make validator restart with exit code 200

@wen-coding wen-coding self-assigned this May 14, 2024
@wen-coding wen-coding requested review from carllin and AshwinSekar May 14, 2024 06:59
@codecov-commenter
Copy link

codecov-commenter commented May 14, 2024

Codecov Report

Attention: Patch coverage is 95.11401% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 82.8%. Comparing base (d1afd78) to head (7d83c77).
Report is 7 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #1335    +/-   ##
========================================
  Coverage    82.8%    82.8%            
========================================
  Files         867      867            
  Lines      368897   369158   +261     
========================================
+ Hits       305591   305910   +319     
+ Misses      63306    63248    -58     


pub fn wen_restart_wait_for_supermajority_bank_id(
records_path: &Option<PathBuf>,
) -> Result<(Option<Slot>, Option<Hash>)> {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be simplified to just Result<Option<(Slot, Hash)>>

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}

pub fn wen_restart_wait_for_supermajority_bank_id(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think bank_id has a different connotation, because there's an actually an id field in Bank.

wen_restart_wait_for_supermajority_slot_hash() seems more apt

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -2383,8 +2401,16 @@ fn wait_for_supermajority(
sleep(Duration::new(1, 0));
}
rpc_override_health_check.store(false, Ordering::Relaxed);
if config.wait_for_supermajority.is_none() {
if let Some(proto_path) = &config.wen_restart_proto_path {
if let Err(e) = wen_restart_mark_done(proto_path) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so in order to trigger another wen restart in the future, operators will have to manually delete/move this path?

Seems like we should make this smoother to automatically handle the restart process for another wen restart

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if we need two back to back wen_restart in the future, I'd be really worried.

This is similar to --wait_for_supermajority, you have to specify --wen_restart on command line with your desired proto path. So we can suggest that you put date in the proto path, so you never have to delete the file. After wen_restart finishes correctly you should remove --wen_restart from your command line, as you do in --wait_for_supermajority. You can keep it there forever, the proto path will have state DONE so it will do nothing. During next --wen_restart we just use a new path with different dates so we don't mix data from two wen_restarts together.

Comment on lines 1362 to 1365
return Err(
"wen_restart phase one completed, will restart to wait for supermajority"
.to_string(),
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea that the validator process will catch this error and auto-restart? Having to match on an exact string seems very brittle, an error type seems more robust

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, this is just FYI for worrying operators, telling them this restart is normal, you shouldn't worry, the rest will still happen automatically.

let in_wen_restart = config.wen_restart_proto_path.is_some() && !waited_for_supermajority;
let wen_restart_repair_slots = if in_wen_restart {
let in_wen_restart_phase_one = !waited_for_supermajority
&& wen_restart_is_in_phase_one(&config.wen_restart_proto_path);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean after wen restart is marked done by wait_for_supermajority the operator will manually have to delete/move the wen_restart_proto_path in order to trigger another wen restart?

Might be good at the start of the validator to immediately exit the validator if the progress path is marked as Done already, then i think this check is unnecessary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, the rest is automatic, we are just sharing code path with --wait_for_supermajority so we don't need to write the code again.

We proceed when the progress path is marked as DONE so if some operators are too lazy and don't remove --wen_restart from command line, their validators will still behave normally, so we don't trigger a cluster-wide validator deaths when enough people don't remove --wen_restart after wen restart is done.

None => Ok(false),
// wait_for_supermajority can be triggered either through commandline
// or when wen_restart is in WAIT_FOR_SUPERMAJORITY mode.
let (wait_for_supermajority, expected_bank_hash) = match config.wait_for_supermajority {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're expecting the validator process to auto restart after phase 1, seems reasonable to just have that top level logic pass in wait_for_supermajority instead of having to detect it here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean, there is not other top level logic, it's just --wen_restart wen_restart_proto_path controlling the two phases. We don't want operators to wake up at night to change command line args.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought about it more and I think your suggestion makes sense, changed.

@carllin
Copy link

carllin commented May 28, 2024

I think we need to think through how we imagine the operator will handle the signals from/during the wen restart, and which portion will be automated/require manual intervention.

This PR right now I think implies

  1. Operator has to manually change validator config to trigger a wen restart (MANUAL)
  2. After the wen restart protocol runs and figures the (slot, hash) and generates a snapshot, it marks the wen restart progress as Status::WAIT_FOR_SUPERMAJORITY. the validator will send an error that will be caught by the validator process and automatically restart the validator. (AUTOMATIC).
  3. Restarted validator will read the wen restart config file, see WAIT_FOR_SUPERMAJORITY and enter wait_for_supermajority logic. Not convinced here we need this, seems like in 2 we can just automatically pass in the wait_for_supermajority with the right slot and hash (AUTOMATIC)
  4. After successful wait_for_supermajority, the wen restart config file is marked Status::DONE and is essentially ignored. In order to do another wen restart, the operator will have to move/delete this config file (MANUAL). Seems like there should be a better way to handle this part.

Might be good to take a call with @t-nelson to hammer out these details and then update the docs

@wen-coding
Copy link
Author

I think we need to think through how we imagine the operator will handle the signals from/during the wen restart, and which portion will be automated/require manual intervention.

This PR right now I think implies

  1. Operator has to manually change validator config to trigger a wen restart (MANUAL)
  2. After the wen restart protocol runs and figures the (slot, hash) and generates a snapshot, it marks the wen restart progress as Status::WAIT_FOR_SUPERMAJORITY. the validator will send an error that will be caught by the validator process and automatically restart the validator. (AUTOMATIC).
  3. Restarted validator will read the wen restart config file, see WAIT_FOR_SUPERMAJORITY and enter wait_for_supermajority logic. Not convinced here we need this, seems like in 2 we can just automatically pass in the wait_for_supermajority with the right slot and hash (AUTOMATIC)

We talked about this before. The problem is that I changed repair and stopped vote and some other logic already at this point. You and I chatted with Trent about the possibility of continuing without a hard restart in the middle, the conclusion is that it's too brittle, and we rather want a hard restart to reset everything.

  1. After successful wait_for_supermajority, the wen restart config file is marked Status::DONE and is essentially ignored. In order to do another wen restart, the operator will have to move/delete this config file (MANUAL). Seems like there should be a better way to handle this part.

Well, if you want another wen restart, you should manually change the proto path because now we said starting wen_restart is an intentional choice by operators. We can put the date in proto path, I would be really worried if we need two wen_restarts in the same day.

Might be good to take a call with @t-nelson to hammer out these details and then update the docs

@wen-coding wen-coding requested a review from t-nelson June 4, 2024 18:17
@wen-coding wen-coding marked this pull request as draft June 4, 2024 21:45
@anza-xyz anza-xyz deleted a comment Jun 8, 2024
@wen-coding wen-coding marked this pull request as ready for review August 14, 2024 22:50
@wen-coding
Copy link
Author

The dependencies of this PR have been merged, this one is now ready for another look, thanks.

@wen-coding wen-coding requested a review from carllin August 19, 2024 16:02
@wen-coding wen-coding merged commit c842b0d into anza-xyz:master Aug 25, 2024
40 checks passed
@wen-coding wen-coding deleted the wen_restart_wait_for_supermajority branch August 25, 2024 00:38
ray-kast pushed a commit to abklabs/agave that referenced this pull request Nov 27, 2024
anza-xyz#1335)

* wen_restart: Make validator restart and wait for supermajority after Phase One completes.

* Fix cargo sort error.

* Adding two unittests.

* Change function name and return type.

* Don't change wait_for_supermajority, try to re-initialzie validator after wen_restart phase one.

* Fix a bad merge and fix test.

* Make validator exit if wen_restart is finished.

* Remove unnecessary dependencies from Cargo.toml

* Remove unused function.

* Fix tests to check OK() is returned when in DONE state.

* Add wen_restart_proto_path if specified in command line.

* Fix command line name for wen_restart.

* Log how much stake is required to exit.

* Don't double count my own RestartLastVotedForkSlots.

* Ignore HeaviestFork from myself.

* Also send out HeaviestForkAggregate if I saw supermajority for the first
time.

* Forbid replay if wen_restart is in progress.

* We still need the new bank part.

* Try not grabbing a big write lock on bankforks.

* Should read GenerateSnapshot in initialize as well.

* Skip all replay stages while in wen_restart.

* Root banks if necessary to send EAH request.

* Root banks every 100 slot.

* Do not start replay thread if in wen_restart.

* Do not need to check in_wen_restart inside replay_stage any more.

* Fix failed merge.

* Make linter happy.

* Fix the bad merge after switching to anyhow.

* Remove unused code.

* Fix bad merge.

* Remove unnecessary clone.

* Remove unused map_error.

* No need to specify snapshot_path in restart commandline.

* Small fixes.

* Split the enabling of wen_restart into another PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants