
wen_restart: Find the bank hash of the heaviest fork, replay if necessary. #420

Merged

Conversation

wen-coding

Find the bank hash for the selected heaviest fork. Replay blocks if necessary and report error if any slot is missing.

@wen-coding wen-coding self-assigned this Mar 25, 2024
@codecov-commenter

codecov-commenter commented Mar 25, 2024

Codecov Report

Attention: Patch coverage is 93.12500% with 11 lines in your changes missing coverage. Please review.

Project coverage is 81.8%. Comparing base (f6e02e6) to head (15fdef2).
Report is 22 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master     #420    +/-   ##
========================================
  Coverage    81.8%    81.8%            
========================================
  Files         848      848            
  Lines      229160   229292   +132     
========================================
+ Hits       187547   187681   +134     
+ Misses      41613    41611     -2     

assert_eq!(bank_forks.read().unwrap().banks().len(), 1);
if !opts.skip_bankforks_checks_for_wen_restart {
    assert_eq!(bank_forks.read().unwrap().banks().len(), 1);
}
@carllin Mar 27, 2024
This doesn't seem like the right entrypoint for the wen restart logic, hence the need to hack this flag.

Among other things, process_blockstore_from_root() will play every possible fork in blockstore, we just want to play the slots in a given list (the ones we repaired).

I think we should have a separate function that just calls process_single_slot() on every one of the repaired slots, should be much cleaner and simpler than repurposing this function
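As a rough illustration of that suggestion, the control flow could look like the sketch below. This is not the actual agave API: `Blockstore`, `ReplayError`, and `replay_slot` here are hypothetical stand-ins for the real types, and in the real code each slot would go through something like `process_single_slot()`.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum ReplayError {
    BlockNotFound(u64),
}

// Stand-in for the real blockstore: slot -> block payload.
struct Blockstore {
    blocks: HashMap<u64, String>,
}

impl Blockstore {
    // Replay one slot if its block is present; error out if the slot is
    // missing (mirroring one process_single_slot() call per repaired slot).
    fn replay_slot(&self, slot: u64) -> Result<(), ReplayError> {
        match self.blocks.get(&slot) {
            Some(_entries) => Ok(()), // real code would execute the entries
            None => Err(ReplayError::BlockNotFound(slot)),
        }
    }
}

// Walk only the repaired slots, in order, and fail fast on a missing slot,
// instead of replaying every fork in the blockstore.
fn replay_heaviest_fork(
    blockstore: &Blockstore,
    repaired_slots: &[u64],
) -> Result<(), ReplayError> {
    for &slot in repaired_slots {
        blockstore.replay_slot(slot)?;
    }
    Ok(())
}
```

The point of the sketch is the shape of the loop: a flat pass over a known slot list, rather than the fork-walking that process_blockstore_from_root() does.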

wen-coding (Author)
Hmm, re-reading Steve's comments, this will actually cause a problem: it may insert blocks that are already present. I'll change it, hold on.

wen-coding (Author)
Re-checked; using process_single_slot is indeed a good idea, and I've changed it. Ready for another look.

Comment on lines 717 to 719
// Normally when we process blocks in blockstore we start from a snapshot where the bankforks only holds
// one bank. But during wen_restart we may need to replay the new blocks repaired, and the bankforks may
// hold any number of frozen banks.

Isn't the find heaviest bank logic only run a single time after all the forks are already repaired, and we pass in an empty bank forks?

wen-coding (Author)

The bank forks is reconstructed before entering wait_for_wenrestart; it's actually the result of process_blockstore_from_root() upon restart, so at this point the bank forks should already be populated with the banks on the validator.

wen-restart/src/wen_restart.rs (outdated, resolved)
@wen-coding wen-coding marked this pull request as draft March 27, 2024 23:14
@wen-coding wen-coding marked this pull request as ready for review March 29, 2024 02:37
@wen-coding wen-coding requested a review from carllin April 1, 2024 17:36
@steviez left a comment

I know you asked me about the blockstore_processor stuff specifically. With the current state of things, I think you found the right entrypoint to utilize to strike the balance between code re-use and having separate entrypoints for different purposes (ie not trying to overload process_blockstore_from_root()).

// When counting Heaviest Fork, only count those with no less than
// 67% - 5% - (100% - active_stake) = active_stake - 38% stake.
const HEAVIEST_FORK_THRESHOLD_DELTA: f64 = 0.38;

As someone who isn't actively working on consensus, I think this comment could benefit from an explanation of where the various terms (67 & 5) come from. Also, are these constants declared / duplicated anywhere else in the codebase?
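For what it's worth, the arithmetic in the quoted comment can be checked directly. In this sketch (both helper names are made up for illustration, and the reading of 67% as the supermajority target and 5% as a safety margin is the very assumption the reviewer is asking to have documented), the folded constant and the long form from the comment agree:

```rust
// The comment's derivation: 67% - 5% - (100% - active_stake)
// simplifies to active_stake - 38%.
const HEAVIEST_FORK_THRESHOLD_DELTA: f64 = 0.38;

// Threshold expressed the way the constant is meant to be used.
fn threshold(active_stake: f64) -> f64 {
    active_stake - HEAVIEST_FORK_THRESHOLD_DELTA
}

// The long form from the comment, for comparison.
fn threshold_long_form(active_stake: f64) -> f64 {
    0.67 - 0.05 - (1.0 - active_stake)
}
```

For example, with 80% active stake both forms give a 42% stake threshold.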


let leader_schedule_cache = LeaderScheduleCache::new_from_bank(&root_bank);
let replay_tx_thread_pool = rayon::ThreadPoolBuilder::new()
.num_threads(1)

I think you probably want to bump this thread count up:

  • I believe this will be called at a time when the system is otherwise not doing much (ie regular block production has stopped), so a higher degree of parallelism is fine
  • For replaying full blocks, there is definite benefit to multiple threads. All of the entry verification (PoH and tx verification) can occur in parallel, and blockstore_processor will attempt to replay tx's in parallel as well

wen-coding (Author)

That's fair. I've changed it to the default number of threads instead, which should equal the number of logical CPUs.
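For reference, rayon's ThreadPoolBuilder sizes its pool from the machine's available parallelism when no thread count is configured (barring a RAYON_NUM_THREADS override); the standard library exposes the same figure directly. A minimal stdlib-only sketch:

```rust
use std::thread;

// Equivalent of the default rayon pool size:
// rayon::ThreadPoolBuilder::new().build() (no num_threads() call) uses the
// number of logical CPUs, which std::thread::available_parallelism() reports.
fn default_replay_thread_count() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}
```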

@@ -62,6 +70,9 @@ impl std::fmt::Display for WenRestartError {
WenRestartError::BlockNotFound(slot) => {
write!(f, "Block not found: {}", slot)
}
WenRestartError::BlockNotFrozenAfterReplay(slot, err) => {

Outside the scope of this PR, but mentioning it because I see it: this error type looks like a good candidate for thiserror. That crate lets you avoid writing the custom formatter here:
https://docs.rs/thiserror/latest/thiserror/

Here is an example of an error type that I converted to use thiserror fairly recently:

#[derive(Error, Debug)]
pub enum BankForksUtilsError {
    #[error("accounts path(s) not present when booting from snapshot")]
    AccountPathsNotPresent,
    #[error(
        "failed to load bank: {source}, full snapshot archive: {full_snapshot_archive}, \
         incremental snapshot archive: {incremental_snapshot_archive}"
    )]
    BankFromSnapshotsArchive {
        source: snapshot_utils::SnapshotError,
        full_snapshot_archive: String,
        incremental_snapshot_archive: String,
    },
    #[error(
        "there is no local state to startup from. \
         Ensure --{flag} is NOT set to \"{value}\" and restart"
    )]
    NoBankSnapshotDirectory { flag: String, value: String },
    #[error("failed to load bank: {source}, snapshot: {path}")]
    BankFromSnapshotsDirectory {
        source: snapshot_utils::SnapshotError,
        path: PathBuf,
    },
    #[error("failed to process blockstore from root: {0}")]
    ProcessBlockstoreFromRoot(#[source] BlockstoreProcessorError),
}

wen-coding (Author)

We did thiserror a while ago, but decided to switch to anyhow because we wanted backtraces:
solana-labs#33892 (comment)

Could I do thiserror and enable backtrace?


Oh neat, I didn't know about anyhow's backtracing functionality, and didn't really notice that's what's in use in this crate.

Could I do thiserror and enable backtrace?

Good question and I'm not sure, would need to dig around or try a quick experiment

wen-restart/src/wen_restart.rs (resolved)
wen-restart/src/wen_restart.rs (outdated, resolved)
if exit.load(Ordering::Relaxed) {
return Err(WenRestartError::Exiting.into());
}
- if let Ok(Some(block_meta)) = blockstore.meta(slot) {
+ if let Ok(Some(block_meta)) = blockstore.meta(*slot) {

Not sure if it is an unrecoverable error, but since we're already pulling the SlotMeta out, we could also check if the slot is full. Unless we'll be actively repairing / inserting shreds at this point in time, but this could be one of the causes of a bank not being frozen after replay below.
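A sketch of the "is the slot full" idea. The real SlotMeta in the ledger crate has more fields and its own is_full() implementation, so treat this struct and its field layout as an illustrative stand-in only:

```rust
// Illustrative stand-in for blockstore's SlotMeta: a slot is "full" once
// every shred up to and including the last one has been ingested.
struct SlotMeta {
    consumed: u64,           // count of contiguous shreds received from index 0
    last_index: Option<u64>, // index of the slot's final shred, once known
}

impl SlotMeta {
    fn is_full(&self) -> bool {
        match self.last_index {
            // All shreds 0..=last present exactly when consumed == last + 1.
            Some(last) => self.consumed == last + 1,
            // Last shred not yet seen: the slot cannot be full.
            None => false,
        }
    }
}
```

Checking this before replay would turn a half-repaired slot into an explicit error instead of a bank that never freezes.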

wen-coding (Author)

Changed.

let recyclers = VerifyRecyclers::default();
let mut timing = ExecuteTimings::default();
let opts = ProcessOptions::default();
// Grab a big lock because we are the only person touching bankforks now.

big? I guess big in the sense that we're holding it for a while?

wen-coding (Author)

Changed.

wen-restart/src/wen_restart.rs (resolved)
@wen-coding (Author)

> I know you asked me about the blockstore_processor stuff specifically. With the current state of things, I think you found the right entrypoint to utilize to strike the balance between code re-use and having separate entrypoints for different purposes (ie not trying to overload process_blockstore_from_root()).
>
>     // When counting Heaviest Fork, only count those with no less than
>     // 67% - 5% - (100% - active_stake) = active_stake - 38% stake.
>     const HEAVIEST_FORK_THRESHOLD_DELTA: f64 = 0.38;
>
> As someone who isn't actively working on consensus, I think this comment could benefit from an explanation of where the various terms (67 & 5) come from. Also, are these constants declared / duplicated anywhere else in the codebase?

Changed.

wen-restart/src/wen_restart.rs (outdated, resolved)
wen-restart/src/wen_restart.rs (outdated, resolved)
Comment on lines 363 to 381
for slot in &slots {
    if exit.load(Ordering::Relaxed) {
        return Err(WenRestartError::Exiting.into());
    }
    if let Some(bank) = my_bankforks.get(*slot) {
        if !bank.is_frozen() {
            return Err(WenRestartError::BlockNotFrozenAfterReplay(*slot, None).into());
        }
        if bank.parent_slot() != parent_slot {
            return Err(WenRestartError::BlockNotLinkedToExpectedParent(
                *slot,
                Some(bank.parent_slot()),
                parent_slot,
            )
            .into());
        }
    }
    parent_slot = *slot;
}

This should be unnecessary given the find_heaviest_fork() checks above.

wen-coding (Author)

Theoretically, some dump-and-repair logic could run in between and silently change things from under us?


I thought this phase was only run after all slots had already been repaired?

wen-coding (Author)

Yes, but could this happen?
repaired all -> entering find heaviest fork phase -> somehow a duplicate notification comes -> dump and repair -> oops, some repaired block disappeared from under us.


We should avoid duplicate notifications here; I think that would need to be handled outside this protocol.

wen-coding (Author)

Okay, removed this check; we can maybe stop duplicate notifications in another PR.

@wen-coding wen-coding merged commit 312f725 into anza-xyz:master Apr 7, 2024
49 checks passed
@wen-coding wen-coding deleted the wen_restart_get_heaviest_fork_hash branch April 8, 2024 16:19