wen_restart: Find the bank hash of the heaviest fork, replay if necessary. #420
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff            @@
##           master      #420   +/-   ##
=========================================
  Coverage    81.8%     81.8%
=========================================
  Files         848       848
  Lines      229160    229292   +132
=========================================
+ Hits       187547    187681   +134
+ Misses      41613     41611     -2
ledger/src/blockstore_processor.rs
Outdated
assert_eq!(bank_forks.read().unwrap().banks().len(), 1);
if !opts.skip_bankforks_checks_for_wen_restart {
    assert_eq!(bank_forks.read().unwrap().banks().len(), 1);
}
This doesn't seem like the right entrypoint for the wen restart logic, hence the need to hack in this flag.
Among other things, process_blockstore_from_root() will replay every possible fork in the blockstore, whereas we only want to replay the slots in a given list (the ones we repaired).
I think we should have a separate function that just calls process_single_slot() on every one of the repaired slots; that should be much cleaner and simpler than repurposing this function.
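As a rough sketch of that suggestion (not the actual implementation: replay_one_slot below is a hypothetical wrapper, and the real process_single_slot() in blockstore_processor takes more arguments), the replay loop could look something like this:

// Hypothetical sketch: replay only the listed (repaired) slots instead of
// letting process_blockstore_from_root() walk every fork in the blockstore.
// `replay_one_slot` stands in for a thin wrapper around process_single_slot().
fn replay_repaired_slots(
    repaired_slots: &[Slot],
    bank_forks: &RwLock<BankForks>,
) -> Result<(), WenRestartError> {
    for slot in repaired_slots {
        // Skip slots whose banks were already replayed and frozen before the restart.
        if bank_forks
            .read()
            .unwrap()
            .get(*slot)
            .map_or(false, |bank| bank.is_frozen())
        {
            continue;
        }
        replay_one_slot(*slot, bank_forks)?;
    }
    Ok(())
}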
Hmm, re-reading Steve's comments, this will actually cause a problem: it may insert blocks that are already present. I'll change it, hold on.
Re-checked; using process_single_slot is actually a good idea, so I've changed it. Ready for another look.
ledger/src/blockstore_processor.rs
Outdated
// Normally when we process blocks in blockstore we start from a snapshot, where the bankforks only holds
// one bank. But during wen_restart we may need to replay the newly repaired blocks, and the bankforks may
// hold any number of frozen banks.
Isn't the find-heaviest-bank logic only run a single time, after all the forks are already repaired, and don't we pass in an empty bank forks?
The bank forks is reconstructed before entering wait_for_wen_restart; it's actually the result of process_blockstore_from_root() upon restart, so at this point the bank forks should already be populated with the banks present on the validator.
…latter may re-insert banks already frozen.
I know you asked me about the blockstore_processor stuff specifically. With the current state of things, I think you found the right entrypoint to strike the balance between code re-use and having separate entrypoints for different purposes (i.e. not trying to overload process_blockstore_from_root()).
// When counting Heaviest Fork, only count those with no less than
// 67% - 5% - (100% - active_stake) = active_stake - 38% stake.
const HEAVIEST_FORK_THRESHOLD_DELTA: f64 = 0.38;
As someone who isn't actively working on consensus, I think this comment could benefit from an explanation of where the various terms (67 & 5) come from. Also, are these constants declared / duplicated anywhere else in the codebase?
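As a side note, the arithmetic in the quoted comment can be restated in a standalone snippet (this is just an illustration, assuming all stake values are fractions of total stake; it is not code from the PR):

// Sketch of the arithmetic in the quoted comment, with stake expressed as a
// fraction of total stake: 0.67 - 0.05 - (1.0 - active_stake) simplifies to
// active_stake - 0.38.
const HEAVIEST_FORK_THRESHOLD_DELTA: f64 = 0.38;

fn heaviest_fork_threshold(active_stake: f64) -> f64 {
    active_stake - HEAVIEST_FORK_THRESHOLD_DELTA
}

fn main() {
    // Example: with 80% of stake active, a fork needs at least 42% of stake to be counted.
    let threshold = heaviest_fork_threshold(0.80);
    assert!((threshold - 0.42).abs() < 1e-9);
    println!("threshold at 80% active stake: {threshold:.2}");
}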
wen-restart/src/wen_restart.rs
Outdated
let leader_schedule_cache = LeaderScheduleCache::new_from_bank(&root_bank);
let replay_tx_thread_pool = rayon::ThreadPoolBuilder::new()
    .num_threads(1)
I think you probably want to bump this thread count up:
- I believe this will get called at a time when the system is otherwise not doing much (i.e. regular block production has stopped), so a higher degree of parallelism is fine.
- For replaying full blocks, there is a definite benefit to multiple threads. All of the entry verification (PoH and tx verification) can occur in parallel, and blockstore_processor will attempt to replay transactions in parallel as well.
That's fair. I've changed it to the default number of threads instead, which should be equal to the number of logical CPUs.
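For reference, a minimal standalone sketch of a default-sized pool (the thread name is purely illustrative): when num_threads() is not called, or is set to 0, rayon sizes the pool from the number of available logical CPUs, overridable via RAYON_NUM_THREADS.

use rayon::ThreadPoolBuilder;

fn main() {
    // Without an explicit num_threads(), rayon defaults to the number of
    // available logical CPUs (or RAYON_NUM_THREADS if set).
    let replay_tx_thread_pool = ThreadPoolBuilder::new()
        .thread_name(|i| format!("replayTx{i:02}")) // illustrative name
        .build()
        .expect("new rayon thread pool");
    println!(
        "replay pool threads: {}",
        replay_tx_thread_pool.current_num_threads()
    );
}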
@@ -62,6 +70,9 @@ impl std::fmt::Display for WenRestartError {
    WenRestartError::BlockNotFound(slot) => {
        write!(f, "Block not found: {}", slot)
    }
    WenRestartError::BlockNotFrozenAfterReplay(slot, err) => {
Outside of the scope of this PR, but mentioning it because I see it: this error type looks like a good candidate for thiserror. That crate lets you avoid writing the custom formatter here:
https://docs.rs/thiserror/latest/thiserror/
Here is an example of an error type that I converted to use thiserror fairly recently:
agave/ledger/src/bank_forks_utils.rs
Lines 34 to 63 in c59143b
#[derive(Error, Debug)]
pub enum BankForksUtilsError {
    #[error("accounts path(s) not present when booting from snapshot")]
    AccountPathsNotPresent,
    #[error(
        "failed to load bank: {source}, full snapshot archive: {full_snapshot_archive}, \
         incremental snapshot archive: {incremental_snapshot_archive}"
    )]
    BankFromSnapshotsArchive {
        source: snapshot_utils::SnapshotError,
        full_snapshot_archive: String,
        incremental_snapshot_archive: String,
    },
    #[error(
        "there is no local state to startup from. \
         Ensure --{flag} is NOT set to \"{value}\" and restart"
    )]
    NoBankSnapshotDirectory { flag: String, value: String },
    #[error("failed to load bank: {source}, snapshot: {path}")]
    BankFromSnapshotsDirectory {
        source: snapshot_utils::SnapshotError,
        path: PathBuf,
    },
    #[error("failed to process blockstore from root: {0}")]
    ProcessBlockstoreFromRoot(#[source] BlockstoreProcessorError),
}
We did thiserror a while ago, but decided to switch to anyhow because we wanted backtraces:
solana-labs#33892 (comment)
Could I do thiserror and enable backtrace?
Oh neat, I didn't know about anyhow's backtracing functionality, and hadn't noticed that's what is in use in this crate.

Could I do thiserror and enable backtrace?

Good question; I'm not sure. I would need to dig around or try a quick experiment.
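For what it's worth, the two approaches can be combined: a thiserror-derived enum can still travel through anyhow::Result, and anyhow captures a std::backtrace::Backtrace at the point of conversion when RUST_BACKTRACE=1 (on Rust 1.65+). A minimal sketch with a simplified, single-variant error (not the real WenRestartError definition):

use thiserror::Error;

// Simplified stand-in for the real error type; only one variant shown.
#[derive(Error, Debug)]
pub enum WenRestartError {
    #[error("Block not found: {0}")]
    BlockNotFound(u64),
}

fn find_block(slot: u64) -> anyhow::Result<()> {
    // The backtrace is captured here, when the error converts into anyhow::Error.
    Err(WenRestartError::BlockNotFound(slot).into())
}

fn main() {
    if let Err(err) = find_block(42) {
        // With RUST_BACKTRACE=1, the Debug formatting of anyhow::Error
        // includes the captured backtrace.
        eprintln!("{err:?}");
    }
}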
if exit.load(Ordering::Relaxed) {
    return Err(WenRestartError::Exiting.into());
}
if let Ok(Some(block_meta)) = blockstore.meta(slot) {
if let Ok(Some(block_meta)) = blockstore.meta(*slot) {
Not sure if it is an unrecoverable error, but since we're already pulling the SlotMeta out, we could also check whether the slot is full. Unless we'll be actively repairing / inserting shreds at this point in time; but this could be one of the causes of a bank not being frozen after replay below.
Changed.
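For context, the added check presumably looks something like the fragment below (the BlockNotFull variant name is my guess for illustration, not necessarily what the PR uses):

// Fragment, in the context of the loop above: since the SlotMeta is already
// fetched, bail out early if the slot is not yet full.
if let Ok(Some(block_meta)) = blockstore.meta(*slot) {
    if !block_meta.is_full() {
        return Err(WenRestartError::BlockNotFull(*slot).into());
    }
    // ... continue with replaying the slot ...
}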
wen-restart/src/wen_restart.rs
Outdated
let recyclers = VerifyRecyclers::default();
let mut timing = ExecuteTimings::default();
let opts = ProcessOptions::default();
// Grab a big lock because we are the only person touching bankforks now.
Big? I guess big in the sense that we're holding it for a while?
Changed.
wen-restart/src/wen_restart.rs
Outdated
for slot in &slots {
    if exit.load(Ordering::Relaxed) {
        return Err(WenRestartError::Exiting.into());
    }
    if let Some(bank) = my_bankforks.get(*slot) {
        if !bank.is_frozen() {
            return Err(WenRestartError::BlockNotFrozenAfterReplay(*slot, None).into());
        }
        if bank.parent_slot() != parent_slot {
            return Err(WenRestartError::BlockNotLinkedToExpectedParent(
                *slot,
                Some(bank.parent_slot()),
                parent_slot,
            )
            .into());
        }
    }
    parent_slot = *slot;
}
This should be unnecessary given the find_heaviest_fork() checks above.
Theoretically, some dump-and-repair logic could run in between and silently change things from under us?
I thought this phase was only run after all slots had already been repaired?
Yes, but could this happen?
repaired all -> entering find heaviest fork phase -> somehow a duplicate notification comes -> dump and repair -> oops some repaired block disappeared from under us.
We should avoid duplicate notifications here; I think that would need to be handled outside this protocol.
Okay, I removed this check; we can maybe stop duplicate notifications in another PR.
Find the bank hash for the selected heaviest fork. Replay blocks if necessary and report an error if any slot is missing.