validator: Split PoH speed measurement from check #4185
base: master
Changes from all commits: 3d1b6b2, eb8acf8, 5dc9a80, ac73c67
@@ -152,6 +152,12 @@ use {
     tokio::runtime::Runtime as TokioRuntime,
 };
 
+// The number of hashes used to sample PoH speed, roughly the current
+// hashes-per-second rate on Solana MNB, Testnet, and Devnet. If the hash rate
+// on these clusters changes, we might consider updating this constant.
+// However, the PoH speed check compares hashes / second, so this constant
+// does not have to be updated.
+const POH_SPEED_CHECK_NUM_HASHES: u64 = 10_000_000;
+
 const MAX_COMPLETED_DATA_SETS_IN_CHANNEL: usize = 100_000;
 const WAIT_FOR_SUPERMAJORITY_THRESHOLD_PERCENT: u64 = 80;
 // Right now since we reuse the wait for supermajority code, the
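Because the check compares rates rather than raw hash counts, this constant mostly determines how long the sample runs, not the value it produces. A standalone illustration of that arithmetic (not the PR's code; the machine rate is hypothetical):

```rust
// Illustration only: the measured value is a rate (hashes / elapsed time),
// so it recovers the machine's true speed regardless of the sample size;
// the sample size only changes how long the measurement takes.
fn main() {
    let machine_rate: f64 = 20_000_000.0; // hypothetical 20M hashes/second
    for num_hashes in [1_000_000u64, 10_000_000, 100_000_000] {
        let elapsed_secs = num_hashes as f64 / machine_rate;
        let measured_rate = (num_hashes as f64 / elapsed_secs) as u64;
        println!("{num_hashes} hashes: {elapsed_secs:.2}s, measured {measured_rate} hashes/s");
    }
}
```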
@@ -623,6 +629,29 @@ impl Validator {
 
         let start_time = Instant::now();
 
+        if !ledger_path.is_dir() {
+            return Err(anyhow!(
+                "ledger directory does not exist or is not accessible: {ledger_path:?}"
+            ));
+        }
+        let genesis_config = load_genesis(config, ledger_path)?;
+        metrics_config_sanity_check(genesis_config.cluster_type)?;
+
+        // Measure the PoH hash rate as early as possible to minimize
+        // contention with other threads
+        //
+        // Skip the check for Development clusters as this will be the type for
+        // tests or for clusters that we don't know what the rate will be
+        // without a rebuilt Bank

Review thread (on the comment above):
Reviewer: isn't the real reason that we don't want to pay the time cost for local cluster tests and such?
Author: Kind of, a couple things a
I can look at tweaking the comment a bit more

+        let my_poh_hashes_per_second =
+            match (genesis_config.cluster_type, !config.no_poh_speed_test) {

Review thread (on the match above):
Reviewer: matching !no with false arm breaks my brain
Author: Ha, I initially did the following but cleared it:

    let do_speed_test = !config.no_poh_speed_test;
    let my_poh_hashes_per_second = match ...

Naming a separate variable makes the negation more visible, so I can switch it back if that's agreeable (see the sketch after this hunk).
Reviewer: ha, I'm cool with whatever

+                (_, false) | (ClusterType::Development, _) => {
+                    info!("Skipping the PoH speed check");
+                    None
+                }
+                (_, true) => Some(measure_poh_speed(POH_SPEED_CHECK_NUM_HASHES)),
+            };
+
         let thread_manager = ThreadManager::new(&config.thread_manager_config)?;
         // Initialize the global rayon pool first to ensure the value in config
         // is honored. Otherwise, some code accessing the global pool could
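A minimal, self-contained sketch of the named-boolean alternative discussed in the thread above. The stub types and the `do_poh_speed_test` name are illustrative stand-ins, not the PR's actual code:

```rust
// Stand-ins for the validator's real types, just so the sketch compiles.
#[allow(dead_code)]
enum ClusterType { Development, MainnetBeta }

struct Config { no_poh_speed_test: bool }

const POH_SPEED_CHECK_NUM_HASHES: u64 = 10_000_000;

fn measure_poh_speed(_num_hashes: u64) -> u64 { 10_000_000 } // stub measurement

fn main() {
    let config = Config { no_poh_speed_test: false };
    let cluster_type = ClusterType::MainnetBeta;

    // Hoisting the negation into a named boolean keeps the match arms
    // positive, which was the readability concern in the thread.
    let do_poh_speed_test = !config.no_poh_speed_test;
    let my_poh_hashes_per_second = match (cluster_type, do_poh_speed_test) {
        (ClusterType::Development, _) | (_, false) => None,
        (_, true) => Some(measure_poh_speed(POH_SPEED_CHECK_NUM_HASHES)),
    };
    println!("{my_poh_hashes_per_second:?}");
}
```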
@@ -703,14 +732,6 @@ impl Validator {
         sigverify::init();
         info!("Initializing sigverify done.");
 
-        if !ledger_path.is_dir() {
-            return Err(anyhow!(
-                "ledger directory does not exist or is not accessible: {ledger_path:?}"
-            ));
-        }
-        let genesis_config = load_genesis(config, ledger_path)?;
-        metrics_config_sanity_check(genesis_config.cluster_type)?;
-
         info!("Cleaning accounts paths..");
         *start_progress.write().unwrap() = ValidatorStartProgress::CleaningAccounts;
         let mut timer = Measure::start("clean_accounts_paths");
@@ -835,8 +856,11 @@ impl Validator {
         )
         .map_err(ValidatorError::Other)?;
 
-        if !config.no_poh_speed_test {
-            check_poh_speed(&bank_forks.read().unwrap().root_bank(), None)?;
+        if let Some(my_hashes_per_second) = my_poh_hashes_per_second {
+            check_poh_speed(
+                &bank_forks.read().unwrap().root_bank(),
+                my_hashes_per_second,
+            )?;
         }
 
         let (root_slot, hard_forks) = {
@@ -1835,19 +1859,25 @@ fn active_vote_account_exists_in_bank(bank: &Bank, vote_account: &Pubkey) -> bool
     false
 }
 
-fn check_poh_speed(bank: &Bank, maybe_hash_samples: Option<u64>) -> Result<(), ValidatorError> {
+// Compute the PoH speed (hashes / second) by measuring the time required to
+// compute `num_hashes` hashes
+fn measure_poh_speed(num_hashes: u64) -> u64 {
+    let hash_time = compute_hash_time(num_hashes);
+    (num_hashes as f64 / hash_time.as_secs_f64()) as u64
+}
+
+// Compare a computed PoH speed against the target value derived from a Bank
+// and error if the computed PoH speed is less than the target speed
+//
+// Measurement and comparison are split so that the measurement can occur early
+// in validator startup before other services start competing for CPU time
+fn check_poh_speed(bank: &Bank, my_hashes_per_second: u64) -> Result<(), ValidatorError> {
     let Some(hashes_per_tick) = bank.hashes_per_tick() else {
         warn!("Unable to read hashes per tick from Bank, skipping PoH speed check");
         return Ok(());
     };
 
     let ticks_per_slot = bank.ticks_per_slot();
     let hashes_per_slot = hashes_per_tick * ticks_per_slot;
-    let hash_samples = maybe_hash_samples.unwrap_or(hashes_per_slot);
-
-    let hash_time = compute_hash_time(hash_samples);
-    let my_hashes_per_second = (hash_samples as f64 / hash_time.as_secs_f64()) as u64;
 
     let target_slot_duration = Duration::from_nanos(bank.ns_per_slot as u64);
     let target_hashes_per_second =
         (hashes_per_slot as f64 / target_slot_duration.as_secs_f64()) as u64;
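For intuition, here is a standalone approximation of what the measurement half does. I'm assuming `compute_hash_time` times a sequential SHA-256 hash chain (PoH is a sha256 chain); this sketch uses the `sha2` crate, and the real helper in the validator crate may differ in detail:

```rust
use sha2::{Digest, Sha256};
use std::time::{Duration, Instant};

// Time a chain of `num_hashes` sequential SHA-256 hashes. This mirrors what
// the PR's compute_hash_time is assumed to do.
fn compute_hash_time(num_hashes: u64) -> Duration {
    let mut hash = [0u8; 32];
    let start = Instant::now();
    for _ in 0..num_hashes {
        hash = Sha256::digest(hash).into();
    }
    let elapsed = start.elapsed();
    std::hint::black_box(hash); // keep the chain from being optimized away
    elapsed
}

// Same normalization as the PR: convert the timed sample into hashes/second.
fn measure_poh_speed(num_hashes: u64) -> u64 {
    let hash_time = compute_hash_time(num_hashes);
    (num_hashes as f64 / hash_time.as_secs_f64()) as u64
}

fn main() {
    // A smaller sample than the PR's 10M, to keep the demo quick.
    println!("{} hashes/second", measure_poh_speed(1_000_000));
}
```

This assumes `sha2 = "0.10"` in Cargo.toml. The key property is that the return value is a rate, so results from different sample sizes are directly comparable.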
@@ -3285,7 +3315,7 @@ mod tests {
             ..GenesisConfig::default()
         };
         let bank = Bank::new_for_tests(&genesis_config);
-        assert!(check_poh_speed(&bank, Some(10_000)).is_err());
+        assert!(check_poh_speed(&bank, measure_poh_speed(10_000)).is_err());
     }
 
     #[test]
@@ -3301,6 +3331,6 @@ mod tests {
             ..GenesisConfig::default()
         };
         let bank = Bank::new_for_tests(&genesis_config);
-        check_poh_speed(&bank, Some(10_000)).unwrap();
+        check_poh_speed(&bank, measure_poh_speed(10_000)).unwrap();
     }
 }
Reviewer: keeping this in sync seems a little fraught. if we don't really care about the cluster value(s) and we're out of the critical path, would it make sense to just choose a much larger number so that we get a more accurate measure?

might make sense to eventually stuff this into a separate snapshot field so we can peek it out instead of waiting for a bank
Author: A much larger number would hypothetically get us a more accurate number at the cost of time spent performing the calculation. I.e., a node that just meets spec would spend 1s hashing; 10x'ing this number 10x's the time spent (10s, or 25 slots).

I think the main point of my comment was that since we normalize to hashes per second, keeping the number up to date isn't critical. Maybe the consideration would be if we hit 100M hashes/second: then 10M hashes (0.1s) is too short of a period and we'd see significant jitter.
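To make that tradeoff concrete, a small illustrative calculation (not from the PR; 400ms slots and the hash rates are assumed):

```rust
// Illustrative arithmetic for the tradeoff above: sample duration is
// num_hashes / machine_rate, so 10x the constant means 10x the startup cost,
// while a much faster future machine shrinks the sample window (more jitter).
fn main() {
    const SLOT_SECS: f64 = 0.4; // assumed 400ms slots
    let cases = [
        (10_000_000u64, 10_000_000u64), // spec node, current constant
        (100_000_000, 10_000_000),      // 10x constant on a spec node
        (10_000_000, 100_000_000),      // hypothetical 100M h/s node
    ];
    for (num_hashes, rate) in cases {
        let secs = num_hashes as f64 / rate as f64;
        println!(
            "{num_hashes} hashes at {rate} h/s: {secs:.1}s (~{:.1} slots)",
            secs / SLOT_SECS
        );
    }
}
```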
Author: Any thoughts on the issue I called out in #4185 (comment)?
Reviewer: should we reference UPDATED_HASHES_PER_SECOND_6? https://github.com/anza-xyz/agave/blob/master/sdk/clock/src/lib.rs#L61

Author: Ohh yeah, and that is a public constant; I'll use that directly.
Reviewer: UPDATED_HASHES_PER_SECOND_6 is a cluster configuration parameter and should by no means be public in the first place, despite our rampant abuse of similar patterns. please do not add more cases that we need to fix

Reviewer: yeah. i think local cluster is a scourge and needs to die as soon as possible. milady

Reviewer: i guess what i was getting at is that the comment is kinda ambivalent, like a "oh yeah hey maybe keep an eye on this", whereas ideally it'd either be direct and enforced with a static assertion, or we choose a value that's sufficiently large to get us a good hashes/second measurement, regardless of the cluster config
Author: Agreed, although in our case the "correct" value is pulled from a Bank at runtime. There's nothing we can check at compile time, so not much we can do here.

Anecdotally, I think the max value I've seen in Discord is something like 20M-25M hashes per second. Assuming 25M hashes/second became the new cluster hash rate, 10M hashes is still about 1 slot's worth, which is what the check previously ran with. So I think 10M hashes is a reasonable balance of do-not-spend-too-much-time and number-that-is-somewhat-future-proof (on the order of years, given the recent trajectory).
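For what it's worth, that claim checks out arithmetically, assuming 400ms slots:

```rust
// Sanity check for the future-proofing claim above (assumed 400ms slots):
// at a hypothetical 25M hashes/second, the 10M-hash sample takes 0.4s,
// i.e. about one slot, which matches the old one-slot sample size.
fn main() {
    let sample_secs = 10_000_000.0_f64 / 25_000_000.0; // 0.4s at 25M h/s
    let slots = sample_secs / 0.4;                     // ~1 slot
    println!("10M hashes at 25M h/s: {sample_secs}s (~{slots} slot)");
}
```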