validator: Split PoH speed measurement from check #4185

Open · wants to merge 4 commits into base: master
68 changes: 49 additions & 19 deletions core/src/validator.rs
@@ -152,6 +152,12 @@ use {
tokio::runtime::Runtime as TokioRuntime,
};

// The current number of hashes per second on Solana MNB, Testnet, and Devnet
// If the hash rate on these clusters changes, we might consider updating this
// constant. However, the PoH speed check compares hashes / second so this
// constant does not have to be updated
const POH_SPEED_CHECK_NUM_HASHES: u64 = 10_000_000;

keeping this in sync seems a little fraught. if we don't really care about the cluster value(s) and we're out of the critical path, would it make sense to just choose a much larger number so that we get a more accurate measure?

might make sense to eventually stuff this into a separate snapshot field so we can peek it out instead of waiting for a bank

Author:

A much larger number would hypothetically get us a more accurate number at the cost of time spent performing the calculation. I.e., a node that just meets spec would spend 1s hashing; 10x'ing this number 10x's that time spent (10s or 25 slots).

I think the main point of my comment was that since we normalize to hashes per second, keeping the number up to date isn't critical. Maybe the consideration would be if we hit 100M hashes/second, then 10M hashes (0.1s) is too short a period and we'd see significant jitter.

Author:

Any thoughts on the issue I called out in #4185 (comment)?


should we reference UPDATED_HASHES_PER_SECOND_6? https://github.com/anza-xyz/agave/blob/master/sdk/clock/src/lib.rs#L61

Author:

Ohh yeah, and that is a public constant; I'll use that directly


UPDATED_HASHES_PER_SECOND_6 is a cluster configuration parameter and should by no means be public in the first place despite our rampant abuse of similar patterns. please do not add more cases that we need to fix

Any thoughts on the issue I called out in #4185 (comment)?

yeah. i think local cluster is a scourge and needs to die as soon as possible. milady

A much larger number would hypothetically get us a more accurate number at the cost of time spent performing the calculation. I.e., a node that just meets spec would spend 1s hashing; 10x'ing this number 10x's that time spent (10s or 25 slots).

I think the main point of my comment was that since we normalize to hashes per second, keeping the number up to date isn't critical. Maybe the consideration would be if we hit 100M hashes/second, then 10M hashes (0.1s) is too short a period and we'd see significant jitter.

i guess what i was getting at is that the comment is kinda ambivalent, like a "oh yeah hey maybe keep an eye on this", whereas ideally it'd either be direct and enforced with a static assertion, or we choose a value that's sufficiently large to get us a good hashes/second measurement regardless of the cluster config
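For illustration, a static assertion of the kind mentioned here only works when both sides are compile-time constants. A hypothetical sketch (the constant value matches the diff above; `MIN_MEASUREMENT_HASHES` is an invented bound for illustration only):

```rust
// Hypothetical sketch: assert! in const context (stable since Rust 1.57)
// can enforce a relationship between two compile-time constants.
const POH_SPEED_CHECK_NUM_HASHES: u64 = 10_000_000;
// Invented lower bound, purely for illustration.
const MIN_MEASUREMENT_HASHES: u64 = 1_000_000;

// Fails to compile if the measurement window is configured too small.
const _: () = assert!(POH_SPEED_CHECK_NUM_HASHES >= MIN_MEASUREMENT_HASHES);

fn main() {}
```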

Author:

ideally it'd either be direct and enforced with a static assertion

Agreed, although in our case, the "correct" value is pulled from a Bank at runtime. Nothing we can check at compile time so not much we can do here

or we choose a value that's sufficiently large to get us a good hashes/second measurement, regardless the cluster config

Anecdotally, I think the max value I've seen in Discord is something like 20M-25M hashes per second. Assuming 25M hashes/second became the new cluster hash rate, 10M hashes is still 1 slot's worth, which is what the test previously ran with. So I think 10M hashes is a reasonable balance of do-not-spend-too-much-time and a-number-that-is-somewhat-future-proof (on the order of years given the recent trajectory).
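The time-cost arithmetic in this thread can be sketched directly (a standalone illustration; the 400ms slot time and the hash rates are the figures discussed above):

```rust
// How long a fixed-size measurement takes at a given hash rate, and how
// many 400ms slots that corresponds to. Standalone illustration of the
// arithmetic in the thread above.
fn measurement_secs(num_hashes: u64, hashes_per_second: u64) -> f64 {
    num_hashes as f64 / hashes_per_second as f64
}

fn main() {
    const NUM_HASHES: u64 = 10_000_000;
    const SLOT_SECS: f64 = 0.4;
    for rate in [10_000_000u64, 25_000_000, 100_000_000] {
        let secs = measurement_secs(NUM_HASHES, rate);
        // 10M h/s -> 1.00s (~2.5 slots); 25M -> 0.40s (~1 slot);
        // 100M -> 0.10s, where jitter becomes a concern.
        println!("{rate:>11} h/s -> {secs:.2}s (~{:.1} slots)", secs / SLOT_SECS);
    }
}
```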


const MAX_COMPLETED_DATA_SETS_IN_CHANNEL: usize = 100_000;
const WAIT_FOR_SUPERMAJORITY_THRESHOLD_PERCENT: u64 = 80;
// Right now since we reuse the wait for supermajority code, the
@@ -623,6 +629,29 @@ impl Validator {

let start_time = Instant::now();

if !ledger_path.is_dir() {
return Err(anyhow!(
"ledger directory does not exist or is not accessible: {ledger_path:?}"
));
}
let genesis_config = load_genesis(config, ledger_path)?;
metrics_config_sanity_check(genesis_config.cluster_type)?;

// Measure the PoH hash rate as early as possible to minimize
// contention with other threads
//
// Skip the check for Development clusters as this will be the type for
// tests or for clusters that we don't know what the rate will be
// without a rebuilt Bank


isn't the real reason that we don't want to pay the time cost for local cluster tests and such?
Because we end up pulling the rate from the root bank later

Author:

Kind of, a couple of things:

  • Unit tests are run in debug mode ... doing 10M hashes in debug mode on a local machine, potentially for multiple tests in parallel, is not a good time
  • You're correct that we pull the value from the Bank later, and because of that, we don't necessarily want to spend the time to hash 10M hashes (in hopefully a second) if the genesis config dictates we only need to be able to hit 100k.

I can look at tweaking the comment a bit more
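The measure-then-normalize pattern under discussion can be sketched self-contained. Here std's `DefaultHasher` stands in for the real SHA-256 PoH loop (an assumption for portability; the actual hashing lives in `compute_hash_time`):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Instant;

// Measure-then-normalize sketch: hash a fixed count, time it, and report
// hashes/second. DefaultHasher is a stand-in for the real PoH SHA-256 chain.
fn measure_hash_rate(num_hashes: u64) -> u64 {
    let start = Instant::now();
    let mut state: u64 = 0;
    for _ in 0..num_hashes {
        let mut hasher = DefaultHasher::new();
        state.hash(&mut hasher);
        state = hasher.finish();
    }
    // Normalizing to hashes/second keeps the result comparable even if the
    // fixed hash count changes later.
    (num_hashes as f64 / start.elapsed().as_secs_f64()) as u64
}

fn main() {
    println!("~{} hashes/second", measure_hash_rate(1_000_000));
}
```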

let my_poh_hashes_per_second =
match (genesis_config.cluster_type, !config.no_poh_speed_test) {


matching !no with false arm breaks my brain

Author:

Ha, I initially did the following but cleared it

let do_speed_test = !config.no_poh_speed_test;
let my_poh_hashes_per_second = match ...

Naming a separate variable makes the negation more visible, so I can switch it back if that's agreeable


ha, I'm cool with whatever

(_, false) | (ClusterType::Development, _) => {
info!("Skipping the PoH speed check");
None
}
(_, true) => Some(measure_poh_speed(POH_SPEED_CHECK_NUM_HASHES)),
};

let thread_manager = ThreadManager::new(&config.thread_manager_config)?;
// Initialize the global rayon pool first to ensure the value in config
// is honored. Otherwise, some code accessing the global pool could
@@ -703,14 +732,6 @@ impl Validator {
sigverify::init();
info!("Initializing sigverify done.");

if !ledger_path.is_dir() {
return Err(anyhow!(
"ledger directory does not exist or is not accessible: {ledger_path:?}"
));
}
let genesis_config = load_genesis(config, ledger_path)?;
metrics_config_sanity_check(genesis_config.cluster_type)?;

info!("Cleaning accounts paths..");
*start_progress.write().unwrap() = ValidatorStartProgress::CleaningAccounts;
let mut timer = Measure::start("clean_accounts_paths");
@@ -835,8 +856,11 @@
)
.map_err(ValidatorError::Other)?;

if !config.no_poh_speed_test {
check_poh_speed(&bank_forks.read().unwrap().root_bank(), None)?;
if let Some(my_hashes_per_second) = my_poh_hashes_per_second {
check_poh_speed(
&bank_forks.read().unwrap().root_bank(),
my_hashes_per_second,
)?;
}

let (root_slot, hard_forks) = {
@@ -1835,19 +1859,25 @@ fn active_vote_account_exists_in_bank(bank: &Bank, vote_account: &Pubkey) -> bool {
false
}

fn check_poh_speed(bank: &Bank, maybe_hash_samples: Option<u64>) -> Result<(), ValidatorError> {
// Compute the PoH speed (hashes / second) by measuring the time required to
// compute `num_hashes` hashes
fn measure_poh_speed(num_hashes: u64) -> u64 {
let hash_time = compute_hash_time(num_hashes);
(num_hashes as f64 / hash_time.as_secs_f64()) as u64
}

// Compare a computed PoH speed against the target value derived from a Bank
// and error if the computed PoH speed is less than the target speed
//
// Measurement and comparison are split so that the measurement can occur early
// in validator startup before other services start competing for CPU time
fn check_poh_speed(bank: &Bank, my_hashes_per_second: u64) -> Result<(), ValidatorError> {
let Some(hashes_per_tick) = bank.hashes_per_tick() else {
warn!("Unable to read hashes per tick from Bank, skipping PoH speed check");
return Ok(());
};

let ticks_per_slot = bank.ticks_per_slot();
let hashes_per_slot = hashes_per_tick * ticks_per_slot;
let hash_samples = maybe_hash_samples.unwrap_or(hashes_per_slot);

let hash_time = compute_hash_time(hash_samples);
let my_hashes_per_second = (hash_samples as f64 / hash_time.as_secs_f64()) as u64;

let target_slot_duration = Duration::from_nanos(bank.ns_per_slot as u64);
let target_hashes_per_second =
(hashes_per_slot as f64 / target_slot_duration.as_secs_f64()) as u64;
@@ -3285,7 +3315,7 @@ mod tests {
..GenesisConfig::default()
};
let bank = Bank::new_for_tests(&genesis_config);
assert!(check_poh_speed(&bank, Some(10_000)).is_err());
assert!(check_poh_speed(&bank, measure_poh_speed(10_000)).is_err());
}

#[test]
@@ -3301,6 +3331,6 @@
..GenesisConfig::default()
};
let bank = Bank::new_for_tests(&genesis_config);
check_poh_speed(&bank, Some(10_000)).unwrap();
check_poh_speed(&bank, measure_poh_speed(10_000)).unwrap();
}
}