validator: Split PoH speed measurement from check #4185
base: master
Conversation
Force-pushed from e851f08 to be4ac16
Checking the PoH speed requires a Bank to derive the cluster hash rate. By the time a Bank is available, many services have started up and are competing for CPU time. So, split the PoH speed check: measure the speed very early in Validator::new(), and check the value later once a Bank is available.
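To make the shape of the split concrete, here is a minimal sketch (the hashing primitive and function name are stand-ins, not the patch's actual code): the measurement runs at the very top of Validator::new(), and only the comparison is deferred until a Bank exists.

```rust
use sha2::{Digest, Sha256};
use std::time::Instant;

const POH_SPEED_CHECK_NUM_HASHES: u64 = 10_000_000;

/// Chain SHA-256 for a fixed number of hashes and normalize to hashes per
/// second. Running this first thing in Validator::new() means almost no other
/// threads exist yet to compete for CPU time; the result is stashed and only
/// compared against the cluster rate later, once a Bank is available.
fn measure_poh_speed(num_hashes: u64) -> u64 {
    let mut hash = [0u8; 32];
    let start = Instant::now();
    for _ in 0..num_hashes {
        hash = Sha256::digest(hash).into();
    }
    (num_hashes as f64 / start.elapsed().as_secs_f64()) as u64
}
```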
Force-pushed from be4ac16 to 3d1b6b2
The change shows clear improvement for the production case. However, it is problematic for some test situations. A few alternatives I see:
I'm leaning towards the option to use 10M for our clusters and fall back to genesis for Development.
// If the hash rate on these clusters changes, we might consider updating this
// constant. However, the PoH speed check compares hashes / second so this
// constant does not have to be updated
const POH_SPEED_CHECK_NUM_HASHES: u64 = 10_000_000;
keeping this in sync seems a little fraught. if we don't really care about the cluster value(s) and we're out of the critical path, would it make sense to just choose a much larger number so that we get a more accurate measure?
might make sense to eventually stuff this into a separate snapshot field so we can peek it out instead of waiting for a bank
A much larger number would hypothetically get us a more accurate number at the cost of time spent performing the calculation. I.e., a node that just meets spec would spend 1s hashing; 10x'ing this number 10x's that time spent (10s or 25 slots).
I think the main point of my comment was that since we normalize to hashes per second, keeping the number up to date isn't critical. Maybe the consideration would be if we hit 100M hashes/second, then 10M hashes (0.1 s) is too short of a period and we'd see significant jitter.
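Working those numbers out (assuming the 400 ms slots implied by "10s or 25 slots" above): a node at exactly 10M hashes/s spends 10M / 10M = 1 s (2.5 slots) on the sample, while a hypothetical 100M hashes/s node finishes the same 10M hashes in 0.1 s, a window short enough that scheduler noise could dominate the measurement.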
Any thoughts on the issue I called out in #4185 (comment)?
should we reference UPDATED_HASHES_PER_SECOND_6? https://github.com/anza-xyz/agave/blob/master/sdk/clock/src/lib.rs#L61
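For illustration, the suggestion would amount to something like the following (import path assumed; as the replies below note, this direction was ultimately rejected):

```rust
use solana_sdk::clock::UPDATED_HASHES_PER_SECOND_6; // assumed path

const POH_SPEED_CHECK_NUM_HASHES: u64 = UPDATED_HASHES_PER_SECOND_6;
```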
Ohh yeah, and that is a public constant; I'll use that directly
UPDATED_HASHES_PER_SECOND_6 is a cluster configuration parameter and should by no means be public in the first place despite our rampant abuse of similar patterns. please do not add more cases that we need to fix
> Any thoughts on the issue I called out in #4185 (comment)?
yeah. i think local cluster is a scourge and needs to die as soon as possible. milady
> A much larger number would hypothetically get us a more accurate number at the cost of time spent performing the calculation. I.e., a node that just meets spec would spend 1s hashing; 10x'ing this number 10x's that time spent (10s or 25 slots).
> I think the main point of my comment was that since we normalize to hashes per second, keeping the number up to date isn't critical. Maybe the consideration would be if we hit 100M hashes/second, then 10M hashes (0.1 s) is too short of a period and we'd see significant jitter.
i guess what i was getting at is that the comment is kinda ambivalent and like a "oh yeah hey maybe keep an eye on this" whereas ideally it'd either be direct and enforced with a static assertion or we choose a value that's sufficiently large to get us a good hashes/second measurement, regardless of the cluster config
> ideally it'd either be direct and enforced with a static assertion
Agreed, although in our case, the "correct" value is pulled from a Bank at runtime. Nothing we can check at compile time, so not much we can do here.
> or we choose a value that's sufficiently large to get us a good hashes/second measurement, regardless of the cluster config
Anecdotally, I think the max value I've seen in Discord is something like 20M-25M hashes per second. Assuming 25M hashes/second became the new cluster hash rate, 10M hashes is still 1 slot's worth, which is what the test previously ran with. So, I think 10M hashes is a reasonable balance of do-not-spend-too-much-time and number-that-is-somewhat-future-proof (on the order of years, given the recent trajectory).
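For reference, a minimal sketch of the deferred half of the split (the real check_poh_speed() derives its inputs from the Bank; the parameterization here is illustrative):

```rust
/// Compare the startup measurement against the cluster's required rate.
/// hashes_per_tick and ticks_per_slot would come from the Bank, and
/// slot_duration_secs would be 0.4 for a 400 ms slot.
fn check_poh_speed(
    my_hashes_per_second: u64,
    hashes_per_tick: u64,
    ticks_per_slot: u64,
    slot_duration_secs: f64,
) -> Result<(), String> {
    let cluster_hashes_per_second =
        (hashes_per_tick * ticks_per_slot) as f64 / slot_duration_secs;
    if (my_hashes_per_second as f64) < cluster_hashes_per_second {
        return Err(format!(
            "PoH speed too slow: measured {my_hashes_per_second} hashes/s, \
             cluster requires {cluster_hashes_per_second:.0} hashes/s"
        ));
    }
    Ok(())
}
```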
I'm fine just skipping the PoH check for Development. I have mixed thoughts on the overall decoupling...
We could potentially adjust the speed check to create a new thread, set its affinity, run the speed check, and immediately join on that thread. This would happen after the pinned core is known. I don't believe that pinning this short-lived PoH speed check thread to the same core as the PoH service would be a problem.
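A rough sketch of that idea, assuming the core_affinity crate and the measure_poh_speed() stand-in from earlier: spawn a thread, pin it, measure, and join immediately so nothing else inherits the pinning.

```rust
fn measure_poh_speed_pinned(core_index: usize, num_hashes: u64) -> Option<u64> {
    // Pick the core to pin to; bail out if cores cannot be enumerated
    let core_id = *core_affinity::get_core_ids()?.get(core_index)?;
    std::thread::spawn(move || {
        // Best effort; the measurement still runs unpinned if this fails
        let _pinned = core_affinity::set_for_current(core_id);
        measure_poh_speed(num_hashes)
    })
    .join()
    .ok()
}
```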
I see, makes sense. In this way, the change to run before other services start probably provides a hash rate more reflective of reality (when the PoH core is pinned), right? A comment by the PoH speed check summarizing this reasoning would probably be helpful.
Correct; there should be little to no "competition" for CPU time since there shouldn't be a glob of threads created yet.
I had a comment with the intent to say this, but I apparently didn't finish writing it 😅 :
Assuming I finish the comment and given what you know now, are you still in favor of the split approach with the noop for Development?
Yes 👍
There is a comment about why the measure/check is split in the comments for check_poh_speed(); but I decided to include a short note about where we actually call measure() in Validator::new(). The extra visibility will hopefully prevent others from shuffling it around.
Called out a few minor things
//
// Skip the check for Development clusters as this will be the type for
// tests or for clusters that we don't know what the rate will be
// without a rebuilt Bank
isn't the real reason that we don't want to pay the time cost for local cluster tests and such?
Because we end up pulling the rate from the root bank later
Kind of, a couple of things:
- Unit tests are run in debug mode ... doing 10M hashes in debug mode on a local machine, potentially for multiple tests in parallel, is not a good time
- You're correct that we pull the value from the Bank later, and because of that, we don't necessarily want to spend the time to hash 10M hashes (in hopefully a second) if the genesis config dictates we only need to be able to hit 100k.
I can look at tweaking the comment a bit more
// tests or for clusters that we don't know what the rate will be
// without a rebuilt Bank
let my_poh_hashes_per_second =
    match (genesis_config.cluster_type, !config.no_poh_speed_test) {
matching `!no` with a `false` arm breaks my brain
Ha, I initially did the following but cleared it:
let do_speed_test = !config.no_poh_speed_test;
let my_poh_hashes_per_second = match ...
Naming a separate variable makes the negation more visible, so I can switch it back if that's agreeable.
ha, I'm cool with whatever
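For completeness, a hypothetical sketch of that variant against the excerpt above (ClusterType comes from the genesis config; the arm bodies are assumptions for illustration, not the patch's exact logic):

```rust
let do_poh_speed_test = !config.no_poh_speed_test;
let my_poh_hashes_per_second =
    match (genesis_config.cluster_type, do_poh_speed_test) {
        // Development covers tests and unknown clusters; skip the measurement
        (ClusterType::Development, _) | (_, false) => None,
        (_, true) => Some(measure_poh_speed(POH_SPEED_CHECK_NUM_HASHES)),
    };
```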
Problem
Checking the PoH speed requires a Bank to derive the cluster hash rate as of #2447. By the time a Bank is available, many services have started up and are competing for CPU time.
Summary of Changes
So, split the PoH speed check: measure the speed very early in Validator::new(), and check the value later once a Bank is available.
Without the change: (startup hash-rate measurements elided)
With the change: (startup hash-rate measurements elided)
As you can see, there is a lot more variation without this change.