Report ConsumeWorkerMetrics at slot transitions #3212

Merged
merged 5 commits into from
Oct 24, 2024

@ksolana ksolana commented Oct 18, 2024

Problem

ConsumeWorkerMetrics are reported every 1s mostly for convenience. Having these metrics at
slot transitions is desirable.

Summary of Changes

  • Save the slot# while reporting in order to track slot transitions.
  • Remove the interval as it is not needed anymore.

Log from test-validator with solana airdrop:

[2024-10-24T15:03:34.588387000Z INFO  solana_metrics::metrics] datapoint: banking_stage_worker_counts,id=2 transactions_attempted_processing_count=1i processed_transactions_count=1i processed_with_successful_result_count=1i retryable_transaction_count=0i retryable_expired_bank_count=0i cost_model_throttled_transactions_count=0i min_prioritization_fees=-1i max_prioritization_fees=0i slot=169i
[2024-10-24T15:03:34.588413000Z INFO  solana_metrics::metrics] datapoint: banking_stage_worker_timing,id=2 cost_model_us=28i collect_balances_us=48i load_execute_us=244i freeze_lock_us=0i record_us=58i commit_us=63i find_and_send_votes_us=44i wait_for_bank_success_us=2i wait_for_bank_failure_us=0i slot=169i
[2024-10-24T15:03:34.588430000Z INFO  solana_metrics::metrics] datapoint: banking_stage_worker_error_metrics,id=2 total=0i account_in_use=0i too_many_account_locks=0i account_loaded_twice=0i account_not_found=0i blockhash_not_found=0i blockhash_too_old=0i call_chain_too_deep=0i already_processed=0i instruction_error=0i insufficient_funds=0i invalid_account_for_fee=0i invalid_account_index=0i invalid_program_for_execution=0i invalid_compute_budget=0i not_allowed_during_cluster_maintenance=0i invalid_writable_account=0i invalid_rent_paying_account=0i would_exceed_max_block_cost_limit=0i would_exceed_max_account_cost_limit=0i would_exceed_max_vote_cost_limit=0i slot=169i

Fixes: #478

@ksolana ksolana requested review from apfitzge and steviez October 18, 2024 06:03
self.timing_metrics.report_and_reset(&self.id);
self.error_metrics.report_and_reset(&self.id);
pub fn maybe_report_and_reset(&self, slot: Option<Slot>) {
if slot.is_some() {

If I am leader for slots [100, 101, 102, 103], we want to report those metrics in a timely fashion.
As is, this will only report for 103 whenever my next leader slot is. We need to handle the case where the new slot argument is None (indicating we are not leader).
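
A minimal sketch of the behavior requested here, not the code in this PR: it replays a sequence of calls and lists the slots whose metrics would be flushed. The helper name `reported_slots` is hypothetical, and labeling the flush with the slot that just ended is an assumption of this sketch.

```rust
/// Hypothetical helper: replay a sequence of `maybe_report_and_reset`-style
/// calls and return the slots for which metrics would be flushed.
/// `None` means "not leader", mirroring the `Option<Slot>` argument.
fn reported_slots(calls: &[Option<u64>]) -> Vec<u64> {
    let mut cached: Option<u64> = None; // last leader slot seen, if any
    let mut reported = Vec::new();
    for &current in calls {
        if let Some(prev) = cached {
            // Flush when the slot changed. This includes the change from
            // "leader for `prev`" to "not leader" (`current == None`).
            if current != Some(prev) {
                reported.push(prev);
            }
        }
        cached = current;
    }
    reported
}
```

With calls for slots 100, 100, 101, 102, 103 followed by two `None` calls, the sketch flushes once each for 100, 101, 102 and 103. The flush for 103 happens on the first `None` call rather than at the next leader slot.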

@ksolana ksolana force-pushed the slot_interval branch 3 times, most recently from 0d9d4b5 to 848ac2c Compare October 18, 2024 23:56
/// Report and reset metrics when the worker did some work and:
/// a) (when a leader) Previous slot is not the same as current.
/// b) (when not a leader) report the metrics accumulated so far.
pub fn maybe_report_and_reset(&self, slot: Option<Slot>) {

this function may be called more than once per leader slot.
we need to check that the slot is different than the cached slot before resetting has_data

- Save the slot# while reporting in order to track
slot transitions.
- Remove the interval as it is not needed anymore.

Fixes: anza-xyz#478

steviez commented Oct 21, 2024

Hey @ksolana - I haven't actually gotten the chance to review this, but when you are addressing feedback comments, can you please push new commits instead of merging changes + force pushing?

This PR happens to be smaller, but for larger reviews, being able to view only the changes in new commits is very valuable vs. having to review the whole changeset again. Plus, our repo has a rule to squash commits on merge, and you can clean up the commit title + message before you merge.

ksolana commented Oct 21, 2024

Will do.

self.error_metrics.report_and_reset(&self.id);
self.slot.swap(slot, Ordering::Relaxed);
}
} else {

this will spam, we need to check if there was or was not a previous slot stored in the metrics

every time the scheduler loop runs it may call into this function. this function needs to check if the slot is different than what was previously there

@@ -194,7 +194,7 @@ impl ConsumeWorkerMetrics {
 self.error_metrics.report_and_reset(&self.id);
 self.slot.swap(slot, Ordering::Relaxed);
 }
-} else {
+} else if prev_slot_id != 0 {

you need to reset self.slot in here. otherwise it will still spam.

You should start up a test validator and check logs. It should not report more than once per slot.
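
The review points so far (report only on a slot change, do not spam while not leader, reset the cached slot) can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: `SlotGate`, `mark_work` and `maybe_report` are hypothetical names, `0` is assumed to mean "no slot cached", and the sketch returns the slot to report instead of emitting datapoints.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Hypothetical stand-in for the slot-tracking part of the worker metrics.
pub struct SlotGate {
    slot: AtomicU64,      // cached leader slot; 0 = none cached (assumption)
    has_data: AtomicBool, // set when the worker did some work
}

impl SlotGate {
    pub fn new() -> Self {
        Self {
            slot: AtomicU64::new(0),
            has_data: AtomicBool::new(false),
        }
    }

    /// Record that there is something to report.
    pub fn mark_work(&self) {
        self.has_data.store(true, Ordering::Relaxed);
    }

    /// Returns `Some(slot)` when metrics for `slot` should be flushed now.
    pub fn maybe_report(&self, slot: Option<u64>) -> Option<u64> {
        let prev = self.slot.load(Ordering::Relaxed);
        match slot {
            // Leader, and the slot changed since the last call.
            Some(current) if current != prev => {
                self.slot.store(current, Ordering::Relaxed);
                // `swap` clears the flag and returns its previous value.
                if prev != 0 && self.has_data.swap(false, Ordering::Relaxed) {
                    Some(prev)
                } else {
                    None
                }
            }
            // Leader, same slot as before: called again within the slot.
            Some(_) => None,
            // Not leader, but a slot is still cached: flush once and reset
            // the cached slot so later `None` calls do not report again.
            None if prev != 0 => {
                self.slot.store(0, Ordering::Relaxed);
                if self.has_data.swap(false, Ordering::Relaxed) {
                    Some(prev)
                } else {
                    None
                }
            }
            // Not leader and nothing cached: nothing to do.
            None => None,
        }
    }
}
```

Resetting the cached slot in the `None` branch is what keeps repeated scheduler-loop calls from reporting more than once per slot.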

WIP: Testing in local testnet
Comment on lines 192 to 194
self.count_metrics.report_and_reset(&self.id);
self.timing_metrics.report_and_reset(&self.id);
self.error_metrics.report_and_reset(&self.id);

@apfitzge apfitzge Oct 22, 2024

we need to report the slot on these metrics too. otherwise we can't directly associate them

probably best to add slot in the name of the metrics as well so that they don't get mixed during transition period

@apfitzge

f58a61f

This always reports zero since you never update the slot on those metrics.
Please follow the pattern that most other slot-based metrics take: pass the slot into the report function. It doesn't need to be stored on each of the inner metrics.

I strongly encourage you to test your changes. Make sure that the logs are printing, or not printing as expected.
You can do this by simply starting a local test-validator and sending some airdrop transactions.

[2024-10-23T20:43:09.245828000Z INFO  solana_metrics::metrics] datapoint: banking_stage_worker_counts,id=2 transactions_attempted_processing_count=1i processed_transactions_count=1i processed_with_successful_result_count=1i retryable_transaction_count=0i retryable_expired_bank_count=0i cost_model_throttled_transactions_count=0i min_prioritization_fees=-1i max_prioritization_fees=0i slot=0i
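
The "pass the slot into the report function" pattern from this comment might look roughly like the sketch below. It is an assumption-laden illustration: `CountMetrics` here has a single counter, `new` and `record_processed` are hypothetical helpers, and the datapoint is built with `format!` instead of the `solana_metrics` macros so that the example is self-contained.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical, reduced version of one inner metrics struct.
/// Note that it does not store the slot itself.
pub struct CountMetrics {
    processed_transactions_count: AtomicU64,
}

impl CountMetrics {
    pub fn new() -> Self {
        Self {
            processed_transactions_count: AtomicU64::new(0),
        }
    }

    /// Hypothetical helper used to accumulate a count.
    pub fn record_processed(&self, n: u64) {
        self.processed_transactions_count
            .fetch_add(n, Ordering::Relaxed);
    }

    /// The caller supplies the slot at report time. The counter is read and
    /// reset in one step with `swap`, and the slot is attached as a field.
    pub fn report_and_reset(&self, id: &str, slot: u64) -> String {
        let processed = self.processed_transactions_count.swap(0, Ordering::Relaxed);
        format!(
            "banking_stage_worker_counts,id={} processed_transactions_count={}i slot={}i",
            id, processed, slot
        )
    }
}
```

Because the slot is an argument rather than a field, it cannot go stale on the inner metrics, which is the failure mode behind `slot=0i` in the log above.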

ksolana commented Oct 24, 2024

> f58a61f
>
> This always reports zero since you never update the slot on those metrics. Please follow the pattern that most other slot-based metrics take: pass the slot into the report function. It doesn't need to be stored on each of the inner metrics.

Fixed.

> I strongly encourage you to test your changes. Make sure that the logs are printing, or not printing as expected. You can do this by simply starting a local test-validator and sending some airdrop transactions.
>
> [2024-10-23T20:43:09.245828000Z INFO  solana_metrics::metrics] datapoint: banking_stage_worker_counts,id=2 transactions_attempted_processing_count=1i processed_transactions_count=1i processed_with_successful_result_count=1i retryable_transaction_count=0i retryable_expired_bank_count=0i cost_model_throttled_transactions_count=0i min_prioritization_fees=-1i max_prioritization_fees=0i slot=0i

Added the log in PR description.

@ksolana ksolana merged commit 415a78a into anza-xyz:master Oct 24, 2024
40 checks passed
@ksolana ksolana deleted the slot_interval branch October 24, 2024 23:38
ray-kast pushed a commit to abklabs/agave that referenced this pull request Nov 27, 2024
* Report ConsumeWorkerMetrics at slot transitions

- Save the slot# while reporting in order to track
slot transitions.
- Report slot# for the three metrics
- Remove the interval as it is not needed anymore.
- Only report when there was a slot
- Reset slot after reporting

Fixes: anza-xyz#478
Development

Successfully merging this pull request may close these issues.

ConsumeWorkerMetrics: slot level reporting
3 participants