-
Notifications
You must be signed in to change notification settings - Fork 266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report ConsumeWorkerMetrics at slot transitions #3212
Conversation
self.timing_metrics.report_and_reset(&self.id); | ||
self.error_metrics.report_and_reset(&self.id); | ||
pub fn maybe_report_and_reset(&self, slot: Option<Slot>) { | ||
if slot.is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I am leader for slots [100, 101, 102, 103]
, we want to report those metrics in a timely fashion.
As is, this will only report for 103
whenever my next leader slot is. We need to handle the case where the new slot
argument is None
(indicating we are not leader).
0d9d4b5
to
848ac2c
Compare
/// Report and reset metrics when the worker did some work and: | ||
/// a) (when a leader) Previous slot is not the same as current. | ||
/// b) (when not a leader) report the metrics accumulated so far. | ||
pub fn maybe_report_and_reset(&self, slot: Option<Slot>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function may be called more than once per leader slot.
we need to check that the slot is different than the cached slot before resetting has_data
- Save the slot# while reporting in order to track slot transitions. - Remove the interval as it is not needed anymore. Fixes: anza-xyz#478
848ac2c
to
716ca7f
Compare
Hey @ksolana - I haven't actually gotten the chance to review this, but when you are addressing feedback comments, can you please push new commits instead of merging changes + force pushing ? This PR happens to smaller, but for larger reviews, being able to view only the changes in new commits is very valuable vs. having to review the whole changeset again. Plus, our repo has a rule to squash commits on merge, and you can clean up the commit title + message before you merge |
Will do. |
self.error_metrics.report_and_reset(&self.id); | ||
self.slot.swap(slot, Ordering::Relaxed); | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will spam, we need to check if there was or was not a previous slot stored in the metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
every time the scheduler loop runs it may call into this function. this function needs to check if the slot is different than what was previously there
@@ -194,7 +194,7 @@ impl ConsumeWorkerMetrics { | |||
self.error_metrics.report_and_reset(&self.id); | |||
self.slot.swap(slot, Ordering::Relaxed); | |||
} | |||
} else { | |||
} else if prev_slot_id != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to reset self.slot in here. otherwise it will still spam.
You should start up a test validator and check logs. It should not report more than once per slot.
WIP: Testing in local testnet
self.count_metrics.report_and_reset(&self.id); | ||
self.timing_metrics.report_and_reset(&self.id); | ||
self.error_metrics.report_and_reset(&self.id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to report the slot on these metrics too. otherwise we can't directly associate them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably best to add slot
in the name of the metrics as well so that they don't get mixed during transition period
This always reports zero since you never update the slot on those metrics. I strongly encourage you to test your changes. Make sure that the logs are printing, or not printing as expected.
|
Fixed.
Added the log in PR description. |
* Report ConsumeWorkerMetrics at slot transitions - Save the slot# while reporting in order to track slot transitions. - Report slot# for the three metrics - Remove the interval as it is not needed anymore. - Only report when there was a slot - Reset slot after reporting Fixes: anza-xyz#478
Problem
ConsumeWorkerMetrics are reported every 1s mostly for convenience. Having these metrics at
slot transitions is desirable.
Summary of Changes
Fixes: #478