-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove status metrics for deleted resources #10597
base: main
Are you sure you want to change the base?
Conversation
Issues linked to changelog: |
/kick-ci |
/kick-ci |
Visit the preview URL for this PR (updated for commit d69d642): https://gloo-edge--pr10597-rolds-resource-metri-eq3r3xf3.web.app (expires Tue, 18 Feb 2025 21:28:22 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 Sign: 77c2b86e287749579b7ff9cadb81e099042ef677 |
/kick-ci |
/kick |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont quite understand the rationale for clearing all statuses rather than doing something a bit more targeted and am concerned it could cause some flicker in metrics.
If an extra example can be added to prove otherwise then I am onboard
@@ -402,6 +408,10 @@ func (s *statusSyncer) extractCurrentReports() (reporter.ResourceReports, map[re | |||
func (s *statusSyncer) syncStatus(ctx context.Context) error { | |||
allReports, inputResourceBySubresourceStatuses, localInputResourceLastStatus := s.extractCurrentReports() | |||
|
|||
// reset the resource metrics before we write the new statuses, this ensures that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cant this cause flickering to an operator?
Why cant we set new statuses and then clear all out of compliance ones after?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not able to clear out metrics at the label/attribute level. Only at the metric-level (view), which includes all labels).
Co-authored-by: Nathan Fudenberg <[email protected]>
Co-authored-by: Nathan Fudenberg <[email protected]>
The API provided by the metrics packages just don't provide the level of access needed to filter/remove specific label combinations from the metrics/views. I wish it wasn't that way. The same issue exists in the package we would like to move to, but it won't require a sleep reducing the data race we are both concerned about. |
Description
The status metrics for resources are left around after the resource is deleted. If the resource was invalid, the metric would continue to report invalid causing false alarms that required restarting the controller to clear.
Sleeping for 1 second was required to avoid a race condition that resulted in the new metrics getting caught up in the clearing/unregistering. I don't like adding the sleep, but that only other option is upgrading the metrics package we are using to a newer supported one (which doesn't have the race condition, the other package use a mutex).
When we upgrade to a new package we can remove this. The E2E test should have long-term value.
Checklist: