Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove status metrics for deleted resources #10597

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

ryanrolds
Copy link

@ryanrolds ryanrolds commented Feb 4, 2025

Description

The status metrics for resources are left around after the resource is deleted. If the resource was invalid, the metric would continue to report invalid causing false alarms that required restarting the controller to clear.

Sleeping for 1 second was required to avoid a race condition that resulted in the new metrics getting caught up in the clearing/unregistering. I don't like adding the sleep, but that only other option is upgrading the metrics package we are using to a newer supported one (which doesn't have the race condition, the other package use a mutex).

When we upgrade to a new package we can remove this. The E2E test should have long-term value.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

@solo-changelog-bot
Copy link

Issues linked to changelog:
kgateway-dev#6938

@ryanrolds ryanrolds changed the title Clear resource metrics to remove metrics for deleted resources Remove status metrics for deleted resources Feb 5, 2025
@ryanrolds ryanrolds marked this pull request as ready for review February 5, 2025 17:22
@ryanrolds
Copy link
Author

/kick-ci

@ryanrolds
Copy link
Author

/kick-ci

Copy link

github-actions bot commented Feb 5, 2025

Visit the preview URL for this PR (updated for commit d69d642):

https://gloo-edge--pr10597-rolds-resource-metri-eq3r3xf3.web.app

(expires Tue, 18 Feb 2025 21:28:22 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: 77c2b86e287749579b7ff9cadb81e099042ef677

@ryanrolds
Copy link
Author

/kick-ci

@ryanrolds
Copy link
Author

/kick

Copy link
Collaborator

@nfuden nfuden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont quite understand the rationale for clearing all statuses rather than doing something a bit more targeted and am concerned it could cause some flicker in metrics.

If an extra example can be added to prove otherwise then I am onboard

pkg/utils/statsutils/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/utils/statsutils/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/utils/statsutils/metrics/metrics.go Show resolved Hide resolved
@@ -402,6 +408,10 @@ func (s *statusSyncer) extractCurrentReports() (reporter.ResourceReports, map[re
func (s *statusSyncer) syncStatus(ctx context.Context) error {
allReports, inputResourceBySubresourceStatuses, localInputResourceLastStatus := s.extractCurrentReports()

// reset the resource metrics before we write the new statuses, this ensures that
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cant this cause flickering to an operator?
Why cant we set new statuses and then clear all out of compliance ones after?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not able to clear out metrics at the label/attribute level. Only at the metric-level (view), which includes all labels).

@ryanrolds
Copy link
Author

I dont quite understand the rationale for clearing all statuses rather than doing something a bit more targeted and am concerned it could cause some flicker in metrics.

The API provided by the metrics packages just don't provide the level of access needed to filter/remove specific label combinations from the metrics/views. I wish it wasn't that way. The same issue exists in the package we would like to move to, but it won't require a sleep reducing the data race we are both concerned about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants