Alert on missing metrics for enabled exporters #85
Comments
We should also confirm that the autodetection routine runs only at installation: we don't want to start disabling exporters when a RAID array vanishes because it's faulty.
Created PR canonical/grafana-agent-operator#134 on grafana-agent to add a rule that checks for physical disks that have been removed. For NICs we need to discuss the strategy a bit more, because right now there is no easy metric that tells us whether a NIC is physical or not.
+1 for this issue. I've just noticed that on one cloud we have completely lost Redfish monitoring because of a problem with the BMCs, and no alert told us there was a problem. In the case I'm looking at, though, I wonder whether the charm is interpreting an error code from the IPMI checks as "there is no IPMI here" and then reconfiguring itself. Each unit is saying something like:
canonical/grafana-agent-operator#147 takes care of some use cases: if an exporter suddenly goes down, an alert is fired. It does not, however, cover the case of an exporter that keeps running but can no longer provide metrics about the underlying hardware. For example: the BMC dies but the IPMI exporter continues working.
We made some enhancements. However, it is difficult to track this issue without a well-defined problem statement.
If we autodetect the presence of a specific piece of hardware and the corresponding collector is enabled, we should expect the related metrics to be present. If they are not (because of a bug, or because malfunctioning hardware is no longer detected by the kernel), we should produce an alert.
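To make the expectation concrete, here is a rough sketch of the check described above. It is not the charm's code: the collector names, the metric prefixes in `EXPECTED_PREFIXES`, and the function `collectors_missing_metrics` are all hypothetical, chosen only to illustrate "collector enabled but no matching metrics scraped".

```python
from typing import Dict, Iterable, List

# Hypothetical mapping: enabled collector -> prefix its metrics are assumed to share.
# Real exporters may use different names; adjust to whatever is actually exposed.
EXPECTED_PREFIXES: Dict[str, str] = {
    "ipmi": "ipmi_",
    "redfish": "redfish_",
    "raid": "raid_",
}


def collectors_missing_metrics(
    enabled: Iterable[str], scraped_metric_names: Iterable[str]
) -> List[str]:
    """Return enabled collectors for which no metric with the expected prefix was scraped."""
    names = list(scraped_metric_names)
    missing = []
    for collector in enabled:
        prefix = EXPECTED_PREFIXES[collector]
        if not any(name.startswith(prefix) for name in names):
            missing.append(collector)
    return sorted(missing)


# The exporter is still up, but Redfish metrics are gone (e.g. the BMC died).
stale = collectors_missing_metrics(
    enabled=["ipmi", "redfish"],
    scraped_metric_names=["up", "ipmi_example_metric"],
)
```

Each collector returned here would be a candidate for an alert. In a Prometheus setup the same idea could probably be expressed as an alert rule built on the PromQL `absent()` function, with the set of enabled collectors fixed at installation time so that vanished hardware raises an alert instead of silently disabling its exporter.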