Alert on missing metrics for enabled exporters #85

Closed
aieri opened this issue Nov 6, 2023 · 5 comments
Labels
bug (Something isn't working)

Comments

aieri commented Nov 6, 2023

If we autodetect the presence of a specific piece of hardware and the collector is enabled, we should expect the corresponding metrics to be present. If they are not (due to a bug, or to malfunctioning hardware that is no longer detected by the kernel), we should produce an alert.
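
As a rough sketch of what such an alert could look like in a Prometheus rule file (hypothetical names: it assumes ipmi_temperature_celsius from the IPMI exporter as the watched metric and the juju_application topology label that grafana-agent attaches; the real rule would need to use whatever names the charm actually exposes):

groups:
  - name: hardware-observer-missing-metrics
    rules:
      - alert: IpmiMetricsMissing
        # absent() returns a one-element vector only when no series with this
        # name exists at all, i.e. the collector is enabled but nothing is
        # actually being scraped from the hardware.
        expr: absent(ipmi_temperature_celsius{juju_application="hardware-observer"})
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: IPMI metrics are missing although the IPMI collector is enabled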

aieri added the 'enhancement' (New feature or request) label on Nov 6, 2023

aieri commented Nov 6, 2023

We should also confirm that we only run the autodetection routine upon installation: we don't want to start disabling exporters if a RAID array vanishes because it's faulty.

Pjack added the 'bug' (Something isn't working) label and removed the 'enhancement' (New feature or request) label on Mar 28, 2024
@gabrielcocenza
Member

Created PR canonical/grafana-agent-operator#134 against grafana-agent to add a rule that checks for physical disks that have been removed.

For NICs we need to discuss the strategy a bit more, because right now there isn't an easy metric to tell whether a NIC is physical or not.
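
For reference, one way such a disk rule could look (a sketch only, not necessarily what canonical/grafana-agent-operator#134 does; it assumes node-exporter's node_disk_info metric and the juju_unit topology label are available, and uses the same rule-group layout as the sketch above):

      - alert: PhysicalDiskDisappeared
        # Fewer disk device series now than an hour ago for the same unit,
        # e.g. a drive dropped off the bus or was physically removed.
        # Note: this only catches partial loss; if a unit loses all disks the
        # left-hand count has no series, so an absent()-style rule is still
        # needed for the total-loss case.
        expr: >
          count by (juju_unit) (node_disk_info{device=~"(sd|nvme).*"})
          <
          count by (juju_unit) (node_disk_info{device=~"(sd|nvme).*"} offset 1h)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: A disk that was reporting an hour ago is no longer present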

pengwyn commented Jun 26, 2024

+1 for this issue. I've just noticed that on one cloud we have completely lost Redfish monitoring due to some problem with the BMCs, and we had no alert that anything was wrong.

Although in the case I'm looking at, I wonder if the charm is interpreting an error code from the IPMI checks as "there is no IPMI here" and then reconfiguring itself. Each unit is logging something like:

INFO unit.hardware-observer/0.juju-log IPMI sensors monitoring is not available
WARNING unit.hardware-observer/0.update-status ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-norman1-ceph1.localhost: internal IPMI error
INFO unit.hardware-observer/0.juju-log IPMI SEL monitoring is not available
WARNING unit.hardware-observer/0.update-status ipmi_cmd_dcmi_get_power_reading: bad completion code
INFO unit.hardware-observer/0.juju-log IPMI DCMI monitoring is not available
WARNING unit.hardware-observer/0.update-status Get Device ID command failed: 0xc0 Node busy
ERROR unit.hardware-observer/0.juju-log unexpected error occurs when connecting to redfish: HTTPSConnectionPool(host='none', port=443): Max retries exceeded with url: /redfish/v1/ (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f6267e84b50>: Failed to resolve 'none' ([Errno -3] Temporary failure in name resolution)"))
INFO unit.hardware-observer/0.juju-log Redfish is not available

aieri commented Nov 14, 2024

canonical/grafana-agent-operator#147 takes care of some use cases: if an exporter goes down suddenly, an alert will be fired. It does not, however, cover the case of an exporter no longer being able to provide metrics about the underlying hardware, for example if the BMC dies but the IPMI exporter keeps running.
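
A sketch of a rule that would cover that gap, under the same assumptions as the sketches above (ipmi_temperature_celsius as a representative hardware metric, and juju topology labels present on both the up series and the exporter's metrics):

      - alert: ExporterUpButNoHardwareMetrics
        # The scrape target is healthy (up == 1) but no hardware-level series
        # exist for that unit, e.g. the BMC died while the exporter process
        # kept answering scrapes with empty data.
        expr: |
          (up{juju_application="hardware-observer"} == 1)
          unless on (juju_unit)
          count by (juju_unit) (ipmi_temperature_celsius)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Exporter is up but exposes no hardware metrics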

Pjack commented Dec 16, 2024

We have made some enhancements. However, it is difficult to track this issue without a well-defined problem statement.
I'd like to close this ticket for now; we can reopen it if a specific issue arises again. Thanks!

Pjack closed this as completed on Dec 16, 2024