Alert on missing metrics for enabled exporters #85

Closed
aieri opened this issue Nov 6, 2023 · 5 comments
Labels
bug (Something isn't working)

Comments

aieri commented Nov 6, 2023

If we autodetect the presence of a specific piece of hardware and the collector is enabled, we should expect the corresponding metrics to be present. If they are not (due to a bug, or to malfunctioning hardware that is no longer detected by the kernel), we should produce an alert.
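
As a rough sketch of what such an alert could look like in a Prometheus rule file (hypothetical names: it assumes ipmi_temperature_celsius from the IPMI exporter as the watched metric and the juju_application topology label that grafana-agent attaches; the real rule would need to use whatever names the charm actually exposes):

groups:
  - name: hardware-observer-missing-metrics
    rules:
      - alert: IpmiMetricsMissing
        # absent() returns a one-element vector only when no series with this
        # name exists at all, i.e. the collector is enabled but nothing is
        # actually being scraped from the hardware.
        expr: absent(ipmi_temperature_celsius{juju_application="hardware-observer"})
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: IPMI metrics are missing although the IPMI collector is enabled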

aieri added the 'enhancement' (New feature or request) label on Nov 6, 2023

aieri commented Nov 6, 2023

We should also confirm that we only run the autodetection routine upon installation: we don't want to start disabling exporters if a RAID array vanishes because it's faulty.

Pjack added the 'bug' (Something isn't working) label and removed the 'enhancement' (New feature or request) label on Mar 28, 2024
@gabrielcocenza
Member

Created PR canonical/grafana-agent-operator#134 against grafana-agent to add a rule that checks for physical disks that have been removed.

For NICs we need to discuss the strategy a bit more, because right now there isn't an easy metric to tell whether a NIC is physical or not.
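
For reference, one way such a disk rule could look (a sketch only, not necessarily what canonical/grafana-agent-operator#134 does; it assumes node-exporter's node_disk_info metric and the juju_unit topology label are available, and uses the same rule-group layout as the sketch above):

      - alert: PhysicalDiskDisappeared
        # Fewer disk device series now than an hour ago for the same unit,
        # e.g. a drive dropped off the bus or was physically removed.
        # Note: this only catches partial loss; if a unit loses all disks the
        # left-hand count has no series, so an absent()-style rule is still
        # needed for the total-loss case.
        expr: >
          count by (juju_unit) (node_disk_info{device=~"(sd|nvme).*"})
          <
          count by (juju_unit) (node_disk_info{device=~"(sd|nvme).*"} offset 1h)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: A disk that was reporting an hour ago is no longer present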

pengwyn commented Jun 26, 2024

+1 for this issue. I've just noticed that on one cloud we have completely lost Redfish monitoring due to some problem with the BMCs, and we had no alert that anything was wrong.

Although in the case I'm looking at, I wonder if the charm is interpreting an error code from the IPMI checks as "there is no IPMI here" and then reconfiguring itself. Each unit is logging something like:

INFO unit.hardware-observer/0.juju-log IPMI sensors monitoring is not available
WARNING unit.hardware-observer/0.update-status ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-norman1-ceph1.localhost: internal IPMI error
INFO unit.hardware-observer/0.juju-log IPMI SEL monitoring is not available
WARNING unit.hardware-observer/0.update-status ipmi_cmd_dcmi_get_power_reading: bad completion code
INFO unit.hardware-observer/0.juju-log IPMI DCMI monitoring is not available
WARNING unit.hardware-observer/0.update-status Get Device ID command failed: 0xc0 Node busy
ERROR unit.hardware-observer/0.juju-log unexpected error occurs when connecting to redfish: HTTPSConnectionPool(host='none', port=443): Max retries exceeded with url: /redfish/v1/ (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f6267e84b50>: Failed to resolve 'none' ([Errno -3] Temporary failure in name resolution)"))
INFO unit.hardware-observer/0.juju-log Redfish is not available

aieri commented Nov 14, 2024

canonical/grafana-agent-operator#147 takes care of some use cases: if an exporter goes down suddenly, an alert will be fired. It does not, however, cover the case of an exporter no longer being able to provide metrics about the underlying hardware, for example if the BMC dies but the IPMI exporter keeps running.
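
A sketch of a rule that would cover that gap, under the same assumptions as the sketches above (ipmi_temperature_celsius as a representative hardware metric, and juju topology labels present on both the up series and the exporter's metrics):

      - alert: ExporterUpButNoHardwareMetrics
        # The scrape target is healthy (up == 1) but no hardware-level series
        # exist for that unit, e.g. the BMC died while the exporter process
        # kept answering scrapes with empty data.
        expr: |
          (up{juju_application="hardware-observer"} == 1)
          unless on (juju_unit)
          count by (juju_unit) (ipmi_temperature_celsius)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Exporter is up but exposes no hardware metrics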

Pjack commented Dec 16, 2024

We have made some enhancements. However, it is difficult to track this issue without a well-defined problem statement.
I'd like to close this ticket for now; we can reopen it if a specific issue arises again. Thanks!

Pjack closed this as completed on Dec 16, 2024