Redfish resource health alerts when data not available. #113

Closed
dashmage opened this issue Nov 20, 2023 · 1 comment · Fixed by #131

@dashmage
Contributor

dashmage commented Nov 20, 2023

Currently, the redfish metric labels for the processor resource are created as follows (other redfish resources are handled similarly):
https://github.com/canonical/prometheus-hardware-exporter/blob/main/prometheus_hardware_exporter/collectors/redfish.py#L197

{
    "processor_id": processor["Id"],
    "model": processor["Model"] or "NA",
    "health": processor["Status"]["Health"] or "NA",
    "state": processor["Status"]["State"],
}

And this is what the alert rule corresponding to that metric looks like:
https://github.com/canonical/hardware-observer-operator/blob/master/src/prometheus_alert_rules/redfish.yaml#L40

      - alert: RedfishProcessorHealthNotOk
        expr: redfish_processor_info{health != "OK"}
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Redfish processor health not OK. (instance {{ $labels.instance }})
          description: |
            Redfish processor health not OK.
              LABELS = {{ $labels }}

If the processor health is not available, we assign it the value "NA". Since "NA" is not "OK", this also triggers a critical alert, which is probably unnecessary when the processor health status simply cannot be queried.
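
To make the behaviour concrete, here is a minimal sketch that mirrors the label construction quoted above and emulates the `health != "OK"` matcher in plain Python. The helper names `build_processor_labels` and `fires_health_not_ok` and the sample payload are assumptions made for illustration; they are not taken from the exporter.

```python
def build_processor_labels(processor: dict) -> dict:
    """Build the label set the same way as the snippet quoted in this issue.

    The function name is hypothetical; only the dict body comes from the issue.
    """
    return {
        "processor_id": processor["Id"],
        "model": processor["Model"] or "NA",
        # A missing (None or empty) health value falls back to "NA".
        "health": processor["Status"]["Health"] or "NA",
        "state": processor["Status"]["State"],
    }


def fires_health_not_ok(labels: dict) -> bool:
    """Emulate the alert rule's label matcher health != "OK"."""
    return labels["health"] != "OK"


# Hypothetical payload: the BMC responds but reports no health value.
processor = {
    "Id": "CPU1",
    "Model": None,
    "Status": {"Health": None, "State": "Enabled"},
}

labels = build_processor_labels(processor)
print(labels["health"], fires_health_not_ok(labels))
```

With this payload the health label becomes "NA", so the matcher treats it the same as a genuinely unhealthy value.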

If needed, we could create a separate alert with a lower severity for the case where the resource health data is not available (value "NA"), and exclude "NA" from the critical alert.
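
One way to picture the proposed split is the following sketch, which maps a health label value to an alert severity. The function name `alert_severity` and the choice of "warning" as the lower severity are assumptions for illustration, not something decided in this issue. In rule terms this could correspond to a critical expression along the lines of `redfish_processor_info{health!~"OK|NA"}` and a separate lower-severity one on `redfish_processor_info{health="NA"}`, again only as a suggestion.

```python
from typing import Optional


def alert_severity(health: str) -> Optional[str]:
    """Return the severity an alert would carry for a given health label.

    Assumed policy (not confirmed by the maintainers):
      "OK"          -> no alert
      "NA"          -> lower-severity alert, data could not be queried
      anything else -> critical, the resource reports a real problem
    """
    if health == "OK":
        return None
    if health == "NA":
        return "warning"  # assumed lower severity level
    return "critical"


for value in ("OK", "NA", "Critical"):
    print(value, "->", alert_severity(value))
```

Keeping the "NA" case as its own alert still surfaces hosts whose health cannot be read, without paging anyone as if the hardware had failed.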

@dashmage dashmage added the bug Something isn't working label Nov 20, 2023
@Pjack Pjack added this to the 23.10.2 milestone Nov 30, 2023
@dashmage
Contributor Author

dashmage commented Dec 4, 2023

#112 is related to this.
