Redfish sensor-based alerts have no duration, so they can flap when the value fluctuates around the max value. #146

Closed
err404r opened this issue Dec 21, 2023 · 6 comments · Fixed by #194
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@err404r

err404r commented Dec 21, 2023

I ran into two issues today. The first is behavior similar to Issue #112: this time, fan readings are flapping from None to a value and back, on the same server.

This generates a huge number of alerts. I will probably add a small duration to the alert rule, from 1 to 3 minutes,
to avoid false positives caused by a flapping Redfish API or something similar.
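
To illustrate why a duration helps, here is a minimal Python sketch of a hold duration on an alert condition. It is a simplified model, not the monitoring stack's actual implementation; the 60-second sample interval, the fan readings, and the function names are all assumptions made up for illustration.

```python
from typing import Optional, Sequence


def firing_states(bad: Sequence[bool], interval_s: int, hold_s: int) -> list[bool]:
    """Return, per sample, whether the alert is firing.

    The alert fires only once the condition has been continuously true
    for at least hold_s seconds; any good sample resets the timer.
    """
    states = []
    bad_since: Optional[int] = None  # time at which the current bad streak began
    for i, is_bad in enumerate(bad):
        now = i * interval_s
        if not is_bad:
            bad_since = None
        elif bad_since is None:
            bad_since = now
        states.append(bad_since is not None and now - bad_since >= hold_s)
    return states


def count_alerts(states: Sequence[bool]) -> int:
    """Count transitions from not-firing to firing."""
    previous = [False] + list(states[:-1])
    return sum(1 for prev, cur in zip(previous, states) if cur and not prev)


# Hypothetical fan readings sampled every 60 s; None means the reading was missing.
readings = [4200, None, 4300, None, 4250, None, None, None, None, 4300]
bad = [r is None for r in readings]

no_hold = firing_states(bad, interval_s=60, hold_s=0)
with_hold = firing_states(bad, interval_s=60, hold_s=180)
print(count_alerts(no_hold), count_alerts(with_hold))
```

In this made-up series, the two isolated missing readings each raise an alert when there is no hold, while a 3-minute hold only lets the sustained outage through.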

@jneo8
Contributor

jneo8 commented Dec 25, 2023

Closing and re-opening to trigger the GitHub-to-Jira action.

@jneo8 jneo8 closed this as completed Dec 25, 2023
@jneo8 jneo8 reopened this Dec 25, 2023
@Pjack Pjack added the bug Something isn't working label Dec 26, 2023
@Pjack Pjack added this to the 23.10.3 milestone Dec 26, 2023
@Pjack Pjack added enhancement New feature or request and removed bug Something isn't working labels Dec 26, 2023
@Pjack Pjack modified the milestones: 23.10.3, 23.10.4 Jan 30, 2024
@err404r
Author

err404r commented Jan 30, 2024

The same applies to RedfishStorageControllerHealthNotOk alerts and probably other alerts. We need to take into account that readings from Redfish are not perfectly stable.

@err404r
Author

err404r commented Jan 30, 2024

And to the IPMISensorStateNotOk alert.

@dashmage
Contributor

The Redfish "HealthNotAvailable" alerts have been removed, since in many cases the data might not be available and we wouldn't want an alert for each such case. The original linked issue has also been resolved.

@err404r - would this solve your issues as well?

@err404r
Author

err404r commented Mar 21, 2024

Hi, no, this issue is more general than HealthNotAvailable.
On some motherboards, readings are not 100% stable, so it's OK to periodically get a sensor reading error.
I would say all sensor-based alert rules should have for: 5m in their definition.
This will remove alerts caused by a single bad reading, short temperature spikes, etc., and make our lives much more comfortable.
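
As a rough sketch of what applying this across the board could look like, the Python snippet below adds a default `for` to every alerting rule that lacks one. It assumes the rules are already loaded into a dict shaped like a Prometheus rule file (`groups` containing `rules`); the helper name and the `expr` values are hypothetical placeholders, not the project's real queries, and the 5m default is only the value suggested above.

```python
import copy

DEFAULT_FOR = "5m"  # duration suggested in this comment; not a confirmed project setting


def with_default_for(rule_file: dict, default: str = DEFAULT_FOR) -> dict:
    """Return a copy of rule_file where every alerting rule has a `for` clause.

    Rules that already set `for` are left unchanged; recording rules
    (entries without an `alert` key) are skipped.
    """
    patched = copy.deepcopy(rule_file)
    for group in patched.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" in rule and "for" not in rule:
                rule["for"] = default
    return patched


# Hypothetical rule file; the expr values are placeholders only.
rules = {
    "groups": [
        {
            "name": "sensors",
            "rules": [
                {"alert": "RedfishStorageControllerHealthNotOk", "expr": "placeholder_metric_a == 0"},
                {"alert": "IPMISensorStateNotOk", "expr": "placeholder_metric_b != 0", "for": "1m"},
            ],
        }
    ]
}

patched = with_default_for(rules)
for rule in patched["groups"][0]["rules"]:
    print(rule["alert"], rule["for"])
```

Leaving existing `for` values untouched is a deliberate choice in this sketch, so that rules already tuned to a shorter or longer duration keep their setting.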

@dashmage
Contributor

Thanks for your comment; I understand your intention better now. Let me check in with the rest of the team for their opinion and then pick a suitable duration.

4 participants