[Question]: How to handle Watchdog, which are always critical #61

Open
wattebausch opened this issue Nov 27, 2024 · 7 comments

@wattebausch

Ask a question

Hello everyone,

We use Prometheus for our Kubernetes cluster. As described in your article (https://blog.netways.de/blog/2023/07/25/check_prometheus-ist-jetzt-oeffentlich-verfuegbar/), we define all rules there, and we use the default rules such as general.rules.

We would like to use your plugin, but we have a question about how you handle the Watchdog alert, which is enabled by default:
https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
(If not firing then it should alert external systems that this alerting system is no longer working.)

If we now query ‘alerts’ with the plugin, the result is always critical, because Watchdog is always firing.
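To illustrate the problem, here is a minimal Python sketch. It assumes a payload shaped like the response of Prometheus's `/api/v1/alerts` endpoint and a naive mapping in which any firing alert yields exit code 2. The function name `naive_status` and the sample values are made up for illustration; this is not the plugin's actual code.

```python
# Nagios-style exit codes
OK, WARNING, CRITICAL = 0, 1, 2

# Sample payload in the shape of a /api/v1/alerts response (values are illustrative).
SAMPLE = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "Watchdog", "severity": "none"}, "state": "firing"},
        ]
    },
}


def naive_status(payload):
    """Assumed mapping: firing -> CRITICAL, pending -> WARNING, otherwise OK."""
    states = {alert["state"] for alert in payload["data"]["alerts"]}
    if "firing" in states:
        return CRITICAL
    if "pending" in states:
        return WARNING
    return OK


# Watchdog fires by design, so a healthy cluster still reports CRITICAL.
print(naive_status(SAMPLE))
```

Under this mapping the only way to get OK is for Watchdog to stop firing, which is exactly the situation that should raise an alarm.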

Thanks for your plugin, it looks great.

@wattebausch wattebausch added the question Further information is requested label Nov 27, 2024
@martialblog
Member

Hi, thanks for the feedback.

I think you raise an interesting point regarding "dead man's switch" alerts, which is a common use case.
Not sure yet how the plugin should handle this, but I'll give it some thought.

Any ideas are welcome.

Cheers,
Markus

@martialblog martialblog added this to the v0.3.0 milestone Nov 28, 2024
@martialblog martialblog added the feature New feature or request label Nov 28, 2024
@martialblog
Member

Just offloading some thoughts:

  • Maybe the alert subcheck could have a 'filter/exclude' flag, to get ALL alerts except the ones that match the filter
  • Maybe the alert subcheck could have flags that change/flip the exit codes? A bit strange, but the use case here is clear
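Both ideas can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `worst_state` and `flipped`, the regex-based matching on the alert name, and the alert `HypotheticalDiskFull` are invented here and do not describe the plugin's flags or implementation.

```python
import re

# Nagios-style exit codes
OK, WARNING, CRITICAL = 0, 1, 2


def worst_state(alerts, exclude=()):
    """Return the worst exit code over all alerts whose name matches no exclude pattern."""
    patterns = [re.compile(p) for p in exclude]
    code = OK
    for alert in alerts:
        name = alert["labels"]["alertname"]
        if any(p.fullmatch(name) for p in patterns):
            continue  # excluded alerts do not influence the result
        if alert["state"] == "firing":
            code = max(code, CRITICAL)
        elif alert["state"] == "pending":
            code = max(code, WARNING)
    return code


def flipped(code):
    """Assumed flip semantics: CRITICAL and OK swap places, WARNING stays as it is."""
    return {OK: CRITICAL, WARNING: WARNING, CRITICAL: OK}[code]


alerts = [
    {"labels": {"alertname": "Watchdog"}, "state": "firing"},
    {"labels": {"alertname": "HypotheticalDiskFull"}, "state": "pending"},
]

print(worst_state(alerts))                        # Watchdog dominates the result
print(worst_state(alerts, exclude=["Watchdog"]))  # only the remaining alerts count
print(flipped(worst_state(alerts[:1])))           # a firing Watchdog counts as healthy
```

With the exclude approach, Watchdog is simply ignored. With the flip approach, a separate check that covers only Watchdog would go critical when the alert stops firing.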

@wattebausch
Author

I find the idea of excluding useful, but there shouldn't be any ‘firing’ alerts in Prometheus (except Watchdog). Perhaps we can differentiate between the two.

Watchdog

How do other alert managers deal with this? Is there a standard? Or hard-code it: if Watchdog exists, flip its state, and otherwise do not consider it?

Feature "Filter/Exclude"

Why it is useful: see prometheus-community/helm-charts#5025. The default ruleset ‘Prometheus Rules’ contains the rule ‘PrometheusNotConnectedToAlertmanagers’ and others, which fire because I have deactivated the Alertmanager. My solution: I now maintain this ruleset myself and have deleted 5 of its rules.
Why else would it make sense? We allow our developers to write and deploy rules themselves. Prometheus picks up every ‘kind: PrometheusRule’ and uses it, so it would probably be helpful to be able to exclude some of them.

My values for "kube-prometheus-stack" (shortened):

values:
    alertmanager:
      enabled: false
    ...
    defaultRules:
      rules:
        alertmanager: false
        ...
        prometheus: false
    ...

Many of these are set to false because we use k3s.

@martialblog
Member

I've been thinking about this some more: a mixture of filter/exclude flags and a flag to define the expected alert state should do it. I will do some experiments this week and report back.
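One possible reading of an "expected alert state" is sketched below in Python. It is an assumption-laden illustration, not the planned implementation: the function `check_expected` and its `expected` mapping are hypothetical, an alert missing from the input is treated as inactive, and only one instance per alert name is considered.

```python
# Nagios-style exit codes (WARNING omitted for brevity)
OK, CRITICAL = 0, 2


def check_expected(alerts, expected):
    """Compare alert states against expectations.

    `expected` maps an alert name to the state it should be in,
    e.g. {"Watchdog": "firing"}. Alerts not listed there are expected
    not to fire. Returns (exit_code, list_of_problem_messages).
    """
    # Simplification: one state per alert name; later entries overwrite earlier ones.
    seen = {a["labels"]["alertname"]: a["state"] for a in alerts}
    problems = []
    for name, want in expected.items():
        got = seen.get(name, "inactive")  # assumption: absent means inactive
        if got != want:
            problems.append(f"{name} is {got}, expected {want}")
    for name, got in seen.items():
        if name not in expected and got == "firing":
            problems.append(f"{name} is firing")
    return (CRITICAL if problems else OK), problems


healthy = [{"labels": {"alertname": "Watchdog"}, "state": "firing"}]
print(check_expected(healthy, {"Watchdog": "firing"}))  # no problems reported
print(check_expected([], {"Watchdog": "firing"}))       # the dead man's switch has stopped
```

In this form a silent Watchdog becomes the critical condition, while every other alert keeps its usual meaning.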

@martialblog martialblog removed the question Further information is requested label Dec 16, 2024
@martialblog martialblog self-assigned this Dec 16, 2024
@martialblog
Member

@wattebausch I started to work on this, you can check out the code here #67

This adds the option to exclude certain alerts from the list.

I'm not 100% sure yet whether this is sufficient. A Watchdog/dead man's switch is itself a kind of "meta monitoring", so one could argue that it checks itself: you want to get an alert when it stops doing its work.

Am I wrong? Does this make sense?

@martialblog
Member

@RincewindsHat any thoughts?

@wattebausch
Author

> @wattebausch I started to work on this, you can check out the code here #67

It works from the command line.

> Does this make sense?

Yes. Possibly add that as a later feature and work with the exclude for now?
