From 80b187bb0d9998e21d41d2b17850f93fe3ee566e Mon Sep 17 00:00:00 2001
From: Sebastian Widmer
Date: Wed, 5 Apr 2023 16:34:17 +0200
Subject: [PATCH] Opsgenie evaluation, alerts forwarding idea, decision

---
 .../decisions/maintenance-alerts.adoc | 49 ++++++++++++++++---
 1 file changed, 41 insertions(+), 8 deletions(-)

diff --git a/docs/modules/ROOT/pages/explanations/decisions/maintenance-alerts.adoc b/docs/modules/ROOT/pages/explanations/decisions/maintenance-alerts.adoc
index 6ccbc083..228afb2d 100644
--- a/docs/modules/ROOT/pages/explanations/decisions/maintenance-alerts.adoc
+++ b/docs/modules/ROOT/pages/explanations/decisions/maintenance-alerts.adoc
@@ -12,8 +12,8 @@ There's multiple potential approaches to do so, see section <<_proposals,proposa
 * Respect OnCall engineers sanity and alert only once or as few times as possible
 ** Aggregate individual cluster alerts into one single alert
 ** Only send out alerts if any alert is firing for a certain period of time
-** Supress cluster alerts during the maintenance window
-* SLA relevant alerts shouldn't be supressed in any form
+** Suppress cluster alerts during the maintenance window
+* SLA-relevant alerts shouldn't be suppressed in any form
 
 === Non-goals
 
@@ -24,28 +24,61 @@ There's multiple potential approaches to do so, see section <<_proposals,proposa
 === Option 1: Use centralized Mimir / Grafana
 
 The upgrade controller is monitoring the clusters health and can emit metrics on the current state of the maintenance process.
-Alternatively record rules could be used to create necessary metric timeseries.
 
 We can send these few metrics to our centralized Mimir instance and implement alerting there.
-Using metrics from the upgrade controller or record rulesd eliminates the need for sending a wast amount of different metrics to our centralized Mimir instance.
+Alternatively, recording rules could be used to create the necessary metric time series.
+The Prometheus `ALERTS` metric is effectively a recording rule under the hood.
+It should be possible to remote-write this metric to our centralized Mimir instance.
+This would allow us to build alerting dashboards and meta-alerts with minimal additional work and transmitted data.
 
 === Option 2: Use centralized Grafana and remote Datasources
 
-Configure our centralized Grafana to access every clusters Prometheus as datasources.
-Alert based on metrics from all datasources by Grafana.
+Configure our centralized Grafana to access every cluster's Prometheus as a data source.
+Alert on metrics from all data sources in Grafana.
 
-This approach can be used to also monitor clusters that use different upgrade mechanism, not only OpenShift 4 clusters.
+Accessing the Prometheus instances from outside the cluster might be difficult for some customers with air-gapped setups, and we would need a way to expose the Prometheus API to the outside.
 
-Accessing the Prometheus instances from outside the cluster might be difficult for some customers with airgapped setups.
+Using alerts managed by Grafana would differ from the current approach of using the Prometheus Alertmanager.
+It would require additional integration work with Opsgenie.
 
 === Option 3: Use Opsgenie
 
 Opsgenie has some options to filter and group alerts together.
 Special routes can be configured based on alert labels to wait for a specified time before alerting an OnCall engineer.
 
+==== Grouping Alerts using Opsgenie aliases
+
+There is a possibility to group alerts together using https://support.atlassian.com/opsgenie/docs/what-is-alert-de-duplication/[Opsgenie aliases].
+
+Alertmanager currently https://github.com/prometheus/alertmanager/issues/1598[does not allow] control over this field.
+We would need a proxy between Alertmanager and Opsgenie to set the alias field.
+The configuration seems to be quite complex and error-prone.
+
+==== Maintenance Window
+
+There is a possibility to configure a maintenance window for specific alerts.
+During this time period a notification policy can delay alerting or auto-close the alert.
+
+This does not solve the grouping issue.
+
+==== Incident Creation
+
+There is a possibility to create incidents automatically based on alert labels.
+This could allow us to create a low-priority "cluster maintenance" incident and add all firing alerts to it.
+Closing the incident automatically is not possible; it would need to be done manually.
+I couldn't find a way to delay alerts for a certain time period.
+
+The incident creation seems to be quite buggy.
+I could acknowledge an incident and it would still be shown as "unacknowledged" in the UI.
+
+This does solve the grouping issue, but not the maintenance window end issue.
 
 == Decision
+We decided to go with option 1 and use a centralized Mimir / Grafana.
 
 == Rationale
+We already use a centralized Mimir instance for billing and SLOs.
+Forwarding upgrade-controller metrics and alerts to Mimir should be minimal additional work.
+With Mimir we can also configure meta-alerts using PromQL and Alertmanager, both technologies we already know and use.
 
 
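As an illustration of the chosen option (not part of the patch itself): a minimal sketch of forwarding only the `ALERTS` series to a centralized Mimir via Prometheus `remote_write`. The Mimir push URL and the `upgrade_controller_.*` metric prefix are assumptions for illustration, not values taken from this repository.

[source,yaml]
----
# Cluster Prometheus remote_write excerpt (sketch):
# keep only the synthetic ALERTS series and upgrade-controller metrics,
# drop everything else before it leaves the cluster.
remote_write:
  - url: https://mimir.example.com/api/v1/push   # hypothetical Mimir endpoint
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'ALERTS|upgrade_controller_.*'    # metric names are assumptions
        action: keep
----

A meta-alert evaluated centrally could then aggregate the forwarded series so the OnCall engineer is paged at most once per cluster, for example via a rule group loaded into the Mimir ruler. The `cluster` label, the 30-minute delay, and the severity are likewise assumptions:

[source,yaml]
----
# Mimir ruler rule group (sketch, hypothetical labels and thresholds).
groups:
  - name: maintenance-meta-alerts
    rules:
      - alert: ClusterAlertsFiringTooLong
        # Fires once per cluster when any forwarded alert keeps firing for 30 minutes.
        expr: count by (cluster) (ALERTS{alertstate="firing"}) > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Cluster {{ $labels.cluster }} has had firing alerts for more than 30 minutes."
----

Using an explicit `keep` allow-list means forwarding an additional series to Mimir is always a deliberate configuration change, which keeps the transmitted data minimal as the decision intends.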