Opsgenie evaluation, alerts forwarding idea, decision

* Respect OnCall engineers' sanity and alert only once or as few times as possible
** Aggregate individual cluster alerts into one single alert
** Only send out alerts if any alert is firing for a certain period of time
** Suppress cluster alerts during the maintenance window
* SLA relevant alerts shouldn't be suppressed in any form

=== Non-goals

=== Option 1: Use centralized Mimir / Grafana

The upgrade controller monitors the cluster's health and can emit metrics on the current state of the maintenance process.
Alternatively, recording rules could be used to create the necessary metric time series.
We can send these few metrics to our centralized Mimir instance and implement alerting there.

Using metrics from the upgrade controller or recording rules eliminates the need to send a vast amount of different metrics to our centralized Mimir instance.
The Prometheus `ALERTS` metric is also a recording rule under the hood.
It should be possible to remote write this metric to our centralized Mimir instance.
This would allow us to build alerting dashboards and meta-alerts with minimal additional work and transmitted data.
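
For illustration, a `remote_write` configuration along the following lines could forward just the `ALERTS` series and the upgrade-controller metrics; the Mimir URL and the `upgrade_controller_` metric prefix are assumptions, not actual names.
On OpenShift 4 the equivalent settings would live in the cluster monitoring configuration rather than a raw Prometheus configuration file.

[source,yaml]
----
# Sketch: forward only the alert state and upgrade-controller metrics to the
# centralized Mimir instance, dropping everything else to minimize transmitted data.
remote_write:
  - url: https://mimir.example.com/api/v1/push   # hypothetical Mimir endpoint
    write_relabel_configs:
      # Keep only the ALERTS series and metrics with the assumed
      # upgrade_controller_ prefix.
      - source_labels: [__name__]
        regex: "ALERTS|upgrade_controller_.*"
        action: keep
----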

=== Option 2: Use centralized Grafana and remote Datasources

Configure our centralized Grafana to access every cluster's Prometheus as a data source.
Alert on metrics from all data sources using Grafana.

This approach can also be used to monitor clusters that use a different upgrade mechanism, not only OpenShift 4 clusters.
Accessing the Prometheus instances from outside the cluster might be difficult for some customers with air-gapped setups, and we would need a way to expose the Prometheus API to the outside.

Using alerts managed by Grafana would be different from the current approach of using Prometheus Alertmanager.
It would require additional integration work with Opsgenie.
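
For illustration, provisioning the per-cluster Prometheus instances as Grafana data sources could look roughly like the sketch below; the cluster names and URLs are placeholders and assume the Prometheus APIs are reachable from Grafana.

[source,yaml]
----
# Sketch of a Grafana data source provisioning file (one entry per cluster).
apiVersion: 1
datasources:
  - name: prometheus-c-example-1          # hypothetical cluster name
    type: prometheus
    access: proxy
    url: https://prometheus.c-example-1.example.com   # requires exposing the Prometheus API
    jsonData:
      httpMethod: POST
  - name: prometheus-c-example-2
    type: prometheus
    access: proxy
    url: https://prometheus.c-example-2.example.com
    jsonData:
      httpMethod: POST
----

Grafana-managed alert rules would then have to be defined against each of these data sources, which is the additional integration work mentioned above.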

=== Option 3: Use Opsgenie

Opsgenie has some options to filter and group alerts together.
Special routes can be configured based on alert labels to wait for a specified time before alerting an OnCall engineer.

==== Grouping Alerts using Opsgenie aliases

There is a possibility to group alerts together using https://support.atlassian.com/opsgenie/docs/what-is-alert-de-duplication/[Opsgenie aliases].

Alertmanager https://github.com/prometheus/alertmanager/issues/1598[does not allow] control over this field currently.
We would need a proxy between Alertmanager and Opsgenie to set the alias field.
The configuration seems to be quite complex and error-prone.
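
For illustration, such a hypothetical proxy could rewrite the create-alert payload and set a shared `alias` per cluster maintenance, so Opsgenie de-duplicates the individual alerts into one open alert.
The alias scheme, cluster name and alert names below are made up for this sketch; the payload is shown as YAML for readability but would be sent as JSON to the Opsgenie Alert API.

[source,yaml]
----
# Sketch of an alert-creation payload the proxy could send to the Opsgenie Alert API.
# All alerts sharing the same alias are de-duplicated into a single Opsgenie alert.
message: "Maintenance alerts firing on cluster c-example-1"   # hypothetical cluster name
alias: "maintenance/c-example-1"                              # shared de-duplication key
priority: P4
tags:
  - maintenance
details:
  firing_alerts: "KubeNodeNotReady, TargetDown"               # example alert names
----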

==== Maintenance Window

There is a possibility to configure a maintenance window for specific alerts.
During this time period, a notification policy can delay notifications or auto-close the alert.

This does not solve the grouping issue.

==== Incident Creation

There is a possibility to create incidents automatically based on alert labels.
This could allow us to create a "cluster maintenance" incident, with low priority, and add all alerts that are firing to it.
Automatically closing the incident is not possible; it would need to be done manually.
I couldn't find a way to delay alerts for a certain time period.

The incident creation seems to be quite buggy.
I could ACK an incident and it would still be shown as "unacknowledged" in the UI.

This does solve the grouping issue, but not the issue of delaying alerts until the maintenance window ends.

== Decision

We decided to go with option 1 and use a centralized Mimir / Grafana.

== Rationale

We already use a centralized Mimir instance for billing and SLOs.
Forwarding upgrade-controller metrics and alerts to Mimir should be minimal additional work.
Using Mimir, we can also configure meta-alerts with PromQL and Alertmanager, both technologies we already know and use; a sketch of such a meta-alert follows below.
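
As a minimal sketch of such a meta-alert, assuming the `ALERTS` metric is remote-written as described in option 1 and the forwarded series carry a `cluster` label (the label name, threshold and delay are assumptions):

[source,yaml]
----
# Sketch of a Mimir ruler rule group: one aggregated alert per cluster,
# which only fires if cluster alerts keep firing for 30 minutes.
groups:
  - name: cluster-maintenance-meta-alerts
    rules:
      - alert: ClusterAlertsFiringDuringMaintenance
        expr: |
          count by (cluster) (ALERTS{alertstate="firing"}) > 0
        for: 30m                      # example delay, tune to the maintenance window
        labels:
          severity: warning           # hypothetical label value
        annotations:
          summary: "Alerts are still firing on cluster {{ $labels.cluster }}."
----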
