Opsgenie evaluation, alerts forwarding idea, decision

* Respect OnCall engineers' sanity and alert only once or as few times as possible
** Aggregate individual cluster alerts into one single alert
** Only send out alerts if any alert is firing for a certain period of time
** Suppress cluster alerts during the maintenance window
* SLA relevant alerts shouldn't be suppressed in any form

=== Non-goals

=== Option 1: Use centralized Mimir / Grafana

The upgrade controller monitors the cluster's health and can emit metrics on the current state of the maintenance process.
Alternatively, recording rules could be used to create the necessary metric time series.
We can send these few metrics to our centralized Mimir instance and implement alerting there.

Using metrics from the upgrade controller or recording rules eliminates the need to send a vast amount of different metrics to our centralized Mimir instance.
The Prometheus `ALERTS` metric is also a recording rule under the hood.
It should be possible to remote write this metric to our centralized Mimir instance.
This would allow us to build alerting dashboards and meta-alerts with minimal additional work and transmitted data.
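
For illustration, a `remote_write` configuration along the following lines could forward just the `ALERTS` series and the upgrade-controller metrics; the Mimir URL and the `upgrade_controller_` metric prefix are assumptions, not actual names.
On OpenShift 4 the equivalent settings would live in the cluster monitoring configuration rather than a raw Prometheus configuration file.

[source,yaml]
----
# Sketch: forward only the alert state and upgrade-controller metrics to the
# centralized Mimir instance, dropping everything else to minimize transmitted data.
remote_write:
  - url: https://mimir.example.com/api/v1/push   # hypothetical Mimir endpoint
    write_relabel_configs:
      # Keep only the ALERTS series and metrics with the assumed
      # upgrade_controller_ prefix.
      - source_labels: [__name__]
        regex: "ALERTS|upgrade_controller_.*"
        action: keep
----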

=== Option 2: Use centralized Grafana and remote Datasources

Configure our centralized Grafana to access every cluster's Prometheus as a data source.
Alert on metrics from all data sources using Grafana.

This approach can also be used to monitor clusters that use a different upgrade mechanism, not only OpenShift 4 clusters.
Accessing the Prometheus instances from outside the cluster might be difficult for some customers with air-gapped setups, and we would need a way to expose the Prometheus API to the outside.

Using alerts managed by Grafana would be different from the current approach of using Prometheus Alertmanager.
It would require additional integration work with Opsgenie.
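
For illustration, provisioning the per-cluster Prometheus instances as Grafana data sources could look roughly like the sketch below; the cluster names and URLs are placeholders and assume the Prometheus APIs are reachable from Grafana.

[source,yaml]
----
# Sketch of a Grafana data source provisioning file (one entry per cluster).
apiVersion: 1
datasources:
  - name: prometheus-c-example-1          # hypothetical cluster name
    type: prometheus
    access: proxy
    url: https://prometheus.c-example-1.example.com   # requires exposing the Prometheus API
    jsonData:
      httpMethod: POST
  - name: prometheus-c-example-2
    type: prometheus
    access: proxy
    url: https://prometheus.c-example-2.example.com
    jsonData:
      httpMethod: POST
----

Grafana-managed alert rules would then have to be defined against each of these data sources, which is the additional integration work mentioned above.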

=== Option 3: Use Opsgenie

Opsgenie has some options to filter and group alerts together.
Special routes can be configured based on alert labels to wait for a specified time before alerting an OnCall engineer.

==== Grouping Alerts using Opsgenie aliases

There is a possibility to group alerts together using https://support.atlassian.com/opsgenie/docs/what-is-alert-de-duplication/[Opsgenie aliases].

Alertmanager https://github.com/prometheus/alertmanager/issues/1598[does not allow] control over this field currently.
We would need a proxy between Alertmanager and Opsgenie to set the alias field.
The configuration seems to be quite complex and error-prone.
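
For illustration, such a hypothetical proxy could rewrite the create-alert payload and set a shared `alias` per cluster maintenance, so Opsgenie de-duplicates the individual alerts into one open alert.
The alias scheme, cluster name and alert names below are made up for this sketch; the payload is shown as YAML for readability but would be sent as JSON to the Opsgenie Alert API.

[source,yaml]
----
# Sketch of an alert-creation payload the proxy could send to the Opsgenie Alert API.
# All alerts sharing the same alias are de-duplicated into a single Opsgenie alert.
message: "Maintenance alerts firing on cluster c-example-1"   # hypothetical cluster name
alias: "maintenance/c-example-1"                              # shared de-duplication key
priority: P4
tags:
  - maintenance
details:
  firing_alerts: "KubeNodeNotReady, TargetDown"               # example alert names
----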

==== Maintenance Window

There is a possibility to configure a maintenance window for specific alerts.
During this time period, a notification policy can delay notifications or auto-close the alert.

This does not solve the grouping issue.

==== Incident Creation

There is a possibility to create incidents automatically based on alert labels.
This could allow us to create a "cluster maintenance" incident, with low priority, and add all alerts that are firing to it.
Automatically closing the incident is not possible; it would need to be done manually.
I couldn't find a way to delay alerts for a certain time period.

The incident creation seems to be quite buggy.
I could ACK an incident and it would still be shown as "unacknowledged" in the UI.

This does solve the grouping issue, but not the issue of delaying alerts until the maintenance window ends.

== Decision

We decided to go with option 1 and use a centralized Mimir / Grafana.

== Rationale

We already use a centralized Mimir instance for billing and SLOs.
Forwarding upgrade-controller metrics and alerts to Mimir should be minimal additional work.
Using Mimir, we can also configure meta-alerts with PromQL and Alertmanager, both technologies we already know and use; a sketch of such a meta-alert follows below.
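
As a minimal sketch of such a meta-alert, assuming the `ALERTS` metric is remote-written as described in option 1 and the forwarded series carry a `cluster` label (the label name, threshold and delay are assumptions):

[source,yaml]
----
# Sketch of a Mimir ruler rule group: one aggregated alert per cluster,
# which only fires if cluster alerts keep firing for 30 minutes.
groups:
  - name: cluster-maintenance-meta-alerts
    rules:
      - alert: ClusterAlertsFiringDuringMaintenance
        expr: |
          count by (cluster) (ALERTS{alertstate="firing"}) > 0
        for: 30m                      # example delay, tune to the maintenance window
        labels:
          severity: warning           # hypothetical label value
        annotations:
          summary: "Alerts are still firing on cluster {{ $labels.cluster }}."
----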
