Add an alert to catch mismatch of sts replicas vs expected/ready (#654)
philipgough authored Nov 29, 2023
1 parent 74df668 commit 1ce21b2
Showing 4 changed files with 74 additions and 0 deletions.
34 changes: 34 additions & 0 deletions docs/sop/observatorium.md
@@ -31,6 +31,7 @@
* [ObservatoriumNoRulesLoaded](#observatoriumnorulesloaded)
* [ObservatoriumPersistentVolumeUsageHigh](#observatoriumpersistentvolumeusagehigh)
* [ObservatoriumPersistentVolumeUsageCritical](#observatoriumpersistentvolumeusagecritical)
* [ObservatoriumExpectedReplicasUnavailable](#observatoriumexpectedreplicasunavailable)
* [Observatorium Gubernator Alerts](#observatorium-gubernator-alerts)
* [GubernatorIsDown](#gubernatorisdown)
* [Observatorium Obsctl Reloader Alerts](#observatorium-obsctl-reloader-alerts)
@@ -866,6 +867,39 @@ One or more PVCs are filled to more than 95%. The remaining free space does not
- Locate the affected deployment in the [AppSRE Interface](https://gitlab.cee.redhat.com/service/app-interface/-/tree/master/data/services/rhobs), depending on which namespace the alert is coming from
- Increase the size of the PVC by adjusting the relevant parameter in one of the `saas.yaml` files

## ObservatoriumExpectedReplicasUnavailable

### Impact

A StatefulSet belonging to the RHOBS service has not been running the expected number of replicas for a prolonged period.
This may degrade the metric query or ingestion performance of the system.

### Summary

A StatefulSet has fewer ready replicas than expected. Possible causes include:

1. Pod stuck in a terminating state.
2. Pod unable to be scheduled due to constraints on the cluster such as node capacity or resource limits.

### Severity

`critical`

### Access Required

- Console access to the cluster that runs Observatorium.
- Edit access to the Observatorium namespaces:
  - `observatorium-metrics-stage`
  - `observatorium-metrics-production`
  - `observatorium-mst-stage`
  - `observatorium-mst-production`

### Steps

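- Check the `statefulset` and `namespace` labels on the alert to establish which component is affected.
- Determine the reason for the missing replica(s), for example a Pod stuck in a terminating state or a Pod that cannot be scheduled.
- Act on the above information to address the issue.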
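
As a rough illustration of the "determine the reason" step, the sketch below classifies a Pod object (in the JSON form returned by the Kubernetes API, for example via `kubectl get pod -o json`) into the two causes listed in the Summary. It is not part of the SOP or of this commit. The function name `classify_pod` and the sample Pod dictionaries are hypothetical, and the sketch assumes the standard Pod fields `metadata.deletionTimestamp`, `status.phase`, and the `PodScheduled` condition with reason `Unschedulable`. A set `deletionTimestamp` only means deletion has started; whether the Pod is actually stuck still has to be judged from how long ago that was.

```python
from typing import Any, Dict


def classify_pod(pod: Dict[str, Any]) -> str:
    """Return a coarse reason why a Pod may not count as a ready replica."""
    metadata = pod.get("metadata", {})
    status = pod.get("status", {})
    conditions = status.get("conditions", [])

    # A Pod with a deletion timestamp is being terminated.
    if metadata.get("deletionTimestamp"):
        return "terminating"

    # A Pending Pod whose PodScheduled condition is False with reason
    # Unschedulable could not be placed on any node.
    if status.get("phase") == "Pending":
        for cond in conditions:
            if (
                cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"
            ):
                return "unschedulable"
        return "pending"

    # Otherwise fall back to the Ready condition.
    for cond in conditions:
        if cond.get("type") == "Ready":
            return "ready" if cond.get("status") == "True" else "not-ready"
    return "unknown"


# Hypothetical sample objects, trimmed to the fields the sketch reads.
stuck_pod = {
    "metadata": {"deletionTimestamp": "2023-11-29T10:00:00Z"},
    "status": {"phase": "Running"},
}
unschedulable_pod = {
    "metadata": {},
    "status": {
        "phase": "Pending",
        "conditions": [
            {"type": "PodScheduled", "status": "False", "reason": "Unschedulable"}
        ],
    },
}

print(classify_pod(stuck_pod))
print(classify_pod(unschedulable_pod))
```

The same fields can be inspected by hand in the console; the sketch only makes explicit which parts of the Pod object distinguish the two causes.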

# Observatorium Gubernator Alerts

## GubernatorIsDown
14 changes: 14 additions & 0 deletions observability/prometheusrules.jsonnet
@@ -357,6 +357,20 @@ local renderAlerts(name, environment, mixin) = {
            severity: 'critical',
          },
        },
        {
          alert: 'ObservatoriumExpectedReplicasUnavailable',
          annotations: {
            description: 'The StatefulSet {{ $labels.statefulset }} in namespace {{ $labels.namespace }} has a mismatch between the expected and ready replicas.',
            summary: 'One or more workloads in Observatorium persistently have less replicas in a ready state than expected for an extended period.',
          },
          expr: |||
            kube_statefulset_replicas - kube_statefulset_status_replicas_ready > 0
          |||,
          'for': '20m',
          labels: {
            severity: 'critical',
          },
        },
      ],
    },
  ],
@@ -67,3 +67,16 @@ spec:
      labels:
        service: telemeter
        severity: critical
    - alert: ObservatoriumExpectedReplicasUnavailable
      annotations:
        dashboard: https://grafana.app-sre.devshift.net/d/no-dashboard/observatorium-metrics?orgId=1&refresh=10s&var-datasource={{$externalLabels.cluster}}-prometheus&var-namespace={{$labels.namespace}}&var-job=All&var-pod=All&var-interval=5m
        description: The StatefulSet {{ $labels.statefulset }} in namespace {{ $labels.namespace }} has a mismatch between the expected and ready replicas.
        message: The StatefulSet {{ $labels.statefulset }} in namespace {{ $labels.namespace }} has a mismatch between the expected and ready replicas.
        runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#observatoriumexpectedreplicasunavailable
        summary: One or more workloads in Observatorium persistently have less replicas in a ready state than expected for an extended period.
      expr: |
        kube_statefulset_replicas - kube_statefulset_status_replicas_ready > 0
      for: 20m
      labels:
        service: telemeter
        severity: critical
@@ -67,3 +67,16 @@ spec:
      labels:
        service: telemeter
        severity: high
    - alert: ObservatoriumExpectedReplicasUnavailable
      annotations:
        dashboard: https://grafana.app-sre.devshift.net/d/no-dashboard/observatorium-metrics?orgId=1&refresh=10s&var-datasource={{$externalLabels.cluster}}-prometheus&var-namespace={{$labels.namespace}}&var-job=All&var-pod=All&var-interval=5m
        description: The StatefulSet {{ $labels.statefulset }} in namespace {{ $labels.namespace }} has a mismatch between the expected and ready replicas.
        message: The StatefulSet {{ $labels.statefulset }} in namespace {{ $labels.namespace }} has a mismatch between the expected and ready replicas.
        runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#observatoriumexpectedreplicasunavailable
        summary: One or more workloads in Observatorium persistently have less replicas in a ready state than expected for an extended period.
      expr: |
        kube_statefulset_replicas - kube_statefulset_status_replicas_ready > 0
      for: 20m
      labels:
        service: telemeter
        severity: high
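
The alert expression used in all three rule definitions can be read as a per-StatefulSet subtraction followed by a filter. The sketch below mimics one instant evaluation of `kube_statefulset_replicas - kube_statefulset_status_replicas_ready > 0` in plain Python, assuming both metrics carry matching `namespace` and `statefulset` labels so that series pair up one-to-one. The function name `missing_replicas` and the StatefulSet names in the sample data are hypothetical. The sketch does not model the `for: 20m` clause, which additionally requires the condition to hold continuously for 20 minutes before the alert fires.

```python
from typing import Dict, Tuple

# A series is identified here by (namespace, statefulset).
SeriesKey = Tuple[str, str]


def missing_replicas(
    expected: Dict[SeriesKey, int],
    ready: Dict[SeriesKey, int],
) -> Dict[SeriesKey, int]:
    """Return expected - ready for every series pair where the result is > 0."""
    result: Dict[SeriesKey, int] = {}
    for key, want in expected.items():
        if key not in ready:
            # One-to-one matching drops series without a partner.
            continue
        diff = want - ready[key]
        if diff > 0:
            # The `> 0` comparison keeps only StatefulSets with missing replicas.
            result[key] = diff
    return result


# Hypothetical sample values for the two metrics.
expected = {
    ("observatorium-metrics-production", "example-statefulset-a"): 3,
    ("observatorium-mst-production", "example-statefulset-b"): 3,
}
ready = {
    ("observatorium-metrics-production", "example-statefulset-a"): 2,
    ("observatorium-mst-production", "example-statefulset-b"): 3,
}

for (namespace, statefulset), count in missing_replicas(expected, ready).items():
    print(f"{namespace}/{statefulset}: {count} replica(s) not ready")
```

Only the first StatefulSet appears in the result, because the second has as many ready replicas as expected and is filtered out by the comparison.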
