Support multiple kafkaClusters in kafka alerts #2124

Merged: 2 commits, Aug 15, 2024
12 changes: 12 additions & 0 deletions monitoring/kafka/alerts.test.yaml
@@ -35,6 +35,7 @@ tests:
description: 'Kafka: Broker count is down'
exp_labels:
severity: warning
kafkaCluster: ${cluster}
- alertname: BrokersCountCritical
eval_time: 3m
exp_alerts: []
@@ -50,6 +51,7 @@ tests:
exp_labels:
namespace: zenko
service: artesca-data-base-queue
kafkaCluster: ${cluster}
severity: critical

# ActiveControllerCritical
@@ -78,6 +80,7 @@ tests:
summary: 'Kafka: No active controller'
exp_labels:
severity: critical
kafkaCluster: ${cluster}
- alertname: ActiveControllerCritical
eval_time: 3m
exp_alerts: []
@@ -108,6 +111,7 @@ tests:
summary: 'Kafka: 1 under-replicated partitons'
exp_labels:
severity: critical
kafkaCluster: ${cluster}
- alertname: UnderReplicatedPartitions
eval_time: 3m
exp_alerts:
@@ -119,6 +123,7 @@ tests:
summary: 'Kafka: 2 under-replicated partitons'
exp_labels:
severity: critical
kafkaCluster: ${cluster}

# OfflinePartitons
##################################################################################################
@@ -147,6 +152,7 @@ tests:
summary: 'Kafka: 1 offline partitons'
exp_labels:
severity: critical
kafkaCluster: ${cluster}
- alertname: OfflinePartitons
eval_time: 3m
exp_alerts:
@@ -159,6 +165,7 @@ tests:
summary: 'Kafka: 2 offline partitons'
exp_labels:
severity: critical
kafkaCluster: ${cluster}

# RemainingDiskSpaceWarning
##################################################################################################
@@ -198,6 +205,7 @@ tests:
namespace: zenko
persistentvolumeclaim: artesca-data-base-queue-1
severity: warning
kafkaCluster: ${cluster}
- alertname: RemainingDiskSpaceWarning
eval_time: 5d8h
exp_alerts: []
@@ -225,6 +233,7 @@ tests:
summary: Zookeeper Sync Disconected
exp_labels:
severity: warning
kafkaCluster: ${cluster}

# ConsumerLagWarning
##################################################################################################
@@ -274,6 +283,7 @@ tests:
cluster_name: artesca-data-base-queue
group: notification
severity: warning
kafkaCluster: ${cluster}
- alertname: ConsumerLagWarning
eval_time: 20m
exp_alerts:
@@ -290,6 +300,7 @@ tests:
cluster_name: artesca-data-base-queue
group: replication
severity: warning
kafkaCluster: ${cluster}
- exp_annotations:
description: |
Kafka consumer lag has been more more than 300 seconds
@@ -303,3 +314,4 @@ tests:
cluster_name: artesca-data-base-queue
group: notification
severity: warning
kafkaCluster: ${cluster}
8 changes: 8 additions & 0 deletions monitoring/kafka/alerts.yaml
@@ -34,6 +34,7 @@ groups:
for: 1m
labels:
severity: warning
kafkaCluster: ${cluster}
annotations:
summary: 'Not all expected brokers are online.'
description: 'Kafka: Broker count is down'
@@ -44,6 +45,7 @@ groups:
for: 1m
labels:
severity: critical
kafkaCluster: ${cluster}
annotations:
summary: 'No Brokers online'
description: 'Kafka: Broker count is 0'
@@ -53,6 +55,7 @@ groups:
for: 1m
labels:
severity: critical
kafkaCluster: ${cluster}
annotations:
description: >-
No broker in the cluster is reporting as the active controller in the last 1 minute interval. During steady state there should
@@ -64,6 +67,7 @@ groups:
for: 1m
labels:
severity: critical
kafkaCluster: ${cluster}
annotations:
description: >-
Under-replicated partitions means that one or more replicas are not available. This is usually because a broker is down. Restart
@@ -75,6 +79,7 @@ groups:
for: 1m
labels:
severity: critical
kafkaCluster: ${cluster}
annotations:
description: >-
After successful leader election, if the leader for partition dies, then the partition moves to the OfflinePartition state.
@@ -91,6 +96,7 @@ groups:
for: 2m
labels:
severity: warning
kafkaCluster: ${cluster}
annotations:
description: 'Kafka Broker has low disk space'
summary: 'Kafka Broker has low disk space'
@@ -101,6 +107,7 @@ groups:
for: 1m
labels:
severity: warning
kafkaCluster: ${cluster}
annotations:
summary: 'Zookeeper Sync Disconected'
description: 'Kafka Zookeeper Sync Disconected'
@@ -116,6 +123,7 @@ groups:
for: 5m
labels:
severity: warning
kafkaCluster: ${cluster}
annotations:
summary: 'Kafka: consumer lag is too high for {{ $labels.group }}'
description: |
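
Reviewer note: every rule and every test expectation in this diff gains a `kafkaCluster: ${cluster}` label, so the placeholder has to be filled in once per Kafka cluster before Prometheus loads the file. The diff does not show how that substitution is done. The sketch below is only an illustration of the idea using Python's standard `string.Template`, which happens to share the `${name}` syntax; the function name `render_for_clusters`, the fragment's indentation, and the second cluster name are assumptions, not part of this repository. `safe_substitute` is used so that Prometheus' own `$labels` references in annotations are left untouched.

```python
from string import Template

# Fragment assembled from lines visible in the alerts.yaml hunks above.
# Indentation is assumed; the diff view does not preserve it.
RULE_FRAGMENT = Template(
    "labels:\n"
    "  severity: warning\n"
    "  kafkaCluster: ${cluster}\n"
    "annotations:\n"
    "  summary: 'Kafka: consumer lag is too high for {{ $labels.group }}'\n"
)


def render_for_clusters(template, clusters):
    """Return a dict mapping each cluster name to its rendered fragment.

    safe_substitute() fills ${cluster} and leaves unknown placeholders
    such as $labels as-is instead of raising KeyError.
    """
    return {name: template.safe_substitute(cluster=name) for name in clusters}


# 'artesca-data-base-queue' appears in the tests above;
# 'example-second-queue' is a hypothetical second cluster.
rendered = render_for_clusters(
    RULE_FRAGMENT, ["artesca-data-base-queue", "example-second-queue"]
)

for name, text in rendered.items():
    print(f"# rendered for {name}")
    print(text)
```

With one rendered copy per cluster, alerts from different clusters can be told apart by the `kafkaCluster` label, which is what the updated `exp_labels` entries in `alerts.test.yaml` check for.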