feat: INFRA-685 fix minio disk alerts

Ferlab-Ste-Justine · Sep 8, 2024 · 5acbd71
1 parent 638fe43
Showing 2 changed files with 48 additions and 29 deletions.
diff --git a/README.md b/README.md
@@ -10,52 +10,54 @@ It is meant to:
   - Be flexible enough to support unmanaged configuration outside the boilerplate that it manages
 
 Currently, the two kinds of boilerplate that are supported:
-- Node exporter rules and alerts for vms (number of hosts detected, cpu, ram, disks)
+- Node exporter rules and alerts for VMs (number of hosts detected, CPU, RAM, disks)
 - Terracd jobs metrics and alerts (to get the interval since the last plan/apply and a threshold value that will trigger an alert)
 
 # Inputs
 
 - **config**: This should be the value of the entrypoint **prometheus.yml** configuration file which will be generated from this value. The module will add some **rule_files** entries for the rule files it generates and otherwise will leave the content as is.
-- **fs_path**: Path where the prometheus configuration will be generated prior to synchronizting it with etcd. Beyond generating the **prometheus.yml** file there, boilerplate rule files will be generated in the **rules** subdirectory.
+- **fs_path**: Path where the prometheus configuration will be generated prior to synchronizing it with etcd. Beyond generating the **prometheus.yml** file there, boilerplate rule files will be generated in the **rules** subdirectory.
 - **etcd_key_prefix**: Etcd prefix where the processed prometheus configuration will be synchronized.
 - **node_exporter_jobs**: List of node exporter jobs to generate boilerplate for. Each entry should take the following keys:
-  - **tag**: Tag for the node exporter job. Is should consist of words separated by dashes. The job is expected to be called `<tag>-node-exporter`
+  - **tag**: Tag for the node exporter job. It should consist of words separated by dashes. The job is expected to be called `<tag>-node-exporter`
   - **expected_count**: Expected number of instances associated with the job
-  - **memory_usage_threshold**: Maximum memory usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes of more.
-  - **cpu_usage_threshold**: Maximum cpu usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes of more.
-  - **expected_disks_count**: Expected number of disks (ex: 2). An alert will be triggered if the number of disks doesn't match. Can be set to -1 to disable the alert.
-  - **disk_space_usage_threshold**: Maximum disk space usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes of more.
-  - **disk_io_usage_threshold**: Maximum disk io usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes of more.
-  - **alert_labels**: Map of string keys and values corresponding to labels to add to all the jobs' alerts.
-- **blackbox_exporter_jobs**: List of blackbox tcp/http exporter jobs to generate boilerplate for. Each entry should take the following keys:
-  - **tag**: Tag for the blackbox exporter job. Is should consist of words separated by dashes. The job is expected to be called `<tag>-blackbox-exporter`
-  - **unavailability_tolerance**: Duration the service can be unavailable before an alert triggers. The format of the duration is a string formated as prometheus expects in the **for** field of alert rules.
+  - **memory_usage_threshold**: Maximum memory usage as a percentage (e.g., 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
+  - **cpu_usage_threshold**: Maximum CPU usage as a percentage (e.g., 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
+  - **expected_disks_count**: Expected number of disks (e.g., 7). If set, an alert will be triggered if the number of disks does not match. Can be set to `-1` to disable this alert.
+  - **min_disks_count**: Minimum expected number of disks (e.g., 5). If both `min_disks_count` and `max_disks_count` are set, an alert will be triggered if the disk count falls outside the range.
+  - **max_disks_count**: Maximum expected number of disks (e.g., 7). If both `min_disks_count` and `max_disks_count` are set, an alert will be triggered if the disk count falls outside the range.
+  - **disk_space_usage_threshold**: Maximum disk space usage as a percentage (e.g., 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
+  - **disk_io_usage_threshold**: Maximum disk IO usage as a percentage (e.g., 95). An alert will be triggered if this threshold is crossed for 15 minutes or more.
+  - **alert_labels**: Map of string keys and values corresponding to labels to add to all the job's alerts.
+- **blackbox_exporter_jobs**: List of blackbox TCP/HTTP exporter jobs to generate boilerplate for. Each entry should take the following keys:
+  - **tag**: Tag for the blackbox exporter job. It should consist of words separated by dashes. The job is expected to be called `<tag>-blackbox-exporter`
+  - **unavailability_tolerance**: Duration the service can be unavailable before an alert triggers. The format of the duration is a string formatted as Prometheus expects in the **for** field of alert rules.
   - **max_acceptable_latency**: Duration in seconds indicating the maximum acceptable response time for the service. If the service continuously takes longer than this to respond for an interval of time longer than **unavailability_tolerance**, a slow service alert will be triggered.
-  - **cert_renewal_window**: Delay in days indicating the expected renewal window for the tls certificate provided by the service. If the certificate the service provides expires within a delay shorter than this window, an alert will be triggered to indicate the certificate wasn't renewed properly.
-  - **has_tls**: Boolean indicating whether the service expects a tls connection. If false, alerts for the cert renewal window and tls version will not be set.
-  - **expect_recent_tls**: Boolean indicating whether the service is expected to use tls version 1.3. If set to true and the service uses a version of tls older than 1.3, an alert will be triggered.
-  - **alert_labels**: Map of string keys and values corresponding to labels to add to all the jobs' alerts.
+  - **cert_renewal_window**: Delay in days indicating the expected renewal window for the TLS certificate provided by the service. If the certificate the service provides expires within a delay shorter than this window, an alert will be triggered to indicate the certificate wasn't renewed properly.
+  - **has_tls**: Boolean indicating whether the service expects a TLS connection. If false, alerts for the cert renewal window and TLS version will not be set.
+  - **expect_recent_tls**: Boolean indicating whether the service is expected to use TLS version 1.3. If set to true and the service uses a version of TLS older than 1.3, an alert will be triggered.
+  - **alert_labels**: Map of string keys and values corresponding to labels to add to all the job's alerts.
 - **terracd_jobs**: List of terracd jobs to generate boilerplate for. Each entry should take the following keys:
   - **tag**: Tag for the terracd job. It should correspond to the job name.
   - **plan_interval_threshold**: Interval threshold after which an alert will be triggered if a **plan** or **apply** command did not run successfully. Used to diagnose a broken or non-running pipeline.
   - **apply_interval_threshold**: Interval threshold after which an alert will be triggered if an **apply** command did not run successfully. Used to detect a pipeline that was left in **plan** and never put back on **apply**.
-  - **unit**: Base time unit to use (**minute** or **hour**) that will affect how the thresholds are interepreted and how the rules are processed (to be either in minutes or hours)
-  - **alert_labels**: Map of string keys and values corresponding to labels to add to all the jobs' alerts.
-- **kubernetes_cluster_jobs**: List of kubernetes cluster jobs to generate boilerplate for. Each entry should take the following key:
-  - **tag**: Tag for the kubernetes cluster job. It should correspond to the cluster name.
-  - **expected_services**: List of expected deployments that should have a certain number of long running instances. Each entry should have the following keys:
-    - **namespace**: Namespace where the service is expected to run
-    - **name**: Name of the service. It should match the k8 deployment name.
+  - **unit**: Base time unit to use (**minute** or **hour**) that will affect how the thresholds are interpreted and how the rules are processed (to be either in minutes or hours).
+  - **alert_labels**: Map of string keys and values corresponding to labels to add to all the job's alerts.
+- **kubernetes_cluster_jobs**: List of Kubernetes cluster jobs to generate boilerplate for. Each entry should take the following keys:
+  - **tag**: Tag for the Kubernetes cluster job. It should correspond to the cluster name.
+  - **expected_services**: List of expected deployments that should have a certain number of long-running instances. Each entry should have the following keys:
+    - **namespace**: Namespace where the service is expected to run.
+    - **name**: Name of the service. It should match the Kubernetes deployment name.
     - **expected_min_count**: Minimum expected number of instances that should be running.
     - **expected_start_delay**: Expected delay before an instance is started. Running instances that have been around for less than that delay won't be considered running.
     - **alert_labels**: Extra labels to add to alerts triggered for the service.
-- **minio_cluster_jobs**: List of minio cluster jobs to generate boilerplate for. Each entry should take the following key:
-  - **tag**: Tag for the minio cluster job. It should correspond to the cluster name.
+- **minio_cluster_jobs**: List of MinIO cluster jobs to generate boilerplate for. Each entry should take the following key:
+  - **tag**: Tag for the MinIO cluster job. It should correspond to the cluster name.
 - **etcd_exporter_jobs**: List of etcd exporter jobs to generate boilerplate for. Each entry should take the following keys:
-  - **tag**: Tag for the etcd exporter job. Is should consist of words separated by dashes. The job is expected to be called `<tag>-etcd-exporter`
-  - **expected_count**: Expected number of etcd members associated with the job
-  - **max_learn_time**: Max expected time for an etcd learner to catchup. 
-  - **max_db_size**: Maximum expected data size (note that etcd has its own limit if 8GiB)
+  - **tag**: Tag for the etcd exporter job. It should consist of words separated by dashes. The job is expected to be called `<tag>-etcd-exporter`
+  - **expected_count**: Expected number of etcd members associated with the job.
+  - **max_learn_time**: Maximum expected time for an etcd learner to catch up.
+  - **max_db_size**: Maximum expected data size (note that etcd has its own limit of 8GiB).
   - **alert_labels**: Map of string keys and values corresponding to labels to add to all the jobs' alerts.
 
 # Example

diff --git a/templates/node-exporter.yml.tpl b/templates/node-exporter.yml.tpl
@@ -67,6 +67,23 @@ groups:
         annotations:
           summary: "${title(replace(job.tag, "-", " "))} Number of Disks Unexpected"
           description: "Instance *{{ $labels.instance }}* of job *{{ $labels.job }}* has *{{ $value }}* disks. Expected *${job.expected_disks_count}*."
+%{ else ~}
+%{ if job.min_disks_count >= 0 ~}
+%{ if job.max_disks_count >= 0 ~}
+      - alert: ${replace(title(replace(job.tag, "-", " ")), " ", "")}DiskCountRangeMismatch
+        expr: (${replace(job.tag, "-", "_")}:disks:count < ${job.min_disks_count} or ${replace(job.tag, "-", "_")}:disks:count > ${job.max_disks_count})
+        for: 15m
+%{ if length(job.alert_labels) > 0 ~}
+        labels:
+%{ for key, val in job.alert_labels ~}
+          ${key}: "${val}"
+%{ endfor ~}
+%{ endif ~}
+        annotations:
+          summary: "${title(replace(job.tag, "-", " "))} Disk Count Out of Range"
+          description: "Instance *{{ $labels.instance }}* of job *{{ $labels.job }}* has *{{ $value }}* disks. Expected between *${job.min_disks_count}* and *${job.max_disks_count}*."
+%{ endif ~}
+%{ endif ~}
 %{ endif ~}
       - record: ${replace(job.tag, "-", "_")}:filesystem_size:gigabytes
         expr: node_filesystem_size_bytes{job="${job.tag}-node-exporter", fstype="ext4"} / 1024 / 1024 / 1024
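
Review note: the decision logic added to `templates/node-exporter.yml.tpl` can be modeled outside Terraform to check which disk count alert a given job would get. The following is a minimal Python sketch, not the module's code. It assumes that the `%{ else ~}` in the hunk pairs with an `expected_disks_count >= 0` guard (that guard is not shown in the diff), and that unset counts are passed as `-1`. The helper names `title` and `disk_count_alert` are hypothetical.

```python
from typing import Optional


def title(text: str) -> str:
    """Upper-case the first letter of each space-separated word, leaving the rest as-is."""
    return " ".join(word[:1].upper() + word[1:] for word in text.split(" "))


def disk_count_alert(job: dict) -> Optional[dict]:
    """Describe the disk count alert the template would emit for `job`, or None."""
    rule = job["tag"].replace("-", "_") + ":disks:count"
    name = title(job["tag"].replace("-", " ")).replace(" ", "")

    # Assumption: the exact-match alert is guarded by expected_disks_count >= 0.
    if job["expected_disks_count"] >= 0:
        return {"kind": "exact", "expected": job["expected_disks_count"]}

    # The range alert is only rendered when both bounds are non-negative.
    if job["min_disks_count"] >= 0 and job["max_disks_count"] >= 0:
        low, high = job["min_disks_count"], job["max_disks_count"]
        return {
            "kind": "range",
            "alert": name + "DiskCountRangeMismatch",
            "expr": f"({rule} < {low} or {rule} > {high})",
        }

    # Neither an exact count nor a full range was configured: no disk count alert.
    return None


# Example job: exact check disabled, range of 5 to 7 disks.
job = {
    "tag": "minio-data",
    "expected_disks_count": -1,
    "min_disks_count": 5,
    "max_disks_count": 7,
}
result = disk_count_alert(job)
```

Under these assumptions, a non-negative `expected_disks_count` takes precedence over the range, and setting only one of `min_disks_count` or `max_disks_count` produces no disk count alert at all.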