feat: INFRA-685 fix minio disk alerts #5

Status: Open, 1 commit to be merged into base branch `main`
60 changes: 31 additions & 29 deletions README.md
@@ -10,52 +10,54 @@ It is meant to:
- Be flexible enough to support unmanaged configuration outside the boilerplate that it manages

Currently, the two kinds of boilerplate that are supported:
- Node exporter rules and alerts for vms (number of hosts detected, cpu, ram, disks)
- Node exporter rules and alerts for VMs (number of hosts detected, CPU, RAM, disks)
- Terracd jobs metrics and alerts (to get the interval since the last plan/apply and a threshold value that will trigger an alert)

# Inputs

- **config**: This should be the value of the entrypoint **prometheus.yml** configuration file which will be generated from this value. The module will add some **rule_files** entries for the rule files it generates and otherwise will leave the content as is.
- **fs_path**: Path where the prometheus configuration will be generated prior to synchronizting it with etcd. Beyond generating the **prometheus.yml** file there, boilerplate rule files will be generated in the **rules** subdirectory.
- **fs_path**: Path where the prometheus configuration will be generated prior to synchronizing it with etcd. Beyond generating the **prometheus.yml** file there, boilerplate rule files will be generated in the **rules** subdirectory.
- **etcd_key_prefix**: Etcd prefix where the processed prometheus configuration will be synchronized.
- **node_exporter_jobs**: List of node exporter jobs to generate boilerplate for. Each entry should take the following keys:
- **tag**: Tag for the node exporter job. Is should consist of words separated by dashes. The job is expected to be called `<tag>-node-exporter`
- **tag**: Tag for the node exporter job. It should consist of words separated by dashes. The job is expected to be called `<tag>-node-exporter`
- **expected_count**: Expected number of instances associated with the job
- **memory_usage_threshold**: Maximum memory usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes of more.
- **cpu_usage_threshold**: Maximum cpu usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes of more.
- **expected_disks_count**: Expected number of disks (ex: 2). An alert will be triggered if the number of disks doesn't match. Can be set to -1 to disable the alert.
- **disk_space_usage_threshold**: Maximum disk space usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes of more.
- **disk_io_usage_threshold**: Maximum disk io usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes of more.
- **alert_labels**: Map of string keys and values corresponding to labels to add to all the jobs' alerts.
- **blackbox_exporter_jobs**: List of blackbox tcp/http exporter jobs to generate boilerplate for. Each entry should take the following keys:
- **tag**: Tag for the blackbox exporter job. Is should consist of words separated by dashes. The job is expected to be called `<tag>-blackbox-exporter`
- **unavailability_tolerance**: Duration the service can be unavailable before an alert triggers. The format of the duration is a string formated as prometheus expects in the **for** field of alert rules.
- **memory_usage_threshold**: Maximum memory usage as a percentage (e.g., 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
- **cpu_usage_threshold**: Maximum CPU usage as a percentage (e.g., 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
- **expected_disks_count**: Expected number of disks (e.g., 7). If set, an alert will be triggered if the number of disks does not match. Can be set to `-1` to disable this alert.
- **min_disks_count**: Minimum expected number of disks (e.g., 5). If both `min_disks_count` and `max_disks_count` are set, an alert will be triggered if the disk count falls outside the range.
- **max_disks_count**: Maximum expected number of disks (e.g., 7). If both `min_disks_count` and `max_disks_count` are set, an alert will be triggered if the disk count falls outside the range.
- **disk_space_usage_threshold**: Maximum disk space usage as a percentage (e.g., 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
- **disk_io_usage_threshold**: Maximum disk IO usage as a percentage (e.g., 95). An alert will be triggered if this threshold is crossed for 15 minutes or more.
- **alert_labels**: Map of string keys and values corresponding to labels to add to all the job's alerts.
- **blackbox_exporter_jobs**: List of blackbox TCP/HTTP exporter jobs to generate boilerplate for. Each entry should take the following keys:
- **tag**: Tag for the blackbox exporter job. It should consist of words separated by dashes. The job is expected to be called `<tag>-blackbox-exporter`
- **unavailability_tolerance**: Duration the service can be unavailable before an alert triggers. The format of the duration is a string formatted as Prometheus expects in the **for** field of alert rules.
- **max_acceptable_latency**: Duration in seconds indicating the maximum acceptable response time for the service. If the service continuously takes longer than this to respond for an interval of time longer than **unavailability_tolerance**, a slow service alert will be triggered.
- **cert_renewal_window**: Delay in days indicating the expected renewal window for the tls certificate provided by the service. If the certificate the service provides expires within a delay shorter than this window, an alert will be triggered to indicate the certificate wasn't renewed properly.
- **has_tls**: Boolean indicating whether the service expects a tls connection. If false, alerts for the cert renewal window and tls version will not be set.
- **expect_recent_tls**: Boolean indicating whether the service is expected to use tls version 1.3. If set to true and the service uses a version of tls older than 1.3, an alert will be triggered.
- **alert_labels**: Map of string keys and values corresponding to labels to add to all the jobs' alerts.
- **cert_renewal_window**: Delay in days indicating the expected renewal window for the TLS certificate provided by the service. If the certificate the service provides expires within a delay shorter than this window, an alert will be triggered to indicate the certificate wasn't renewed properly.
- **has_tls**: Boolean indicating whether the service expects a TLS connection. If false, alerts for the cert renewal window and TLS version will not be set.
- **expect_recent_tls**: Boolean indicating whether the service is expected to use TLS version 1.3. If set to true and the service uses a version of TLS older than 1.3, an alert will be triggered.
- **alert_labels**: Map of string keys and values corresponding to labels to add to all the job's alerts.
- **terracd_jobs**: List of terracd jobs to generate boilerplate for. Each entry should take the following keys:
- **tag**: Tag for the terracd job. It should correspond to the job name.
- **plan_interval_threshold**: Interval threshold after which an alert will be triggered if a **plan** or **apply** command did not run successfully. Used to diagnose a broken or non-running pipeline.
- **apply_interval_threshold**: Interval threshold after which an alert will be triggered if an **apply** command did not run successfully. Used to detect a pipeline that was left in **plan** and never put back on **apply**.
- **unit**: Base time unit to use (**minute** or **hour**) that will affect how the thresholds are interepreted and how the rules are processed (to be either in minutes or hours)
- **alert_labels**: Map of string keys and values corresponding to labels to add to all the jobs' alerts.
- **kubernetes_cluster_jobs**: List of kubernetes cluster jobs to generate boilerplate for. Each entry should take the following key:
- **tag**: Tag for the kubernetes cluster job. It should correspond to the cluster name.
- **expected_services**: List of expected deployments that should have a certain number of long running instances. Each entry should have the following keys:
- **namespace**: Namespace where the service is expected to run
- **name**: Name of the service. It should match the k8 deployment name.
- **unit**: Base time unit to use (**minute** or **hour**) that will affect how the thresholds are interpreted and how the rules are processed (to be either in minutes or hours).
- **alert_labels**: Map of string keys and values corresponding to labels to add to all the job's alerts.
- **kubernetes_cluster_jobs**: List of Kubernetes cluster jobs to generate boilerplate for. Each entry should take the following key:
- **tag**: Tag for the Kubernetes cluster job. It should correspond to the cluster name.
- **expected_services**: List of expected deployments that should have a certain number of long-running instances. Each entry should have the following keys:
- **namespace**: Namespace where the service is expected to run.
- **name**: Name of the service. It should match the Kubernetes deployment name.
- **expected_min_count**: Minimum expected number of instances that should be running.
- **expected_start_delay**: Expected delay before an instance is started. Running instances that have been around for less than that delay won't be considered running.
- **alert_labels**: Extra labels to add to alerts triggered for the service.
- **minio_cluster_jobs**: List of minio cluster jobs to generate boilerplate for. Each entry should take the following key:
- **tag**: Tag for the minio cluster job. It should correspond to the cluster name.
- **minio_cluster_jobs**: List of MinIO cluster jobs to generate boilerplate for. Each entry should take the following key:
- **tag**: Tag for the MinIO cluster job. It should correspond to the cluster name.
- **etcd_exporter_jobs**: List of etcd exporter jobs to generate boilerplate for. Each entry should take the following keys:
- **tag**: Tag for the etcd exporter job. Is should consist of words separated by dashes. The job is expected to be called `<tag>-etcd-exporter`
- **expected_count**: Expected number of etcd members associated with the job
- **max_learn_time**: Max expected time for an etcd learner to catchup.
- **max_db_size**: Maximum expected data size (note that etcd has its own limit if 8GiB)
- **tag**: Tag for the etcd exporter job. It should consist of words separated by dashes. The job is expected to be called `<tag>-etcd-exporter`
- **expected_count**: Expected number of etcd members associated with the job.
- **max_learn_time**: Maximum expected time for an etcd learner to catch up.
- **max_db_size**: Maximum expected data size (note that etcd has its own limit of 8GiB).
- **alert_labels**: Map of string keys and values corresponding to labels to add to all the jobs' alerts.

# Example
17 changes: 17 additions & 0 deletions templates/node-exporter.yml.tpl
@@ -67,6 +67,23 @@ groups:
annotations:
summary: "${title(replace(job.tag, "-", " "))} Number of Disks Unexpected"
description: "Instance *{{ $labels.instance }}* of job *{{ $labels.job }}* has *{{ $value }}* disks. Expected *${job.expected_disks_count}*."
%{ else ~}
%{ if job.min_disks_count >= 0 ~}
%{ if job.max_disks_count >= 0 }
- alert: ${replace(title(replace(job.tag, "-", " ")), " ", "")}DiskCountRangeMismatch
expr: (${replace(job.tag, "-", "_")}:disks:count < ${job.min_disks_count} or ${replace(job.tag, "-", "_")}:disks:count > ${job.max_disks_count})
for: 15m
%{ if length(job.alert_labels) > 0 ~}
labels:
%{ for key, val in job.alert_labels ~}
${key}: "${val}"
%{ endfor ~}
%{ endif ~}
annotations:
summary: "${title(replace(job.tag, "-", " "))} Disk Count Out of Range"
description: "Instance *{{ $labels.instance }}* of job *{{ $labels.job }}* has *{{ $value }}* disks. Expected between *${job.min_disks_count}* and *${job.max_disks_count}*."
%{ endif }
%{ endif ~}
%{ endif ~}
- record: ${replace(job.tag, "-", "_")}:filesystem_size:gigabytes
expr: node_filesystem_size_bytes{job="${job.tag}-node-exporter", fstype="ext4"} / 1024 / 1024 / 1024
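
For review purposes, here is a rough sketch of what the new `DiskCountRangeMismatch` branch in `templates/node-exporter.yml.tpl` would likely render to. The job values are hypothetical and not taken from this PR: `tag = "minio-storage"`, `min_disks_count = 5`, `max_disks_count = 7`, and `alert_labels = { severity = "warning" }`. The indentation is also an assumption, since the diff view above dropped leading whitespace. The condition that opens the enclosing `if` is outside the hunk; this sketch assumes it tests `expected_disks_count`, so the range alert would only be emitted when the exact-count alert is disabled (e.g. `-1`).

```yaml
# Hypothetical rendering for tag "minio-storage", min 5, max 7.
# Indentation is assumed; the scraped diff lost it.
- alert: MinioStorageDiskCountRangeMismatch
  # Fires when the recorded disk count is below 5 or above 7.
  expr: (minio_storage:disks:count < 5 or minio_storage:disks:count > 7)
  for: 15m
  # The labels block is only rendered when alert_labels is non-empty.
  labels:
    severity: "warning"
  annotations:
    summary: "Minio Storage Disk Count Out of Range"
    description: "Instance *{{ $labels.instance }}* of job *{{ $labels.job }}* has *{{ $value }}* disks. Expected between *5* and *7*."
```

If that reading of the enclosing condition is right, setting `min_disks_count` and `max_disks_count` while `expected_disks_count` is still a non-negative value would produce only the exact-count alert, which may be worth stating in the README entries for the two new inputs.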