feat: Initial approach #1

Merged
merged 12 commits into from
Aug 28, 2024
30 changes: 30 additions & 0 deletions .editorconfig
@@ -0,0 +1,30 @@
# EditorConfig is awesome: http://EditorConfig.org
# Uses editorconfig to maintain consistent coding styles

# top-most EditorConfig file
root = true

# Unix-style newlines with a newline ending every file
[*]
charset = utf-8
end_of_line = lf
indent_size = 2
indent_style = space
insert_final_newline = true
max_line_length = 120
trim_trailing_whitespace = true

[{go.mod,go.sum,*.go}]
indent_style = tab
indent_size = 4

[*.{tf,tfvars}]
indent_size = 2
indent_style = space

[*.md]
max_line_length = 0
trim_trailing_whitespace = false

[COMMIT_EDITMSG]
max_line_length = 0
36 changes: 36 additions & 0 deletions .github/workflows/terratest.yaml
@@ -0,0 +1,36 @@
name: Terratest
on: pull_request

permissions: {}

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.22.x'

      - name: Install dependencies
        run: |
          pwd
          cd test
          go get .

      - name: Test with the Go CLI
        run: |
          pwd
          cd test
          go test -v

      - name: Check for updated README (terraform-docs)
        uses: terraform-docs/[email protected]
        with:
          working-dir: .
          fail-on-diff: "true"
          config-file: ".terraform-docs.yml"
5 changes: 3 additions & 2 deletions .gitignore
@@ -4,14 +4,15 @@
# .tfstate files
*.tfstate
*.tfstate.*
.terraform.lock.hcl

# Crash log files
crash.log
crash.*.log

# Exclude all .tfvars files, which are likely to contain sensitive data, such as
# password, private keys, and other secrets. These should not be part of version
# control as they are data points which are potentially sensitive and subject
# password, private keys, and other secrets. These should not be part of version
# control as they are data points which are potentially sensitive and subject
# to change depending on the environment.
*.tfvars
*.tfvars.json
14 changes: 14 additions & 0 deletions .terraform-docs.yml
@@ -0,0 +1,14 @@
formatter: "markdown"

output:
  file: "README.md"

settings:
  anchor: false
  indent: 3

sections:
  show:
    - providers
    - inputs
    - outputs
73 changes: 72 additions & 1 deletion README.md
@@ -1,2 +1,73 @@
# terraform-grafana-prometheus-alerts
Terraform module to convert Prometheus alert rules to Grafana alerts

Terraform module to convert [Prometheus Alerting rules] to [Grafana-managed alert rules]

## Motivation / Why use this module

Many applications (mostly from the CNCF ecosystem) ship with monitoring dashboards and alerts provided by the vendor
or the community. Dashboards are normally provided as a JSON file which can be loaded into Grafana. Alerts are mostly
provided as [Prometheus Alerting rules].

Some users already operate a Grafana instance or use a managed Grafana instance from a cloud provider (Grafana
Cloud, Amazon Managed Grafana, Azure Managed Grafana, etc.). Why not use this Grafana instance for alerting as well?

The problem is that Grafana's unified alerting uses a different format for alert definitions, even though the concepts
of labels and annotations (which provide descriptions and runbook URLs) are almost identical.
This module allows you to reuse the [Prometheus Alerting rules] and configure them inside Grafana.

## Example usage

```hcl
module "cert_manager_rules" {
  source = "github.com/mkilchhofer/terraform-grafana-prometheus-alerts"

  prometheus_alerts_file_path = file("/path/to/alerts/cert-manager.yaml")
  folder_uid = grafana_folder.test.uid
  datasource_uid = grafana_data_source.prometheus.uid
}
```

## Requirements

- Grafana 8.0+ (Unified alerting)

## Limitations

- Defining multiple alerts with the same name is not supported in Grafana

## Overriding definitions from the Prometheus Alerting rules file

TODO
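
Until this section is written, the following is a minimal sketch of how the `overrides` input could be used, based on the input type listed in the module documentation and on how `grafana_alert.tf` looks up `var.overrides[rule.value.alert]`. The alert names are taken from `test/alerts-cert-manager.yaml`; the chosen override values are illustrative assumptions, not recommendations.

```hcl
module "cert_manager_rules" {
  source = "github.com/mkilchhofer/terraform-grafana-prometheus-alerts"

  prometheus_alerts_file_path = file("/path/to/alerts/cert-manager.yaml")
  folder_uid                  = grafana_folder.test.uid
  datasource_uid              = grafana_data_source.prometheus.uid

  # Map keys are the alert names ("alert:" field) from the rules file.
  overrides = {
    # Keep the rule defined, but do not evaluate it.
    CertManagerAbsent = {
      is_paused = true
    }

    # Labels given here are merged over the labels from the rules file,
    # so "severity" below replaces the original "critical".
    CertManagerHittingRateLimits = {
      no_data_state = "OK"
      labels = {
        severity = "warning"
      }
    }
  }
}
```

Attributes left out of an override entry fall back to the module's behaviour without overrides. `alert_threshold` replaces the default `0` used in the threshold expression of the generated rule.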

## TF module documentation

<!-- BEGIN_TF_DOCS -->
### Providers

| Name | Version |
|------|---------|
| grafana | ~> 3.2 |

### Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| datasource\_uid | The UID of the Grafana datasource being queried with the expressions inside the Alerting rule file | `string` | n/a | yes |
| default\_evaluation\_interval\_duration | How often the rule is evaluated by default (when not defined inside your Alerting rules file). | `string` | `"5m"` | no |
| disable\_provenance | Allow modifying the rule group from other sources than Terraform or the Grafana API. | `bool` | `false` | no |
| folder\_uid | The UID of the Grafana folder that the alerts belong to. | `string` | n/a | yes |
| org\_id | The Organization ID of the Grafana Alerting rule groups. (Only supported with basic auth, API keys are already org-scoped) | `string` | `null` | no |
| overrides | Overrides per Alert rule | <pre>map(object({<br> alert_threshold = optional(number)<br> exec_err_state = optional(string)<br> is_paused = optional(bool)<br> no_data_state = optional(string)<br> labels = optional(map(string))<br> }))</pre> | `{}` | no |
| prometheus\_alerts\_file\_path | Path to the Prometheus Alerting rules file | `string` | n/a | yes |

### Outputs

| Name | Description |
|------|-------------|
| alertsfile\_map | n/a |
| file\_as\_yaml | n/a |
<!-- END_TF_DOCS -->

[Grafana-managed alert rules]: https://grafana.com/docs/grafana/latest/alerting/fundamentals/alert-rules/#grafana-managed-alert-rules
[Prometheus Alerting rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
134 changes: 134 additions & 0 deletions grafana_alert.tf
@@ -0,0 +1,134 @@
resource "grafana_rule_group" "this" {
  # for_each = local.file_as_yaml.groups
  for_each = local.alertsfile_map

  name = each.value.name
  folder_uid = var.folder_uid
  org_id = var.org_id

  # There is no function supporting Golang's "duration" (format of interval within an alert group)
  # Use timeadd() function which supports it.
  interval_seconds = (
    (parseint(formatdate("s", timeadd("1970-01-01T00:00:00Z", try(each.value.interval, var.default_evaluation_interval_duration))), 10) * 1) +
    (parseint(formatdate("m", timeadd("1970-01-01T00:00:00Z", try(each.value.interval, var.default_evaluation_interval_duration))), 10) * 60) +
    (parseint(formatdate("h", timeadd("1970-01-01T00:00:00Z", try(each.value.interval, var.default_evaluation_interval_duration))), 10) * 3600)
  )

  disable_provenance = var.disable_provenance

  dynamic "rule" {
    for_each = {for rule in each.value.rules: rule.alert => rule}

    content {
      name = rule.value.alert
      for = try(rule.value.for, null)
      condition = "ALERTCONDITION"

      annotations = {for k, v in rule.value.annotations : k => replace(v, "$value", "$values.QUERY_RESULT.Value")}
      labels = merge(rule.value.labels, try(var.overrides[rule.value.alert].labels, {}))

      exec_err_state = try(var.overrides[rule.value.alert].exec_err_state, null)
      is_paused = try(var.overrides[rule.value.alert].is_paused, null)
      no_data_state = try(var.overrides[rule.value.alert].no_data_state, null)

      data {
        ref_id = "QUERY"
        relative_time_range {
          from = 600
          to = 0
        }
        datasource_uid = var.datasource_uid
        model = jsonencode({
          editorMode = "code"
          expr = rule.value.expr
          intervalMs = 1000
          maxDataPoints = 43200
          refId = "QUERY"
        })
      }

      ## Reduce
      data {
        ref_id = "QUERY_RESULT"
        relative_time_range {
          from = 600
          to = 0
        }
        datasource_uid = "__expr__"
        model = jsonencode({
          "conditions" = [
            {
              "evaluator" = {
                "params" = [0]
                "type" = "gt"
              }
              "operator" = {
                "type" = "and"
              }
              "query" = {
                "params" = []
              }
              "reducer" = {
                "params" = []
                "type" = "avg"
              }
              "type" = "query"
            },
          ]
          "datasource" = {
            "name" = "Expression"
            "type" = "__expr__"
            "uid" = "__expr__"
          }
          "expression" = "QUERY"
          "intervalMs" = 1000
          "maxDataPoints" = 43200
          "reducer" = "last"
          "refId" = "QUERY_RESULT"
          "type" = "reduce"
        })
      }

      ## Threshold
      data {
        ref_id = "ALERTCONDITION"
        relative_time_range {
          from = 600
          to = 0
        }
        datasource_uid = "__expr__"
        model = jsonencode({
          "conditions" = [
            {
              "evaluator" = {
                "params" = [try(var.overrides[rule.value.alert].alert_threshold, 0)]
                "type" = "gt"
              }
              "operator" = {
                "type" = "and"
              }
              "query" = {
                "params" = ["QUERY_RESULT"]
              }
              "reducer" = {
                "params" = []
                "type" = "last"
              }
              "type" = "query"
            },
          ]
          "datasource" = {
            "type" = "__expr__"
            "uid" = "__expr__"
          }
          "expression" = "QUERY_RESULT"
          "hide" = false
          "intervalMs" = 1000
          "maxDataPoints" = 43200
          "refId" = "ALERTCONDITION"
          "type" = "threshold"
        })
      }
    }
  }
}
4 changes: 4 additions & 0 deletions locals.tf
@@ -0,0 +1,4 @@
locals {
  file_as_yaml = yamldecode(var.prometheus_alerts_file_path)
  alertsfile_map = {for group in local.file_as_yaml.groups: group.name => group}
}
7 changes: 7 additions & 0 deletions outputs.tf
@@ -0,0 +1,7 @@
output "file_as_yaml" {
  value = local.file_as_yaml
}

output "alertsfile_map" {
  value = local.alertsfile_map
}
61 changes: 61 additions & 0 deletions test/alerts-cert-manager.yaml
@@ -0,0 +1,61 @@
# Source: https://github.com/monitoring-mixins/website/blob/master/assets/cert-manager/alerts.yaml
groups:
- name: cert-manager
  rules:
  - alert: CertManagerAbsent
    annotations:
      description: New certificates will not be able to be minted, and existing ones
        can't be renewed until cert-manager is back.
      runbook_url: https://github.com/imusmanmalik/cert-manager-mixin/blob/main/RUNBOOK.md#certmanagerabsent
      summary: Cert Manager has disappeared from Prometheus service discovery.
    expr: absent(up{job="cert-manager"})
    for: 10m
    labels:
      severity: critical
- name: certificates
  rules:
  - alert: CertManagerCertExpirySoon
    annotations:
      dashboard_url: https://grafana.example.com/d/TvuRo2iMk/cert-manager
      description: The domain that this cert covers will be unavailable after {{ $value
        | humanizeDuration }}. Clients using endpoints that this cert protects will
        start to fail in {{ $value | humanizeDuration }}.
      runbook_url: https://github.com/imusmanmalik/cert-manager-mixin/blob/main/RUNBOOK.md#certmanagercertexpirysoon
      summary: The cert `{{ $labels.name }}` is {{ $value | humanizeDuration }} from
        expiry, it should have renewed over a week ago.
    expr: |
      avg by (exported_namespace, namespace, name) (
        certmanager_certificate_expiration_timestamp_seconds - time()
      ) < (21 * 24 * 3600) # 21 days in seconds
    for: 1h
    labels:
      severity: warning
  - alert: CertManagerCertNotReady
    annotations:
      dashboard_url: https://grafana.example.com/d/TvuRo2iMk/cert-manager
      description: This certificate has not been ready to serve traffic for at least
        10m. If the cert is being renewed or there is another valid cert, the ingress
        controller _may_ be able to serve that instead.
      runbook_url: https://github.com/imusmanmalik/cert-manager-mixin/blob/main/RUNBOOK.md#certmanagercertnotready
      summary: The cert `{{ $labels.name }}` is not ready to serve traffic.
    expr: |
      max by (name, exported_namespace, namespace, condition) (
        certmanager_certificate_ready_status{condition!="True"} == 1
      )
    for: 10m
    labels:
      severity: critical
  - alert: CertManagerHittingRateLimits
    annotations:
      dashboard_url: https://grafana.example.com/d/TvuRo2iMk/cert-manager
      description: Depending on the rate limit, cert-manager may be unable to generate
        certificates for up to a week.
      runbook_url: https://github.com/imusmanmalik/cert-manager-mixin/blob/main/RUNBOOK.md#certmanagerhittingratelimits
      summary: Cert manager hitting LetsEncrypt rate limits.
    expr: |
      sum by (host) (
        rate(certmanager_http_acme_client_request_count{status="429"}[5m])
      ) > 0
    for: 5m
    labels:
      severity: critical