Alerting Condition Server

Alexandre Lamarre edited this page Feb 17, 2023 · 14 revisions

Summary:

The alerting condition server stores user specifications that define evaluation rules over data held in Alerting datasources.

The alerting condition server also handles querying the state of those specifications and returning it in a human-readable format.

Architecture:

Alerting Gateway

Description

Create, Read, Update and Delete user configurations for data evaluations that should send alerts.

The condition server accepts AlertCondition specs even if the Alerting Backend is not enabled, but it does not evaluate them against datasources until the Alerting Backend is installed.
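The gating behavior can be sketched as follows; `ConditionServer`, its map-backed store, and the field names are hypothetical stand-ins for illustration, not the actual Opni types:

```go
package main

import (
	"errors"
	"fmt"
)

// AlertCondition is a hypothetical stand-in for the user-supplied spec.
type AlertCondition struct {
	Name  string
	Query string
}

// ConditionServer sketches the gating behavior: specs are always persisted,
// but evaluation only proceeds once the Alerting Backend is installed.
type ConditionServer struct {
	backendInstalled bool
	store            map[string]AlertCondition
}

func NewConditionServer(backendInstalled bool) *ConditionServer {
	return &ConditionServer{
		backendInstalled: backendInstalled,
		store:            map[string]AlertCondition{},
	}
}

// CreateCondition always accepts and stores the spec.
func (s *ConditionServer) CreateCondition(c AlertCondition) {
	s.store[c.Name] = c
}

// Evaluate refuses to run until the backend is installed.
func (s *ConditionServer) Evaluate(name string) error {
	if !s.backendInstalled {
		return errors.New("alerting backend not installed: spec stored but not evaluated")
	}
	// ... dispatch the stored query to the datasource here ...
	return nil
}

func main() {
	srv := NewConditionServer(false)
	srv.CreateCondition(AlertCondition{Name: "cpu-high", Query: "cpu > 0.9"})
	fmt.Println(len(srv.store))                  // spec is stored: 1
	fmt.Println(srv.Evaluate("cpu-high") != nil) // but evaluation errors: true
}
```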

Condition CRUD

(Diagram: Alerting APIs Dataflow - Conditions CRUD)

Responsibilities

Create, Read, Update & Delete Alerting Condition specs from user inputs. If the Alerting Backend is installed, also create the dependencies needed to evaluate conditions.

Corresponding UI element(s)

Description

  • Alerting/Alarms page

Screenshots

(Screenshot: Alerting Alarms)

Performance Issues

  • Key-value stores that retain many or all revisions (such as etcd) can suffer degraded update performance as revision history grows

Condition Status

Description

Determine status based on downstream cluster dependencies, datasource dependencies & the active state of the condition in the Alerting Backend.

An invalidated state means that the specification can no longer be evaluated reliably, or at all.
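One way the status resolution might be sketched; the state names and the `resolveStatus` helper are hypothetical, assuming dependency failures take precedence over the condition's active state in the backend:

```go
package main

import "fmt"

type ConditionState int

const (
	StateOK ConditionState = iota
	StateFiring
	StateInvalidated
)

func (s ConditionState) String() string {
	switch s {
	case StateFiring:
		return "Firing"
	case StateInvalidated:
		return "Invalidated"
	default:
		return "OK"
	}
}

// resolveStatus sketches the precedence described above: a missing cluster
// or datasource dependency invalidates the condition before its active
// state in the Alerting Backend is even consulted.
func resolveStatus(clusterReachable, datasourceInstalled, activeInBackend bool) ConditionState {
	if !clusterReachable || !datasourceInstalled {
		return StateInvalidated
	}
	if activeInBackend {
		return StateFiring
	}
	return StateOK
}

func main() {
	fmt.Println(resolveStatus(true, false, true)) // Invalidated
	fmt.Println(resolveStatus(true, true, true))  // Firing
	fmt.Println(resolveStatus(true, true, false)) // OK
}
```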

Dataflow

(Diagram: Alerting APIs Dataflow - Condition Status)

Responsibilities

Delegate API calls to the management server, external dependencies & the alerting cluster in order to determine the state that best matches the condition.

Corresponding UI element(s)

Description

  • Alerting/Alarms admin UI page: state badge next to the alarm name

Screenshots

(Screenshot: Alerting Alarms State)

Performance Issues

  • Status is evaluated on a per-condition basis, but much of the state information queried by Opni Alerting is batched across all dependencies and all active condition states in the Alerting Backend
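A sketch of the batching pattern: one bulk fetch of all active states can answer many per-condition lookups. `fetchAllActiveStates` and its hard-coded data are hypothetical stand-ins for the backend's batched query:

```go
package main

import "fmt"

// fetchAllActiveStates is a hypothetical stand-in for the batched query
// the Alerting Backend exposes; it returns every condition's active state
// in a single round trip.
func fetchAllActiveStates() map[string]bool {
	return map[string]bool{"cpu-high": true, "disk-full": false}
}

// statusBatch answers per-condition status questions from a single batched
// fetch instead of issuing one backend call per condition.
type statusBatch struct {
	active map[string]bool
}

func newStatusBatch() *statusBatch {
	return &statusBatch{active: fetchAllActiveStates()}
}

func (b *statusBatch) IsActive(conditionID string) bool {
	return b.active[conditionID]
}

func main() {
	batch := newStatusBatch() // one backend round trip
	fmt.Println(batch.IsActive("cpu-high"))  // true
	fmt.Println(batch.IsActive("disk-full")) // false
}
```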

Scale and performance:

  • Scale and performance concerns are delegated to the Alerting Backend and Datasources

High availability:

Tied to Opni Gateway High Availability.

Testing:

Testplan

Unit tests

  • Alerting storage clientset tests cover persisting user configurations for alerting conditions

Integration tests

  • Cover the Alerting Condition CRUD APIs, Silence APIs & Status APIs.

e2e tests

N/A

Manual testing

In a Kubernetes cluster, verify:

  • that a notification is received when clicking test endpoint on a valid endpoint. This verifies that the entire alerting + gateway logic is functional.

  • Install Alerting & Monitoring with at least 1 metrics agent:

    • create a Prometheus query alarm with the query sum(scrape_samples_scraped) != 0; after a couple of minutes it should switch to the firing state. This verifies that the entire alerting + metrics logic is functional.
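For reference, the alarm created in this step corresponds roughly to a raw Prometheus alerting rule like the following sketch (the group and alert names are hypothetical, and the 2m window stands in for "a couple of minutes"):

```yaml
groups:
  - name: opni-manual-test        # hypothetical group name
    rules:
      - alert: ScrapeSamplesNonZero   # hypothetical alert name
        expr: sum(scrape_samples_scraped) != 0
        for: 2m                       # hold before switching to firing
        labels:
          severity: test
```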