Monitoring: Notifications
Notifications are the part of monitoring that runs after each data collection cycle. They provide a configurable mechanism that checks whether metric values are within the allowed range and, if not, sends a notification to designated receivers (registered users or external email addresses).
The notification mechanism is composed of several classes, each responsible for a different aspect:
- High-level configuration: `NotificationCheck`. Keeps the general description, the list of metric check definitions, the send grace period configuration and last send marker, and the list of users to whom the notification should be delivered (in a helper table, the `NotificationReceiver` class).
- Per-metric definition: `MetricNotificationDefinition`. Keeps the per-metric-per-check configuration: the metric name, the min and max values the user is allowed to enter, the check type (whether the value should be below or above a given threshold, or whether the last reading should be no older than a specific period at the time of the check), and an additional scope for the check (resource, label, OWS service; this part is only partially implemented). Definition objects are created from `NotificationCheck.user_thresholds` data and are used to generate the validation form. Note that one `NotificationCheck` can have several definition items, covering a set of different metrics. Definition rows are created when a `NotificationCheck` is created or updated.
- Per-metric check configuration: `MetricNotificationCheck`. Keeps the per-metric-per-check configuration: the metric and threshold values. It is created after the user submits the configuration form for a specific notification.
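
The three check types can be illustrated with a small sketch. This is not the GeoNode implementation: the function name `check_reading` and its signature are hypothetical, and the semantics (alert when a value is below `min_value`, above `max_value`, or when the last reading is older than `max_timeout` seconds) are inferred from the field descriptions and error message shown in the API samples on this page.

```python
from datetime import datetime, timedelta
from typing import Optional


def check_reading(option: str, threshold: float, value: float,
                  collected_at: datetime, now: datetime) -> Optional[str]:
    """Return an error message if the reading violates the threshold, else None.

    Hypothetical helper; `option` is one of the three field options.
    """
    if option == "min_value":
        # Assumed meaning: alert when the value is lower than the threshold.
        if value < threshold:
            return f"Value too low: {value} < {threshold}"
    elif option == "max_value":
        # Assumed meaning: alert when the value is higher than the threshold.
        if value > threshold:
            return f"Value too high: {value} > {threshold}"
    elif option == "max_timeout":
        # Assumed meaning: the threshold is a number of seconds and the
        # last reading must not be older than that.
        if collected_at < now - timedelta(seconds=threshold):
            return "Value collected too far in the past"
    else:
        raise ValueError(f"Unknown field option: {option}")
    return None
```

For example, `check_reading("max_timeout", 60, 5, last_seen, now)` returns a message only when `last_seen` is more than 60 seconds before `now`.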
Notifications are checked after each collection/processing period in the collection script, by calling `CollectorAPI.emit_notifications(for_timestamp)`. This does the following:
- gets all notifications,
- for each notification, gets all notification checks,
- for each notification check, gets the metric valid for the given timestamp and checks whether the value matches the given criteria,
- each check can raise an exception, which is captured in the caller; for each notification, a list of errors is returned,
- based on the list of notifications and errors, alerts are generated and sent to users, unless the grace period since the last delivery has not yet elapsed.
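
The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the code behind `CollectorAPI.emit_notifications`: the `Notification` dataclass, the `send` callback and the use of `ValueError` as the check exception are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional


@dataclass
class Notification:
    """Hypothetical stand-in for NotificationCheck."""
    name: str
    checks: List[Callable[[datetime], None]]  # each check raises ValueError on failure
    grace_period: timedelta
    last_send: Optional[datetime] = None


def emit_notifications(notifications: List[Notification],
                       for_timestamp: datetime,
                       send: Callable[[str, List[str]], None]) -> Dict[str, List[str]]:
    """Run all checks and deliver alerts, honouring each grace period."""
    alerts: Dict[str, List[str]] = {}
    for notification in notifications:
        errors: List[str] = []
        for check in notification.checks:
            try:
                check(for_timestamp)
            except ValueError as exc:
                # A failed check is captured here and collected per notification.
                errors.append(str(exc))
        if not errors:
            continue
        alerts[notification.name] = errors
        # Skip delivery while the grace period since the last send is running.
        in_grace = (notification.last_send is not None
                    and for_timestamp - notification.last_send < notification.grace_period)
        if not in_grace:
            send(notification.name, errors)
            notification.last_send = for_timestamp
    return alerts
```

Errors are still returned while the grace period is running; only the delivery is suppressed, which matches the idea that the status can be inspected at any time.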
Additionally, notifications expose the `/monitoring/api/status/` [status endpoint](#Status API), which shows the errors detected at the moment of the request.
The status endpoint presents the current state of the error checking performed by notifications. The frontend can make requests to this endpoint periodically. There is no history view for the status at the moment.

The status response is wrapped in the standard response envelope. A non-error response has the `status` key set to `ok` and `success` set to `true`; otherwise the `errors` key is not empty.
Response with no errors reported:
GET /monitoring/api/status/
{"status": "ok",
"data": [],
"success": true}
Response with errors reported:
{
"status": "ok",
"data": [
{
"problems": [
{
"threshold_value": "2017-08-29T10:45:26.142",
"message": "Value collected too far in the past",
"name": "request.count",
"offending_value": "2017-08-25T16:41:00"
}
],
"check": {
"grace_period": {
"seconds": 600,
"class": "datetime.timedelta"
},
"last_send": null,
"description": "detects when requests are not handled",
"user_threshold": {
"3": {
"max": 10,
"metric": "request.count",
"steps": null,
"description": "Number of handled requests is lower than",
"min": 0
},
"4": {
"max": null,
"metric": "request.count",
"steps": null,
"description": "No response for at least",
"min": 60
},
"5": {
"max": null,
"metric": "response.time",
"steps": null,
"description": "Response time is higher than",
"min": 500
}
},
"id": 2,
"name": "geonode is not working"
}
}
],
"success": true
}
A response with reported errors contains a list of check elements in the `data` element. Each check element contains:
- `check` - the serialized `NotificationCheck` object that was used
- `problems` - the list of metric checks that failed. Each element contains the metric name, the error message, and the measured and threshold values.
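
As an illustration of how a client might consume this structure, the sketch below turns a status response body into one human-readable line per failed metric check. The function name `summarize_status` is hypothetical; only the keys shown in the sample response are used.

```python
import json
from typing import List


def summarize_status(body: str) -> List[str]:
    """Flatten a status response into one line per reported problem."""
    payload = json.loads(body)
    lines: List[str] = []
    for item in payload.get("data", []):
        check_name = item["check"]["name"]
        for problem in item["problems"]:
            lines.append(
                "{}: {}: {} (measured {}, threshold {})".format(
                    check_name,
                    problem["name"],
                    problem["message"],
                    problem["offending_value"],
                    problem["threshold_value"],
                )
            )
    return lines
```

An empty list means the `data` element contained no failed checks.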
This call returns the list of available notifications:
GET /monitoring/api/notifications/
{"status": "ok",
"data": [ {"url": "/monitoring/api/notifications/config/2/",
"description": "detects when requests are not handled",
"id": 2,
"name": "geonode is not working"}],
"errors": {},
"success": true}
The response contains a list of notification summaries in the `data` key. Each element has:
- `name`, `description` and `id` of the notification
- `url` to the notification details
This call returns the details of a notification, including the form and the list of allowed fields:
GET /monitoring/api/notifications/config/{{notification_id}}/
{
"status": "ok",
"errors": {},
"data": {
"fields": [
{
"use_resource": false,
"description": "Number of handled requests is lower than",
"metric": {
"class": "geonode.contrib.monitoring.models.Metric",
"name": "request.count",
"id": 2
},
"use_label": false,
"use_ows_service": false,
"field_option": "min_value",
"use_service": false,
"notification_check": {
"class": "geonode.contrib.monitoring.models.NotificationCheck",
"name": "geonode is not working",
"id": 2
},
"field_name": "request.count.min_value",
"id": 3
},
{
"use_resource": false,
"description": "No response for at least",
"metric": {
"class": "geonode.contrib.monitoring.models.Metric",
"name": "request.count",
"id": 2
},
"use_label": false,
"use_ows_service": false,
"field_option": "max_timeout",
"use_service": false,
"notification_check": {
"class": "geonode.contrib.monitoring.models.NotificationCheck",
"name": "geonode is not working",
"id": 2
},
"field_name": "request.count.max_timeout",
"id": 4
},
{
"use_resource": false,
"description": "Response time is higher than",
"metric": {
"class": "geonode.contrib.monitoring.models.Metric",
"name": "response.time",
"id": 11
},
"use_label": false,
"use_ows_service": false,
"field_option": "max_value",
"use_service": false,
"notification_check": {
"class": "geonode.contrib.monitoring.models.NotificationCheck",
"name": "geonode is not working",
"id": 2
},
"field_name": "response.time.max_value",
"id": 5
}
],
"form": "<tr><th><label for=\"id_emails\">Emails:</label></th><td><textarea cols=\"40\" id=\"id_emails\" name=\"emails\" rows=\"10\">\r\n</textarea></td></tr>\n<tr><th><label for=\"id_request.count.min_value\">Request.count.min value:</label></th><td><input id=\"id_request.count.min_value\" max=\"10\" min=\"0\" name=\"request.count.min_value\" step=\"0.01\" type=\"number\" /></td></tr>\n<tr><th><label for=\"id_request.count.max_timeout\">Request.count.max timeout:</label></th><td><input id=\"id_request.count.max_timeout\" min=\"60\" name=\"request.count.max_timeout\" step=\"0.01\" type=\"number\" /></td></tr>\n<tr><th><label for=\"id_response.time.max_value\">Response.time.max value:</label></th><td><input id=\"id_response.time.max_value\" min=\"500\" name=\"response.time.max_value\" step=\"0.01\" type=\"number\" /></td></tr>",
"notification": {
"grace_period": {
"seconds": 600,
"class": "datetime.timedelta"
},
"last_send": null,
"description": "detects when requests are not handled",
"user_threshold": {
"3": {
"max": 10,
"metric": "request.count",
"steps": null,
"description": "Number of handled requests is lower than",
"min": 0
},
"4": {
"max": null,
"metric": "request.count",
"steps": null,
"description": "No response for at least",
"min": 60
},
"5": {
"max": null,
"metric": "response.time",
"steps": null,
"description": "Response time is higher than",
"min": 500
}
},
"id": 2,
"name": "geonode is not working"
}
},
"success": true
}
Returned keys in the `data` element:
- `fields` - list of form fields, including detailed per-resource configuration flags
- `form` - rendered user form, which can be displayed
- `notification` - serialized notification object with the `user_thresholds` list (this is the basis for creating the `fields` objects)

The frontend should use either the `fields` values (and build the whole form on the client side) or the `form` value (inserting it as an HTML node), and submit the result to the same URL.
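
A client that builds the form from `fields` needs, for each input, its name, label and allowed range. The sketch below is one possible way to derive them; the function name `form_inputs` is hypothetical, and pairing each field with the threshold entry keyed by the field `id` is an assumption based on the sample response above, where the `user_threshold` keys "3", "4" and "5" line up with the field ids.

```python
from typing import Any, Dict, List


def form_inputs(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Describe the inputs a client-side form would need, one per field."""
    thresholds = data["notification"]["user_threshold"]
    inputs: List[Dict[str, Any]] = []
    for field in data["fields"]:
        # Assumption: thresholds are keyed by the field id, as in the sample.
        limits = thresholds.get(str(field["id"]), {})
        inputs.append({
            "name": field["field_name"],
            "label": field["description"],
            "min": limits.get("min"),
            "max": limits.get("max"),
        })
    return inputs
```

For the sample response this yields an input named `request.count.min_value` with range 0 to 10, which agrees with the `min` and `max` attributes in the rendered `form` value.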
The following API call lets the user configure a notification by setting receivers and adjusting threshold values for checks:
POST /monitoring/api/notifications/config/{{notification_check_id}}/
request.count.max_value=val
description=more tesddddt
request.count.min_value=1
name=new name
emails=list of emails
The response contains the serialized `NotificationCheck` in the `data` element if no errors were captured during form processing:
{
"status": "ok",
"errors": {},
"data": {
"grace_period": {
"seconds": 600,
"class": "datetime.timedelta"
},
"last_send": null,
"description": "more test",
"user_threshold": {
"request.count.max_value": {
"max": null,
"metric": "request.count",
"steps": null,
"description": "Max number of request",
"min": 1000
},
"request.count.min_value": {
"max": 100,
"metric": "request.count",
"steps": null,
"description": "Min number of request",
"min": 0
}
},
"id": 293,
"name": "test"
},
"success": true
}
An error (non-200) response has the `errors` key populated:
{
"status": "error",
"errors": {
"user_threshold": [
"This field is required."
],
"name": [
"This field is required."
],
"description": [
"This field is required."
]
},
"data": [],
"success": false
}
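
Since every response shares the same envelope, a client can handle success and failure in one place. The following is a minimal sketch, assuming only the envelope keys shown in the samples (`success`, `errors`, `data`); `ApiError` and `unwrap` are hypothetical names.

```python
import json
from typing import Any, Dict


class ApiError(Exception):
    """Carries the per-field error lists from the envelope (hypothetical)."""

    def __init__(self, errors: Dict[str, Any]) -> None:
        super().__init__("API request failed")
        self.errors = errors


def unwrap(body: str) -> Any:
    """Return the `data` element, or raise ApiError if errors were reported."""
    payload = json.loads(body)
    if not payload.get("success") or payload.get("errors"):
        raise ApiError(payload.get("errors", {}))
    return payload["data"]
```

On failure, `ApiError.errors` maps each form field name to its list of validation messages, which a form can display next to the matching input.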
This API call creates a new notification. Its form layout differs from the one used for editing:
POST /monitoring/api/notifications/
name=Name of notification (geonode doesn't work)
description=This will check if geonode is serving any data
emails=
user_thresholds=
Payload elements:
- `name`, `description` are values visible to the user
- `emails` is a list of emails, encoded as a single string with each email on a new line
- `user_thresholds` is a JSON-encoded list of per-metric-per-check configurations. Each element of the list should be a 10-element list, containing:
  - name of the metric
  - field check option (one of three values: `min_value`, `max_value` or `max_timeout`)
  - flag indicating whether the metric check can use service
  - flag indicating whether the metric check can use resource
  - flag indicating whether the metric check can use label
  - flag indicating whether the metric check can use OWS service
  - minimum value for user input (no minimum check if None)
  - maximum value for user input (no maximum check if None)
  - steps count: the number of steps to generate for user input, so the user can pick a value from a select list instead of typing. This takes effect only if both min and max values are also provided
  - description of the check (the last element in the sample payload)

Sample payload for `user_thresholds`:
[('request.count', 'min_value', False, False, False, False, 0, 100, None, "Min number of request"), ('request.count', 'max_value', False, False, False, False, 1000, None, None, "Max number of request"), ]
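
The sample above is written with Python literals, while the text describes the value as JSON encoded. A minimal sketch of building the form body, assuming the endpoint expects a standard URL-encoded form with `user_thresholds` serialized as JSON, could look like this (`build_create_payload` is a hypothetical helper, not part of GeoNode):

```python
import json
from typing import List, Sequence
from urllib.parse import urlencode


def build_create_payload(name: str, description: str,
                         emails: List[str],
                         thresholds: Sequence[Sequence[object]]) -> str:
    """Encode the notification creation form as a URL-encoded body."""
    return urlencode({
        "name": name,
        "description": description,
        # One email per line, as described in the payload elements.
        "emails": "\n".join(emails),
        # Tuples become JSON lists; False/None become false/null.
        "user_thresholds": json.dumps(thresholds),
    })
```

The resulting string can be sent as the body of the `POST /monitoring/api/notifications/` request with any HTTP client.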
The response is a serialized `NotificationCheck` wrapped in the standard response envelope (status, errors etc.). The actual data is in the `data` key. If processing failed, for example because of form validation errors, the response status will not be 200 OK and the `errors` key will be populated.
{
"status": "ok",
"errors": {},
"data": {
"grace_period": {
"seconds": 600,
"class": "datetime.timedelta"
},
"last_send": null,
"description": "more test",
"user_threshold": {
"request.count.max_value": {
"max": 100,
"metric": "request.count",
"steps": null,
"description": "Min number of request",
"min": 0
},
"request.count.min_value": {
"max": null,
"metric": "request.count",
"steps": null,
"description": "Max number of request",
"min": 1000
}
},
"id": 257,
"name": "test"
},
"success": true
}