
Monitoring: Notifications

Cezary Statkiewicz edited this page Sep 4, 2017 · 12 revisions

Notifications architecture

Notifications are the part of monitoring that runs after each data collection cycle. They are a configurable mechanism that checks whether metric values are within the allowed range and, if not, sends a notification to designated receivers (registered users or external emails).

Data model

The notification mechanism is composed of several classes, each responsible for a different aspect:

  • High-level configuration: NotificationCheck

Keeps the general description, the list of metric check definitions, the send grace period configuration and the last-send marker, and the list of users to whom the notification should be delivered (stored in a helper table, the NotificationReceiver class).

  • Per-metric definition: MetricNotificationDefinition

Keeps the per-metric, per-check definition: the metric name, the minimum and maximum values the user may enter, the check type (whether the value should be below or above a given threshold, or whether the last reading should be no older than a specified period), and an additional scope for the check (resource, label, OWS service; this part is only partially implemented). Definition objects are created from NotificationCheck.user_thresholds data and are used to generate the validation form. Note that one NotificationCheck can have several definition items, for a set of different metrics. Definition rows are created whenever a NotificationCheck is created or updated.

  • Per-metric check configuration: MetricNotificationCheck

Keeps the per-metric, per-check configuration: the metric and its threshold values. It is created after the user submits the configuration form for a specific notification.

Workflow

Notifications are checked after each collection/processing period in the collection script, by calling CollectorAPI.emit_notifications(for_timestamp). This call does the following:

  • gets all notifications,
  • for each notification, gets all of its notification checks,
  • for each notification check, gets the metric value valid for the given timestamp and checks whether it matches the given criteria,
  • each check can raise an exception, which is caught in the caller; a list of errors is returned for each notification,
  • based on the list of notifications and errors, alerts are generated and sent to users, unless the grace period since the last delivery has not yet elapsed.
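
The steps above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the GeoNode implementation: the Notification dataclass, the CheckFailed exception, the always_fail check and the send callback are all hypothetical names introduced for the example, and the grace period rule is assumed to be "deliver only if at least one grace period has passed since the last delivery".

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List, Optional


class CheckFailed(Exception):
    """Hypothetical error raised by a metric check that fails."""


@dataclass
class Notification:
    # Hypothetical stand-in for NotificationCheck.
    name: str
    grace_period: timedelta
    checks: List[Callable[[datetime], None]] = field(default_factory=list)
    last_send: Optional[datetime] = None

    def can_send(self, now: datetime) -> bool:
        # Assumption: deliver if nothing was sent yet or the grace period elapsed.
        return self.last_send is None or now - self.last_send >= self.grace_period


def emit_notifications(notifications, for_timestamp, send):
    """Run all checks, collect errors per notification, alert outside grace period."""
    delivered = []
    for notification in notifications:
        errors = []
        for check in notification.checks:
            try:
                check(for_timestamp)
            except CheckFailed as exc:
                # Errors raised by checks are captured here, in the caller.
                errors.append(str(exc))
        if errors and notification.can_send(for_timestamp):
            send(notification, errors)
            notification.last_send = for_timestamp
            delivered.append(notification.name)
    return delivered


def always_fail(for_timestamp):
    # Hypothetical check that always reports a problem.
    raise CheckFailed("request.count is below the threshold")
```

With a 10 minute grace period, a notification that fails on every cycle would be delivered on the first cycle, skipped 5 minutes later, and delivered again once 10 minutes have passed.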

Additionally, notifications expose the /monitoring/api/status/ status endpoint, which shows the errors detected at the moment of the request.

Web API

Status API

The status endpoint presents the current state of the error checking performed by notifications. The frontend can poll this endpoint periodically. There is no history view for status at the moment. The status response is wrapped in the standard response envelope. A non-error response has the status key set to ok and success set to true; otherwise the errors key is not empty.

No errors response:

GET /monitoring/api/status/

{"status": "ok",
 "data": [],
 "success": true}

Response with errors reported:

{
  "status": "ok",
  "data": [
    {
      "problems": [
        {
          "threshold_value": "2017-08-29T10:45:26.142",
          "message": "Value collected too far in the past",
          "name": "request.count",
          "severity": "warning",
          "offending_value": "2017-08-25T16:41:00"
        }
      ],
      "check": {
        "grace_period": {
          "seconds": 600,
          "class": "datetime.timedelta"
        },
        "last_send": null,
        "description": "detects when requests are not handled",
        "severity": "warning",
        "user_threshold": {
          "3": {
            "max": 10,
            "metric": "request.count",
            "steps": null,
            "description": "Number of handled requests is lower than",
            "min": 0
          },
          "4": {
            "max": null,
            "metric": "request.count",
            "steps": null,
            "description": "No response for at least",
            "min": 60
          },
          "5": {
            "max": null,
            "metric": "response.time",
            "steps": null,
            "description": "Response time is higher than",
            "min": 500
          }
        },
        "id": 2,
        "name": "geonode is not working"
      }
    }
  ],
  "success": true
}

A response with reported errors contains a list of elements in the data element. Each element contains:

  • check - the serialized NotificationCheck object that was used
  • problems - the list of metric checks that failed. Each element contains the name of the metric, the severity, the error message, and the measured and threshold values.
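
As an illustration of how a client might read this structure, the sketch below flattens a status response into one row per failed metric check. It uses only the keys visible in the sample response above; the function name summarize_status and the trimmed SAMPLE string are made up for the example.

```python
import json


def summarize_status(payload):
    """Return (check name, metric, severity, message) for every reported problem."""
    envelope = json.loads(payload)
    rows = []
    for item in envelope.get("data", []):
        check_name = item["check"]["name"]
        for problem in item["problems"]:
            rows.append(
                (check_name, problem["name"], problem["severity"], problem["message"])
            )
    return rows


# Trimmed copy of the sample response shown above.
SAMPLE = """
{
  "status": "ok",
  "data": [
    {
      "problems": [
        {
          "threshold_value": "2017-08-29T10:45:26.142",
          "message": "Value collected too far in the past",
          "name": "request.count",
          "severity": "warning",
          "offending_value": "2017-08-25T16:41:00"
        }
      ],
      "check": {"id": 2, "name": "geonode is not working", "severity": "warning"}
    }
  ],
  "success": true
}
"""

for row in summarize_status(SAMPLE):
    print(" | ".join(row))
```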

Severity

Severity is a textual description of the potential impact of an error. There are three values: warning, error and fatal.

Notification list

This call returns the list of available notifications:

GET /monitoring/api/notifications/


{
  "status": "ok",
  "data": [
    {
      "url": "/monitoring/api/notifications/config/2/",
      "description": "detects when requests are not handled",
      "severity": "warning",
      "id": 2,
      "name": "geonode is not working"
    }
  ],
  "errors": {},
  "success": true
}

The response contains a list of notification summaries in the data key. Each element has:

  • name, description, severity and id of the notification
  • url to the notification details
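
A client could, for example, order the summaries so that the most severe notifications come first before following their url values. The sketch below assumes the severity values rank as warning < error < fatal; that ordering and the helper name detail_urls_by_severity are assumptions made for this example, not something the API states.

```python
# Assumed ranking of the three severity values, lowest to highest.
SEVERITY_ORDER = {"warning": 0, "error": 1, "fatal": 2}


def detail_urls_by_severity(envelope):
    """Return (severity, name, url) tuples, most severe notification first."""
    items = sorted(
        envelope["data"],
        key=lambda item: SEVERITY_ORDER[item["severity"]],
        reverse=True,
    )
    return [(item["severity"], item["name"], item["url"]) for item in items]
```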

Notification details

This call returns the details of a notification, including the form and the list of allowed fields:

GET /monitoring/api/notifications/config/{{notification_id}}/

{
  "status": "ok",
  "errors": {},
  "data": {
    "fields": [
      {
        "use_resource": false,
        "description": "Number of handled requests is lower than",
        "min": null,
        "max_value": "10.0000",
        "metric": {
          "class": "geonode.contrib.monitoring.models.Metric",
          "name": "request.count",
          "id": 2
        },
        "min_value": "0.0000",
        "use_label": false,
        "steps_calculated": [
          "0.0000",
          "3.33",
          "6.67",
          "10.0"
        ],
        "use_ows_service": false,
        "field_option": "min_value",
        "use_service": false,
        "max": null,
        "current_value": null,
        "steps": 3,
        "notification_check": {
          "class": "geonode.contrib.monitoring.models.NotificationCheck",
          "name": "geonode is not working",
          "id": 2
        },
        "field_name": "request.count.min_value",
        "id": 3,
        "unit": ""
      },
      {
        "use_resource": false,
        "description": "No response for at least",
        "min": null,
        "max_value": null,
        "metric": {
          "class": "geonode.contrib.monitoring.models.Metric",
          "name": "request.count",
          "id": 2
        },
        "min_value": "60.0000",
        "use_label": false,
        "steps_calculated": null,
        "use_ows_service": false,
        "field_option": "max_timeout",
        "use_service": false,
        "max": null,
        "current_value": null,
        "steps": null,
        "notification_check": {
          "class": "geonode.contrib.monitoring.models.NotificationCheck",
          "name": "geonode is not working",
          "id": 2
        },
        "field_name": "request.count.max_timeout",
        "id": 4,
        "unit": ""
      },
      {
        "use_resource": false,
        "description": "Response time is higher than",
        "min": null,
        "max_value": null,
        "metric": {
          "class": "geonode.contrib.monitoring.models.Metric",
          "name": "response.time",
          "id": 11
        },
        "min_value": "500.0000",
        "use_label": false,
        "steps_calculated": null,
        "use_ows_service": false,
        "field_option": "max_value",
        "use_service": false,
        "max": null,
        "current_value": null,
        "steps": null,
        "notification_check": {
          "class": "geonode.contrib.monitoring.models.NotificationCheck",
          "name": "geonode is not working",
          "id": 2
        },
        "field_name": "response.time.max_value",
        "id": 5,
        "unit": "s"
      },
      {
        "use_resource": false,
        "description": "dsfdsf",
        "min": null,
        "max_value": null,
        "metric": {
          "class": "geonode.contrib.monitoring.models.Metric",
          "name": "response.time",
          "id": 11
        },
        "min_value": null,
        "use_label": false,
        "steps_calculated": null,
        "use_ows_service": false,
        "field_option": "min_value",
        "use_service": false,
        "max": null,
        "current_value": null,
        "steps": null,
        "notification_check": {
          "class": "geonode.contrib.monitoring.models.NotificationCheck",
          "name": "geonode is not working",
          "id": 2
        },
        "field_name": "response.time.min_value",
        "id": 6,
        "unit": "s"
      },
      {
        "use_resource": false,
        "description": "Incoming traffic should be higher than",
        "min": null,
        "max_value": null,
        "metric": {
          "class": "geonode.contrib.monitoring.models.Metric",
          "name": "network.in.rate",
          "id": 34
        },
        "min_value": null,
        "use_label": false,
        "steps_calculated": null,
        "use_ows_service": false,
        "field_option": "min_value",
        "use_service": false,
        "max": null,
        "current_value": null,
        "steps": null,
        "notification_check": {
          "class": "geonode.contrib.monitoring.models.NotificationCheck",
          "name": "geonode is not working",
          "id": 2
        },
        "field_name": "network.in.rate.min_value",
        "id": 7,
        "unit": "B/s"
      }
    ],
    "form": "<tr><th><label for=\"id_emails\">Emails:</label></th><td><textarea cols=\"40\" id=\"id_emails\" name=\"emails\" rows=\"10\">\r\n\[email protected]</textarea></td></tr>\n<tr><th><label for=\"id_severity\">Severity:</label></th><td><select id=\"id_severity\" name=\"severity\">\n<option value=\"warning\">Warning</option>\n<option value=\"error\" selected=\"selected\">Error</option>\n<option value=\"fatal\">Fatal</option>\n</select></td></tr>\n<tr><th><label for=\"id_active\">Active:</label></th><td><input checked=\"checked\" id=\"id_active\" name=\"active\" type=\"checkbox\" /></td></tr>\n<tr><th><label for=\"id_grace_period\">Grace period:</label></th><td><input id=\"id_grace_period\" name=\"grace_period\" type=\"text\" value=\"00:01:00\" /></td></tr>\n<tr><th><label for=\"id_request.count.min_value\">Request.count.min value:</label></th><td><select id=\"id_request.count.min_value\" name=\"request.count.min_value\">\n<option value=\"0.0000\">0.0000</option>\n<option value=\"3.33\">3.33</option>\n<option value=\"6.67\">6.67</option>\n<option value=\"10.0\">10.0</option>\n</select></td></tr>\n<tr><th><label for=\"id_request.count.max_timeout\">Request.count.max timeout:</label></th><td><input id=\"id_request.count.max_timeout\" min=\"60.0000\" name=\"request.count.max_timeout\" step=\"0.01\" type=\"number\" /></td></tr>\n<tr><th><label for=\"id_response.time.max_value\">Response.time.max value:</label></th><td><input id=\"id_response.time.max_value\" min=\"500.0000\" name=\"response.time.max_value\" step=\"0.01\" type=\"number\" /></td></tr>\n<tr><th><label for=\"id_response.time.min_value\">Response.time.min value:</label></th><td><input id=\"id_response.time.min_value\" name=\"response.time.min_value\" step=\"0.01\" type=\"number\" /></td></tr>\n<tr><th><label for=\"id_network.in.rate.min_value\">Network.in.rate.min value:</label></th><td><input id=\"id_network.in.rate.min_value\" name=\"network.in.rate.min_value\" step=\"0.01\" type=\"number\" 
/></td></tr>",
    "notification": {
      "grace_period": {
        "seconds": 60,
        "class": "datetime.timedelta"
      },
      "last_send": "2017-09-04T13:13:15.203",
      "description": "detects when requests are not handled",
      "severity": "error",
      "user_threshold": {
        "request.count.max_timeout": {
          "max": null,
          "metric": "request.count",
          "steps": null,
          "description": "No response for at least",
          "min": 60
        },
        "response.time.max_value": {
          "max": null,
          "metric": "response.time",
          "steps": null,
          "description": "Response time is higher than",
          "min": 500
        },
        "request.count.min_value": {
          "max": 10,
          "metric": "request.count",
          "steps": 3,
          "description": "Number of handled requests is lower than",
          "min": 0
        }
      },
      "active": true,
      "id": 2,
      "name": "geonode is not working"
    }
  },
  "success": true
}

Keys returned in the data element:

  • fields - list of form fields, including detailed per-resource configuration flags
  • form - rendered user form, which can be displayed as is
  • notification - serialized notification object with the user_thresholds list (this is the basis for creating the fields objects)

The frontend should use either fields (and build the whole form on the client side) or form (inserting the value as an HTML node), and submit the result to the same URL. Form fields built from the fields list should use field_name as the field name in the form.
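
For a client that builds the form itself, the fields list could be mapped to input descriptions roughly as follows. This is a sketch that mirrors what the rendered form in the sample appears to do (a select when steps_calculated is present, a number input otherwise); the function describe_inputs and the shape of its output are hypothetical.

```python
def describe_inputs(fields):
    """Map entries of the `fields` list to simple input specifications."""
    inputs = []
    for entry in fields:
        spec = {
            # field_name must be used as the form field name on submit.
            "name": entry["field_name"],
            "label": entry["description"],
            "unit": entry["unit"],
        }
        if entry["steps_calculated"]:
            # Pre-computed steps: let the user pick from a select list.
            spec["widget"] = "select"
            spec["choices"] = list(entry["steps_calculated"])
        else:
            # No steps: free numeric input bounded by min_value / max_value.
            spec["widget"] = "number"
            spec["min"] = entry["min_value"]
            spec["max"] = entry["max_value"]
        inputs.append(spec)
    return inputs
```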

Notification editing (by user)

The following API call lets a user configure a notification by setting receivers and adjusting threshold values for checks:

POST /monitoring/api/notifications/config/{{notification_check_id}}/

request.count.max_value=val
description=more test
request.count.min_value=1
name=new name
severity=error
emails=list of emails

If no errors were captured during form processing, the response contains the serialized NotificationCheck in the data element:

{
  "status": "ok",
  "errors": {},
  "data": {
    "grace_period": {
      "seconds": 600,
      "class": "datetime.timedelta"
    },
    "last_send": null,
    "description": "more test",
    "severity": "error",
    "user_threshold": {
      "request.count.max_value": {
        "max": null,
        "metric": "request.count",
        "steps": null,
        "description": "Max number of request",
        "min": 1000
      },
      "request.count.min_value": {
        "max": 100,
        "metric": "request.count",
        "steps": null,
        "description": "Min number of request",
        "min": 0
      }
    },
    "id": 293,
    "name": "test"
  },
  "success": true
}

An error (non-200) response will have the errors key populated:

{
  "status": "error",
  "errors": {
    "user_threshold": [
      "This field is required."
    ],
    "name": [
      "This field is required."
    ],
    "description": [
      "This field is required."
    ]
  },
  "data": [],
  "success": false
}
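
A client-side helper for this call might look like the sketch below. It only prepares the form body and interprets the response envelope; it does not perform the HTTP request. The names encode_edit_form, unwrap and NotificationApiError are invented for the example, and joining emails with newlines is an assumption based on the textarea in the rendered form.

```python
import json
from urllib.parse import urlencode


class NotificationApiError(Exception):
    """Hypothetical error carrying the `errors` mapping from the envelope."""

    def __init__(self, errors):
        super().__init__("notification API reported errors")
        self.errors = errors


def encode_edit_form(thresholds, emails, severity):
    """Build the urlencoded body for the configuration POST."""
    form = dict(thresholds)  # e.g. {"request.count.min_value": 1}
    form["emails"] = "\n".join(emails)  # assumption: one email per line
    form["severity"] = severity
    return urlencode(form)


def unwrap(body):
    """Return the `data` element, or raise if the envelope reports errors."""
    envelope = json.loads(body)
    if not envelope.get("success") or envelope.get("errors"):
        raise NotificationApiError(envelope.get("errors", {}))
    return envelope["data"]
```

Raising on a populated errors key keeps form validation messages (keyed by field name) available to the caller, which can then attach them to the matching inputs.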

Notification creation

This API call creates a new notification. Its form layout differs from the one used for editing:

POST /monitoring/api/notifications/

name=Name of notification (geonode doesn't work)
description=This will check if geonode is serving any data
emails=
user_thresholds=
severity=

Payload elements:

  • name and description are the values visible to the user
  • severity is the severity value (warning, error or fatal)
  • emails is a list of emails encoded as a single string, with each email on a new line
  • user_thresholds is a JSON-encoded list of per-metric-per-check configurations. Each element of the list should be a 10-element list, containing:
    • name of metric
    • field check option (one of three values: min_value, max_value or max_timeout)
    • flag, if metric check can use service
    • flag, if metric check can use resource
    • flag, if metric check can use label
    • flag, if metric check can use ows service
    • minimum value for user input (no minimum check if None)
    • maximum value for user input (no maximum check if None)
    • steps count: the number of steps to generate for user input, so the user can pick a value from a select list instead of typing. This has an effect only if both min and max values are also provided
    • description of the check

Sample payload for user_thresholds (written as a Python literal, before JSON encoding):

        [('request.count', 'min_value', False, False, False, False, 0, 100, None, "Min number of request"),
         ('request.count', 'max_value', False, False, False, False, 1000, None, None, "Max number of request"),
        ]
    

The response is a serialized NotificationCheck wrapped in the standard response envelope (status, errors etc.). The actual data is in the data key. If processing fails, for example because of form validation errors, the response will have a non-200 status and the errors key will be populated.

{
  "status": "ok",
  "errors": {},
  "data": {
    "grace_period": {
      "seconds": 600,
      "class": "datetime.timedelta"
    },
    "last_send": null,
    "description": "more test",
    "user_threshold": {
      "request.count.max_value": {
        "max": 100,
        "metric": "request.count",
        "steps": null,
        "description": "Min number of request",
        "min": 0
      },
      "request.count.min_value": {
        "max": null,
        "metric": "request.count",
        "steps": null,
        "description": "Max number of request",
        "min": 1000
      }
    },
    "id": 257,
    "name": "test"
  },
  "success": true
}
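
The creation payload described above could be assembled as in the following sketch. It encodes the 10-element threshold rows with json.dumps (so None becomes null and False becomes false) and joins emails with newlines. The function name build_creation_payload is hypothetical, and the length check is a client-side convenience rather than documented server behaviour.

```python
import json


def build_creation_payload(name, description, severity, emails, thresholds):
    """Return the form fields for POST /monitoring/api/notifications/."""
    for row in thresholds:
        if len(row) != 10:
            raise ValueError("each user_thresholds row needs exactly 10 elements")
    return {
        "name": name,
        "description": description,
        "severity": severity,
        "emails": "\n".join(emails),  # one email per line
        "user_thresholds": json.dumps([list(row) for row in thresholds]),
    }


payload = build_creation_payload(
    name="test",
    description="more test",
    severity="error",
    emails=[],
    thresholds=[
        ("request.count", "min_value", False, False, False, False, 0, 100, None, "Min number of request"),
        ("request.count", "max_value", False, False, False, False, 1000, None, None, "Max number of request"),
    ],
)
```

The resulting dictionary can be sent as an ordinary form-encoded POST body by whichever HTTP client the frontend or script already uses.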