diff --git a/doc/event-mgmt-framework/event-alarm-framework.md b/doc/event-mgmt-framework/event-alarm-framework.md index 7a509a489c..ec47834de7 100644 --- a/doc/event-mgmt-framework/event-alarm-framework.md +++ b/doc/event-mgmt-framework/event-alarm-framework.md @@ -46,9 +46,13 @@ Event and Alarm Framework * [3.3.2.3 Show Commands](#3323-show-commands) * [3.3.3 REST API Support](#333-rest-api-support) * [4 Flow Diagrams](#4-flow-diagrams) - * [5 Warm Boot Support](#5-warm-boot-support) - * [5.1 Application warm boot](#51-application-warm-boot) - * [5.2 eventd warm boot](#52-eventd-warm-boot) + * [5 Persistence](#5-persistence) + * [5.1 Warm reboot](#51-warm-reboot) + * [5.1.1 Application restart](#511-application-restart) + * [5.1.2 Eventd service restart](#512-eventd-service-restart) + * [5.1.3 System warm reboot](#513-system-warm-reboot) + * [5.2 Fast reboot](#52-fast-reboot) + * [5.3 Cold reboot](#53-cold-reboot) * [6 Scalability](#6-scalability) * [7 Showtech Support](#7-showtech-support) * [8 Unit Test](#8-unit-test) @@ -148,7 +152,7 @@ As mentioned above, each event has an important characteristic: severity. SONiC The following describes how an alarm transforms and how various tables are updated. ![Alarm Life Cycle](event-alarm-framework-alarm-lifecycle.png) -By default every event will have a severity assigned by the component. The framework provides Event Profiles to customize severity of an event and also disable an event. +By default, every event has a severity assigned by the component. The framework provides an Event Profile to customize the severity of an event and to disable an event. Template for event profile is as below: ``` @@ -163,14 +167,9 @@ Template for event profile is as below: ] } ``` -Event Profiles only contains declarations of events and their characteristics. There has to be an application to raise these events using eventnotify API. +The Event Profile only contains declarations of events and their characteristics.
There has to be an application to raise these events using eventnotify API. The framework maintains default event profile at /etc/evprofile/default.json. -Operator can download default event profile to a remote host. -This downloaded file can be modified by changing the severity or enable flag of event(s). -This modified file can then be uploaded to the device to /etc/evprofile/. -Operator can select any of these custom event profiles to change default properties of events. -The selected profile is persistent across reboots and will be in effect until operator selects either default or another custom profile. In addition to storing events in DB, framework forwards log messages corresponding to all the events to syslog. Syslog message displays the type (ALARM or EVENT), action (RAISE, CLEAR, ACKNOWLEDGE or UNACKNOWLEDGE) - when the message corresponds to an event of an alarm, name of the event and detailed message. @@ -220,10 +219,8 @@ Application owners need to identify various conditions that would be of interest | 8 | CLI commands | | | 8.1 | show alarm [ detail \| summary \| severity \| timestamp \| recent <5min\|1hr\|1day> \| sequence-number \| all] | | | 8.2 | show event [ detail \| summary \| severity \| timestamp \| recent <5min\|1hr\|1day> \| sequence-number ] | | -| 8.3 | show event profile | | -| 8.4 | alarm acknowledge | | -| 8.5 | logging server [ log \| event ] | default is 'log' | -| 8.6 | event profile [ default \| name-of-file ] | | +| 8.3 | alarm acknowledge | | +| 8.4 | logging server [ log \| event ] | default is 'log' | | 9 | gNMI subscription | | | 9.1 | Subscribe to openconfig Event container and Alarm container. All events and alarms published to gNMI subscribed clients. | | | 10 | Clear all events | | @@ -272,12 +269,6 @@ Applications act as producers of events. Event consumer class in eventd container receives and processes the received event. 
Event consumer manages received events, updates event history table, current alarm table, event_stats table and alarm_stats tables and invokes logging API, which constructs message and sends it over to syslog. -Operator can chose to change properties of events with the help of event profile. Default -event profile is available at */etc/evprofile/default.json*. User can download the default event profile, -modify and upload it back to the switch to apply it. - -Through event profile, user can change severity of any event and also can enable/disable a event. - Through CLI, REST or gNMI, event history table and current alarm table can be retrieved using various filters. ### 3.1.1 Event Producers @@ -294,21 +285,17 @@ Developers of new events or alarms need to update this file by declaring name an ``` { - "__README__" : "This is default map of events that eventd uses. Developer can modify this file and send - SIGINT to eventd to make it read and use the updated file. Alternatively developer can test - the new event by adding it to a custom event profile and use 'event profile ' command - to apply that profile without sending SIGINT to eventd. Developer need to commit default.json file - with the new event after testing it out. + "__README__" : "This is default map of events that eventd uses. Supported severities are: CRITICAL, MAJOR, MINOR, WARNING and INFORMATIONAL. Supported enable flag values are: true and false.", "events":[ - { - "name" : "CUSTOM_EVPROFILE_CHANGE", - "severity" : "INFORMATIONAL", - "enable" : "true", - "message" : "Custom Event Profile is applied." 
- }, - { + { + "name" : "SYSTEM_STATUS", + "severity" : "INFORMATIONAL", + "enable" : "true", + "message" : "System Status Information" + }, + { "name": "TEMPERATURE_EXCEEDED", "severity": "CRITICAL", "enable": "true" @@ -330,7 +317,11 @@ definition: Usage: For one-shot events: ``` - LOG_EVENT(CUSTOM_EVPROFILE_CHANGE, profile_name.c_str(), NOTIFY, "New event profile is %s", profile_name.c_str()); + LOG_EVENT(INTERFACE_OPER_STATUS_CHANGE, portname_p, NOTIFY, + "Interface %s oper state changed from %s to %s", + portname_p, oper_status_strings.at(prev).c_str(), + oper_status_strings.at(current).c_str()); + ``` For alarms: @@ -376,7 +367,6 @@ The EVENTPUBSUB table uses event-id and a sequence-id generated locally by event The event consumer is a class in sonic-eventd container that processes the incoming record. On intitialization, event consumer reads */etc/evprofile/default.json* and builds an internal map of events, called *static_event_map*. -It then verifies if there was a custom event profile configured and merges its contents to static_event_map built from default event profile. It then reads from EVENTPUBSUB table. This table contains records that are published by applications and waiting to be read by eventd. Whenever there is a new record, event consumer reads the record, processes and deletes it. @@ -507,9 +497,7 @@ send it to syslog. } ``` An example of syslog message generated for an event raised when user selects a custom event profile. -``` -May 19 21:22:07.122786 2021 sonic WARNING eventd#eventd[2419]: [EVENT], %CUSTOM_EVPROFILE_CHANGE :- handle_custom_evprofile: Custom Event Profile myprofile.json is applied.. Custom Event Profile is selected by user. -``` + Syslog message for an alarm raised by a sensor: ``` May 19 21:42:14.373410 2021 sonic ALERT eventd#eventd[2453]: [ALARM] (RAISE), %TEMPERATURE_EXCEEDED :- temperatureCrossedThreshold: Current temperature of sensor/2 is 76 degrees. Temperature threshold is 75 degrees. 
@@ -563,28 +551,24 @@ One way to fix is to have the system monitor daemon to periodically (very high p ### 3.1.5 Event Profile The Event profile contains mapping between event-id and severity of the event, enable flag. -Through event profile, operator can change severity of a particular event. And can also enable/disable -a particular event. + The default profile exists at */etc/evprofile/default.json* -By default, every event is enabled. The severity of event is decided by developer while adding the event. ``` { "__README__" : "This is default map of events that eventd uses. Developer can modify this file and send - SIGINT to eventd to make it read and use the updated file. Alternatively developer can test - the new event by adding it to a custom event profile and use 'event profile ' command - to apply that profile without sending SIGINT to eventd. Developer need to commit default.json file + SIGINT to eventd to make it read and use the updated file. Developer need to commit default.json file with the new event after testing it out. Supported severities are: CRITICAL, MAJOR, MINOR, WARNING and INFORMATIONAL. Supported enable flag values are: true and false.", "events":[ { - "name" : "CUSTOM_EVPROFILE_CHANGE", - "severity" : "INFORMATIONAL", - "enable" : "true", - "message" : "Custom Event Profile is applied." - }, + "name" : "SYSTEM_STATUS", + "severity" : "INFORMATIONAL", + "enable" : "true", + "message" : "System Status Information" + }, { "name": "TEMPERATURE_EXCEEDED", "severity": "CRITICAL", @@ -594,57 +578,6 @@ The severity of event is decided by developer while adding the event. ] } ``` -User can download the default event profile to a remote host. User can modify characteristics of -some/all events in the profile and can upload it back to the switch and place the file at /etc/evprofile/. - -The uploaded profile will be called custom event profile. - -An example of custom event profile is as below. 
-With this particular custom event profile, user wants to -- change severity of CUSTOM_EVPROFILE_CHANGE event (severity changed from INFORMATIONAL to MAJOR) -- suppress the TEMPERATURE_EXCEEDED alarm (enable flag is changed from true to false) -- introduce new alarm by name DUMMY_ALARM (there should be an application to raise/clear this new alarm). -``` -{ - "events": [ - { - "name" : "CUSTOM_EVPROFILE_CHANGE", - "severity" : "MAJOR", - "enable" : "true", - }, - { - "name": "TEMPERATURE_EXCEEDED", - "severity": "CRITICAL", - "enable": "false" - }, - { - "name" : "DUMMY_ALARM", - "severity" : "WARNING", - "enable" : "true", - } - ] -} -``` - -User can have multiple custom profiles and can select any of the profiles under /etc/evprofile/ using 'event profile' command. - -The framework will sanity check the user selected profile and merges it map of events *static_event_map* maintained by eventd. - -After a successful sanity check, the framework generates an event indicating that a new profile is in effect. - -If there are any outstanding alarms in the current alarm table, the framework removes those records for which enable is set to false in the new profile. -Severity counters in ALARM_STATS are reduced accordingly. - -Eventd starts using the merged map of characteristics for the all the newly generated events. A CUSTOM_EVPROFILE_CHANGE event is generated. - -The event profile is upgrade and downgrade compatible by accepting only those attributes that are *known* to eventd. -All the other attributes will remain to their default values. - -Sanity check rejects the profile if attributes contains values that are not known to eventd. - -Config Migration hooks will be used to persist custom profiles across an upgrade. - -The profile can also be applied through ztp. ### 3.1.6 CLI The show CLI require many filters with range specifiers. 
@@ -703,20 +636,24 @@ action : Indicates action of the event; for one-shot events, it is empt resource : Object which generated the event {string} severity : Severity of the event {string} -127.0.0.1:6379[6]> hgetall "EVENT|1" - 1) "text" - 2) ":- handle_custom_evprofile: Custom Event Profile x.json is applied." - 3) "type-id" - 4) "CUSTOM_EVPROFILE_CHANGE" - 5) "id" - 6) "1" - 7) "time-created" - 8) "1621459327118629520" - 9) "resource" -10) "/etc/evprofile/x.json" +127.0.0.1:63794[15]> hgetall EVENT|21 +1) "time-created" +2) "1696888244771929600" +3) "type-id" +4) "PSU_POWER_STATUS" +5) "text" +6) "PSU 1 is out of power." +7) "action" +8) "RAISE" +9) "resource" +10) "PSU 1" 11) "severity" 12) "WARNING" -127.0.0.1:6379[6]> +13) "id" +14) "21" +15) "acknowledged" +16) "false" + ``` Schema for EVENT_STATS table is as follows: @@ -842,8 +779,6 @@ Table EVENTPUBSUB is used for applications to write events and for eventd to acc Event History Table (EVENT) and Current Alarm Table (ALARM) are used to house events and alarms respectively. To maintain various statistics of events, these two tables are used : EVENT_STATS and ALARM_STATS. -EVPROFILE table is used by mgmt-framework to communicate name of the custom event profile when configured through NBI. -Eventd reads the file name from this table and merges it with its static_event_map. ## 3.3 User Interface ### 3.3.1 Data Models @@ -966,21 +901,6 @@ module: sonic-alarm +--ro acknowledge-time? event:timeticks64 ``` -Following is for sonic yang to support event profiles. -``` -module: sonic-evprofile - - rpcs: - +---x get-evprofile - | +--ro output - | +--ro file-name? string - | +--ro file-list* string - +---x set-evprofile - +---w input - | +---w file-name? string - +--ro output - +--ro status? 
string -``` openconfig alarms yang is defined at [here](https://github.com/openconfig/public/blob/master/release/models/system/openconfig-alarms.yang) @@ -1003,12 +923,6 @@ Un-acknowledging an alarm updates alarm statistics and thereby applications like The alarm record in the ALARM table is marked with acknowledged field set to false. There is acknowledge-time field that indicates when that alarm is un-acknowledged. -``` -sonic# event profile -``` -The command takes name of specified file, validates it for its syntax and values; merges it with its internal static map of events *static_event_map*. - - #### 3.3.2.2 Configuration Commands ``` sonic(config)# logging server [log|event] @@ -1019,56 +933,46 @@ Support with VRF/source-interface and configuring remote-port are all backward c #### 3.3.2.3 Show Commands ``` -sonic# show event profile --------------------------- -Active Event Profile --------------------------- -myProfile.json --------------------------- -Available Event Profiles --------------------------- -default.json -myProfile.json -userProfile.json - sonic# show event [ details | summary | severity | start end | recent <5min|60min|24hr> | id | from to ] 'show event' commands would display all the records in EVENT table. sonic# show event ----------------------------------------------------------------------------------------------------------------------------- -Id Action Severity Name Timestamp Description ----------------------------------------------------------------------------------------------------------------------------- -1 - WARNING CUSTOM_EVPROFILE_CHANGE 2021-05-19T21:38:27.455Z :- handle_custom_evprofile: Custom Event Profile x.json is applied. 
-2 RAISE CRITICAL DUMMY_ALARM 2021-05-19T21:39:31.622Z :- signalHandler: Raising simulated alarm -3 CLEAR CRITICAL DUMMY_ALARM 2021-05-19T21:42:34.371Z :- signalHandler: Clearing simulated alarm -4 RAISE CRITICAL DUMMY_ALARM 2021-05-19T21:46:14.371Z :- signalHandler: Raising simulated alarm -5 ACKNOWLEDGE CRITICAL DUMMY_ALARM 2021-05-19T21:48:05.845Z Alarm id 4 ACKNOWLEDGE. -6 UNACKNOWLEDGE CRITICAL DUMMY_ALARM 2021-05-19T21:53:24.484Z Alarm id 4 UNACKNOWLEDGE. -7 CLEAR CRITICAL DUMMY_ALARM 2021-05-19T21:55:54.977Z :- signalHandler: Clearing simulated alarm +---------------------------------------------------------------------------------------------------- +Id Action Severity Name Timestamp +---------------------------------------------------------------------------------------------------- +1 RAISE WARNING PSU_POWER_STATUS 2023-10-09T21:50:44.771Z +2 - INFORMATIONAL SYSTEM_STATUS 2023-10-09T21:51:02.784Z +3 RAISE CRITICAL DUMMY_ALARM 2023-15-19T21:39:31.622Z +4 CLEAR CRITICAL DUMMY_ALARM 2023-15-19T21:42:34.371Z +5 RAISE CRITICAL DUMMY_ALARM 2023-15-19T21:46:14.371Z +6 ACKNOWLEDGE CRITICAL DUMMY_ALARM 2023-15-19T21:48:05.845Z +7 UNACKNOWLEDGE CRITICAL DUMMY_ALARM 2023-15-19T21:53:24.484Z +8 CLEAR CRITICAL DUMMY_ALARM 2023-15-19T21:55:54.977Z sonic# show event details ---------------------------------------------- Event Details - 1 ---------------------------------------------- -Id: 1 -Action: - -Severity: WARNING -Type: CUSTOM_EVPROFILE_CHANGE -Timestamp 2021-05-19T21:38:27.455Z -Description: :- handle_custom_evprofile: Custom Event Profile x.json is applied. -Source: /etc/evprofile/x.json +Id: 1 +Action: RAISE +Severity: WARNING +Type: PSU_POWER_STATUS +Timestamp: 2023-10-09T21:50:44.771Z +Description: PSU 1 is out of power. 
+Source: PSU 1 + ---------------------------------------------- Event Details - 2 ---------------------------------------------- Id: 2 Action: RAISE -Severity: CRITICAL -Type: DUMMY_ALARM -Timestamp 2021-05-19T21:39:31.622Z -Description: :- signalHandler: Raising simulated alarm -Source: simulation +Severity: INFORMATIONAL +Type: SYSTEM_STATUS +Timestamp: 2023-10-09T21:51:02.784Z +Description: System is ready +Source: system_status ---------------------------------------------- Event Details - 3 @@ -1077,8 +981,8 @@ Id: 3 Action: CLEAR Severity: CRITICAL Type: DUMMY_ALARM -Timestamp 2021-05-19T21:42:34.371Z -Description: :- signalHandler: Clearing simulated alarm +Timestamp: 2023-15-19T21:42:34.371Z +Description: signalHandler: Clearing simulated alarm Source: simulation sonic# show event summary @@ -1200,8 +1104,6 @@ sonic REST links: * /restconf/data/sonic-event:sonic-event/EVENT_STATS/EVENT_STATS_LIST * /restconf/data/sonic-alarm:sonic-alarm/ALARM/ALARM_LIST * /restconf/data/sonic-alarm:sonic-alarm/ALARM_STATS/ALARM_STATS_LIST -* /restconf/operations/sonic-evprofile:get-evprofile -* /restconf/operations/sonic-evprofile:set-evprofile * /restconf/operations/sonic-alarm:acknowledge-alarms * /restconf/operations/sonic-alarm:unacknowledge-alarms @@ -1214,15 +1116,21 @@ openconfig REST links: # 4 Flow Diagrams ![Sequence Diagram](event-alarm-framework-seqdiag.png) -# 5 Warm Boot Support -## 5.1 Application warm boot -Applications confirming to the warm boot, should have stored their state and compare current values against previous values. +# 5 Persistence +Alarms and events are stored in the ALARM and EVENT tables in a separate Redis DB instance called EventDB. +This instance is configured to periodically persist the EventDB to disk. +It is configured to persist once 75 Redis DB changes have accumulated within 180 seconds. This is equal to ~5-6 SONiC events.
+ + +## 5.1 Warm reboot +### 5.1.1 Application restart +Applications conforming to restart should store their state and compare current values against previous values. Such compliant application also "remembers" that it raised an event before for a specific condition. -They would -* not raise alarms/events for the same condition that it raised pre warm boot -* clear those alarms once current state of a particular condition is recovered (by comparing against the stored state). +They would +* not raise alarms/events for the same condition that they raised before the restart. +* clear those alarms once the current state of a particular condition has recovered (by comparing against the stored state). -## 5.2 eventd warm boot +### 5.1.2 Eventd service restart Records from applications are stored in a table, called EVENTPUBSUB. Records that are being written will be queued when the consumer (eventd) is down. @@ -1231,6 +1139,23 @@ During normal operation, eventd reads, processes whenever a new record is added When eventd is restarted, events and alarms raised by applications will be waiting in a queue while eventd is coming up. When eventd eventually comes back up, it reads those records in the queue. +### 5.1.3 System warm reboot +On system warm reboot, the current EventDB is persisted on disk without the ALARM and ALARM_STATS tables. Applications should check the condition after restart and raise the alarm if the condition exists. +Only the EVENT table is persisted on disk across a system warm reboot. This overwrites the DB file from periodic persistence. + + +## 5.2 Fast reboot +On system fast reboot, the current EVENT and EVENT_STATS tables from EventDB are persisted on disk. The ALARM and ALARM_STATS tables are not persisted. Applications have to raise the alarm on restart if the condition exists. +The EventDB is stored on disk before the control plane protocol services shut down, so that fast-reboot times are not impacted. This overwrites the DB file from periodic persistence.
+ +## 5.3 Cold reboot +The current EVENT and EVENT_STATS tables are persisted on disk across a cold reboot. The ALARM and ALARM_STATS tables are not persisted, so applications have to raise the alarm on restart if the condition exists. This overwrites the DB file from periodic persistence. + +## 5.4 Power reset +On a power reset, the EventDB is loaded from the DB file on disk, which comes from periodic persistence. The ALARM and ALARM_STATS tables are removed from the loaded DB. +Applications have to raise the alarm on restart if the condition exists. In this case, events from the previous boot can be missing, because the reset may have happened within the periodic persistence timer interval. + + # 6 Scalability In this feature, scalability applies to Event History Table (EVENT). As it is persistent and it records every event generated on the system, to protect against it growing indefinitely, user can limit its size through a manifest file. @@ -1250,10 +1175,5 @@ The second command displays all the alarms that are waiting to be cleared by app - Verify wrap around for EVENT table ( change manifest file to a lower range and trigger that many events ) - Verify sequence-id for events is persistent by restarting - Verify counters by raising various alarms with different severities -- Change severity of an event through custom event profile and verify it is logged at specified severity -- Change enable/disable of an event through custom event profile and verify it is suppressed -- Verify custom event profile with an invalid severity is rejected -- Verify custom event profile with an invalid enable/disable flag is rejected -- Verify custom event profile is persisted after a reboot - Verify various show commands - Verify 'logging-server event' command forwards only event log messages to the host