
Alerts Event Standard Operating Procedure

Peter Burkholder (@pburkholder) edited this page Aug 30, 2019 · 17 revisions

Data.gov Escalation Path

Whoever receives the alert follows this process:

  • Alert received (via Slack, email, or Twitter)
  • Post an @here note in the #datagov-devsecops Slack channel with the alert content and context: "Looking into this"
  • Click the "Acknowledge" button on the New Relic incident page
  • Open an issue in the Data.gov datagov-deploy repository with a link to the alert
  • Others acknowledge/verify and offer support
    • Include this as additional information in the GitHub issue
  • Assign the person(s) most capable of solving the issue with @user
  • Get written confirmation, as a comment on the issue, that the assigned member is handling it and has the support needed to resolve it.
    • If it is after hours or a weekend and there is no response on the issue within 10 minutes, phone the assigned person to bring it to their attention immediately. This step is not complete until there is written confirmation that work is underway.
  • All project team members are responsible for monitoring progress and updating the GitHub issue with more information and context as it is received.

Note: If at all possible, there should be two sets of eyes to review the code on any emergency changes pushed straight to production.

Once Resolved:

  • Ensure that the issue is fully up to date with everything learned about the problem

  • Schedule a retrospective meeting to review the response, conduct root cause analysis, and update this document with lessons learned.

    • Write up an additional GitHub issue and submit a PR if the problem was preventable via code
    • Prioritize and resolve the GitHub issues

Runbook

This section covers procedures for common alerts and events.

Generic "catalog.gsa.gov" is down error

Resolution

Likely a Solr issue; about 90% of the time it is resolved with:

# From the jumpbox (ssh to jump first):
cd /datagov-deploy/ansible
pipenv shell
# Restart Solr, then Apache on the catalog-web hosts
ansible solr -m shell -a "service solr restart"
ansible catalog-web -m shell -a "service apache2 restart"
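
The two restart commands above can be wrapped in a small script with a dry-run mode, so the responder can see what will run before touching production. This is only a sketch: the `run` and `restart_catalog` helpers and the `DRY_RUN` variable are hypothetical names, not part of datagov-deploy.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Restart Solr first, then Apache on the catalog-web hosts (same order as above).
# The second command runs only if the first one succeeds.
restart_catalog() {
  run ansible solr -m shell -a "service solr restart" &&
  run ansible catalog-web -m shell -a "service apache2 restart"
}
```

Running `DRY_RUN=1 restart_catalog` from /datagov-deploy/ansible inside the pipenv shell previews the commands; `restart_catalog` on its own applies them.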

"awscloudfront-AWS-CloudFront-High-5xx-Error-Rate" and "awscloudfront-AWS-CloudFront-High-4xx-Error-Rate" email notifications

These alerts come from our CloudFront AWS account (separate from BSP and sandbox). These notifications are set up only for catalog on production. They indicate that catalog-web is having an issue.

Resolution

  • Investigate the CKAN error logs on the catalog-web hosts: /var/log/apache2/ckan.error.log
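
For a quick look at the log named above, something like the following can be run on a catalog-web host. It is a sketch rather than an established procedure: the `recent_errors` function name, the 200-line window, and the case-insensitive match on "error" are assumptions chosen for illustration, and reading the log may require sudo.

```shell
# Hypothetical helper: show lines containing "error" (any case) from the
# last N lines of a log file. Defaults to the CKAN error log and 200 lines.
recent_errors() {
  log="${1:-/var/log/apache2/ckan.error.log}"
  lines="${2:-200}"
  tail -n "$lines" "$log" | grep -i 'error'
}
```

Calling `recent_errors` with no arguments uses the defaults; `recent_errors /path/to/other.log 500` widens the window for another file.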

Note: as of March 2019, the GSA DNS system has been having intermittent timeouts that result in a 5xx alarm from CloudFront. You can correlate this against Uptrends alerts, where it appears as a DNS Lookup Error.

Status Check Alarm: "DATAGOV-Production-DatagovSolr1StatusCheckAlarm"

These notifications signal that one of the Solr hosts is not responding properly. Identify which host it is and reboot it. Usually a New Relic "Host unavailable" alert accompanies it.

Resolution

  • Identify the Solr host and reboot it: ansible-playbook actions/reboot.yml --limit $solr_host

"Reboot required" email notification from unattended-upgrades

unattended-upgrades is an apt package that installs the latest OS packages and security updates. Usually packages are updated and services are restarted without any additional action. If this notification is received, a reboot is required to complete the upgrade.

  • ansible-playbook actions/reboot.yml --limit $host
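
On Ubuntu hosts, a pending reboot is normally flagged by the file /var/run/reboot-required, so it can be checked on the host before and after running the playbook. A minimal sketch, assuming the hosts are Ubuntu or Debian; the `needs_reboot` function name is hypothetical, and its optional path argument exists only to make it easy to try out.

```shell
# Hypothetical helper: succeed if the reboot-required flag file exists.
needs_reboot() {
  [ -f "${1:-/var/run/reboot-required}" ]
}

# Report the state of the current host.
if needs_reboot; then
  echo "reboot required"
else
  echo "no reboot pending"
fi
```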

Note: jumpboxes are not included in the reboot playbook and should be rebooted manually:

  • sudo reboot
  • Wait for the jumpbox to restart and confirm connectivity.
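
The wait-and-check step for the jumpbox can be scripted as a retry loop run from the responder's workstation. This is a sketch: `wait_until` is a hypothetical helper, and the host name in the usage comment is a placeholder, since the jumpbox address is not given here.

```shell
# Hypothetical helper: retry a command until it succeeds.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
# Returns 0 on the first success, 1 if every attempt fails.
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (placeholder host): up to 30 SSH attempts, pausing 10 seconds after each failure.
# wait_until 30 10 ssh -o BatchMode=yes -o ConnectTimeout=5 JUMPBOX_HOST true
```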

Broken links detected email notification

TODO

AWS CloudWatch Alarm email notification

These emails have varying subject lines, e.g. ALARM: "DATAGOV-Production-CatalogPostgresRDSDBInstanceReplica2CPUAlarm..." in US East (N. Virginia).

The alarms are configured in the BSP environment, so changing them requires filing a BSP ticket. The resolution is specific to the alarm.
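
Because the resolution depends on which alarm fired, it can help to pull the alarm name out of the subject line before searching for it or quoting it in a BSP ticket. A minimal sketch, assuming the subject follows the `ALARM: "<name>" in <region>` shape shown above; `alarm_name` is a hypothetical helper and the alarm name in the example comment is made up.

```shell
# Hypothetical helper: print the quoted alarm name from a CloudWatch
# alarm email subject read on stdin. Prints nothing if there is no match.
alarm_name() {
  sed -n 's/^ALARM: "\([^"]*\)".*$/\1/p'
}

# Example with a made-up alarm name:
# echo 'ALARM: "EXAMPLE-Alarm" in US East (N. Virginia)' | alarm_name
```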
