Runbook

This document describes common procedures for operation of Data.gov.

BSP access

In order to access production and staging, you must connect to the jumpbox through the GSA VPN. If you have GSA Furnished Equipment (GFE), use AnyConnect. Otherwise, use Horizon.

Horizon requires setting up an SSH client and converting your SSH key to PPK format.

catalog.data.gov

This describes the service running at catalog.data.gov.

Harvest source stats

ckan-php-manager performs several tasks, including generating a report on harvest sources. See README for full instructions.

$ php cli/harvest_stats_csv.php

Columns include:

title
name
url
created
source_type
org title
org name
last_job_started
last_job_finished
total_datasets
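
As a follow-up to the report, here is a minimal sketch for listing harvest sources that currently have no datasets. It assumes the CSV has a header row using the column names above and that no field contains an embedded comma or quote; neither is confirmed here, so check the README and a sample of the output first. The function name and the `report.csv` filename are hypothetical.

```shell
#!/usr/bin/env bash
# Print the "name" of every harvest source whose total_datasets is 0.
# Assumptions: header row present, plain comma-separated fields with no
# embedded commas or quotes. Reads the CSV on stdin.
empty_harvest_sources() {
  awk -F',' '
    NR == 1 {
      # Locate columns by header name rather than by position.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("name" in col) || !("total_datasets" in col)) {
        print "missing expected header columns" > "/dev/stderr"
        exit 1
      }
      next
    }
    $(col["total_datasets"]) == 0 { print $(col["name"]) }
  '
}

# Example (report.csv is a hypothetical filename for the saved report):
#   empty_harvest_sources < report.csv
```

Looking columns up by header name keeps the sketch working if the column order changes, at the cost of requiring the header row.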

Alert conditions

Common alerts we see for catalog.data.gov.

Rapid consumption of memory

Usually manifesting as a New Relic Host Unavailable alarm, the apache2 services (CKAN) consume more and more memory in a short amount of time until they eventually lock up and become unresponsive. This condition seems to affect multiple hosts at the same time.

Resolution

From the jumpbox, reload apache2 across the web hosts using Ansible, one host at a time (-f 1):

$ ansible -m service -a 'name=apache2 state=reloaded' -f 1 catalog-web-v1

For any individual failed hosts, use retry_ssh.sh to repeatedly retry the apache2 restart on the host. Run this in a tmux session to prevent disconnects.
```
$ ./retry_ssh.sh $host sudo service apache2 restart
```
Because the OOM killer might have killed some services in order to recover, reboot hosts as necessary.
```
$ ansible-playbook actions/reboot.yml --limit catalog-web-v1 -e '{"force_reboot": true}'
```
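
The retry step above relies on retry_ssh.sh, whose contents are not shown here. For readers without that script at hand, a generic retry loop might look like the sketch below. The `retry` function, its arguments, and the example attempt count and delay are all hypothetical and are not taken from retry_ssh.sh.

```shell
#!/usr/bin/env bash
# Hypothetical retry helper; this is NOT the contents of retry_ssh.sh.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  # Keep running the command until it exits 0 or attempts run out.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (from the jumpbox, inside tmux; 20 and 15 are arbitrary values):
#   retry 20 15 ssh "$host" sudo service apache2 restart
```

Capping the number of attempts means an unreachable host eventually surfaces as a failure rather than looping forever.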

inventory.data.gov

Alerts

This section covers procedures for common alerts and events.

We designate two types of alerts, "Critical" and "Warning".

Critical alerts require you to drop what you are doing and respond, because an outage is in progress or action is needed to prevent one. Critical alerts should go to #datagov-alerts as well as email.

Warnings are things that indicate a problem, but they can wait until Monday morning to be addressed. We usually send warnings as email notifications to datagovhelp@.

New Relic Incident OPENED:'Host unavailable (production)'

This is a generic alarm that is triggered when a host has not reported to New Relic for 5 minutes. It can indicate any issue that leaves the system unresponsive.

Resolution

Check the host in New Relic for obvious issues (high memory or CPU load).
SSH into the host, verify that newrelic-infra is running.
```
$ sudo service newrelic-infra status
```
See role-specific runbooks for more information.
Reboot the host if you still cannot resolve.
If you cannot SSH into the host, open a ticket with FCS.
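
Once you are on the host, memory pressure can also be checked locally instead of through New Relic. The sketch below computes the percentage of memory in use from /proc/meminfo, using the kernel's MemTotal and MemAvailable fields. The function name is hypothetical, and what counts as "high" is left to your judgment.

```shell
#!/usr/bin/env bash
# Print the percentage of memory in use, given /proc/meminfo-style input
# on stdin. Uses MemTotal and MemAvailable, both reported in kB.
mem_used_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      # Fail rather than divide by zero if MemTotal was not found.
      if (total <= 0) exit 1
      printf "%d\n", (total - avail) * 100 / total
    }
  '
}

# Example, on the affected host:
#   mem_used_pct < /proc/meminfo
```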

Generic "catalog.gsa.gov" is down error from Uptrends

Resolution

Investigate CKAN error logs on catalog-web hosts.

$ dsh -g catalog-web -c -M sudo tail -f /var/log/apache2/ckan.error.log

If it appears to be a solr issue, restart the solr service.

SSH to the jumpbox, then run:
cd datagov-deploy/ansible
pipenv shell
ansible solr -m service -a "name=solr state=restarted" -f 1
ansible catalog-web -m service -a "name=apache2 state=restarted" -f 1
ansible inventory-web -m service -a "name=apache2 state=restarted" -f 1
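
The restart sequence above can be wrapped so that it stops at the first failure instead of continuing on to the web hosts. This is a sketch only: the function names and the `DRY_RUN` switch are illustrative additions, not part of datagov-deploy. The ansible invocations are the same as the ones listed above.

```shell
#!/usr/bin/env bash
# Restart one service across an Ansible group, one host at a time (-f 1).
# DRY_RUN=1 (an illustrative addition) prints the command instead of
# running it; the printed form is for display and drops shell quoting.
restart_group() {
  local group="$1" service="$2"
  local cmd=(ansible "$group" -m service -a "name=${service} state=restarted" -f 1)
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Restart solr first, then the web tiers, stopping at the first failure.
restart_search_stack() {
  restart_group solr solr &&
    restart_group catalog-web apache2 &&
    restart_group inventory-web apache2
}

# Example (from datagov-deploy/ansible inside the pipenv shell):
#   DRY_RUN=1 restart_search_stack   # preview the commands
#   restart_search_stack             # run them
```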

"awscloudfront-AWS-CloudFront-High-5xx-Error-Rate" and "awscloudfront-AWS-CloudFront-High-4xx-Error-Rate" email notifications

These alerts come from our CloudFront AWS account (this is separate from BSP or sandbox). These notifications are only set up for catalog on production. They indicate that catalog-web is having an issue.

Resolution

Investigate CKAN error logs on catalog-web hosts.

$ dsh -g catalog-web -c -M sudo tail -f /var/log/apache2/ckan.error.log

Note: as of March 2019, the GSA DNS system is having intermittent timeouts that result in a 5xx alarm from CloudFront. You can correlate this against Uptrends alerts, where the same problem appears as a DNS Lookup Error.

Status Check Alarm: "DATAGOV-Production-DatagovSolr1StatusCheckAlarm"

These notifications signal that one of the solr hosts is not responding properly. Identify which host it is and reboot the host. Usually there is a New Relic "Host unavailable" alert that goes with it.

Resolution

Identify the solr host and reboot it.

$ ansible-playbook actions/reboot.yml --limit $solr_host

"Host needs a reboot" email notification from unattended-upgrades/reboot-notifier

unattended-upgrades is an apt package that installs the latest OS packages and security updates. Usually packages are updated and services are restarted without any additional action. If this notification is received, a reboot is required to complete the upgrade. reboot-notifier will also send out a notification if a reboot is required for some other reason.

ansible-playbook actions/reboot.yml --limit $host

Note: jumpboxes are not included in the reboot playbook and should be rebooted manually:

sudo reboot
Wait for the jumpbox to restart, then confirm that you can connect.

If any host does not come back up, open a ticket with BSP.
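
Waiting for a rebooted host such as the jumpbox to accept SSH connections again can be scripted rather than checked by hand. The sketch below polls a TCP port until it answers or a deadline passes. The function name, the `$jumpbox` variable, and the 300-second timeout in the example are hypothetical, and the sketch assumes bash (for /dev/tcp) and the coreutils `timeout` command are available where you run it.

```shell
#!/usr/bin/env bash
# Wait until a TCP port accepts connections, or give up after a deadline.
# Usage: wait_for_port <host> <port> <timeout_seconds>
# Relies on bash's /dev/tcp redirection and the coreutils `timeout` command.
wait_for_port() {
  local host="$1" port="$2"
  local deadline=$((SECONDS + $3))
  while [ "$SECONDS" -lt "$deadline" ]; do
    # Each probe is limited to 2 seconds so a silent host cannot hang it.
    if timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Example ($jumpbox is a placeholder for the jumpbox hostname):
#   wait_for_port "$jumpbox" 22 300 && echo "SSH port is reachable again"
```

An open port only shows that sshd is listening, so still log in to confirm the host is healthy.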

Broken links detected email notification

Broken links were detected on pages within WordPress. The WP content manager is responsible for following up on these notifications.

AWS CloudWatch Alarm email notification

These emails have different subject lines, e.g., "ALARM: "DATAGOV-Production-CatalogPostgresRDSDBInstanceReplica2CPUAlarm..." in US East (N. Virginia)".

The alarms are configured in the BSP environment and therefore need to be edited by creating BSP tickets. The resolution is specific to the alarm.

Troubleshooting Nessus scans

SecOps performs regular scans on our hosts. Occasionally, there is an issue with the nessus agent and the ISSO might contact us regarding specific IPs that could not be authenticated.

Look up the host IP address in our System Inventory to confirm it is still a valid host.
Connect to the host and check the agent status.

$ sudo /opt/nessus_agent/sbin/nessuscli agent status

The output should show that the agent is linked to a Helix endpoint, with no errors.

If you need to re-link the nessus agent, run the common playbook with the nessus tag.

ansible-playbook common.yml --tags nessus --limit $host
