inventory.data.gov

James Brown edited this page Mar 16, 2022 · 26 revisions

inventory.data.gov, a.k.a. Inventory, is used by federal agencies to manage metadata for their datasets. Inventory generates the agency's data.json, which must be hosted on the agency's website (e.g. agency.gov/data.json). Inventory is a CKAN instance and can host datasets in addition to metadata.

Access

Access to Inventory is historically confusing. Several mechanisms refer to public/private, and each means something different. Since Inventory contains only open data, all data within Inventory can be publicly accessible. However, for historical reasons, datasets are only visible to authenticated users while resources are publicly visible.

  • CKAN private: true is a property on the dataset which is shown on the organization's dataset listing and affects visibility within CKAN. Only members of an organization can see private datasets within their own organization. Since Inventory only contains open data and any metadata is published on catalog.data.gov, there is no reason to mark a dataset as private. The CKAN private field is ignored by Inventory's data.json export.
  • DCAT-US “public access level” doesn’t mean anything to CKAN and does not affect how the dataset is displayed within Inventory. A dataset with "accessLevel": "non-public" will still appear in the data.json inventory.
  • Publishing status (draft or published) is an Inventory concept which determines whether the dataset is included in the data.json when exporting from Inventory. It does not affect visibility within CKAN. Any authenticated user will still be able to see draft datasets. Public users will be able to see the resources of draft datasets. TODO: is this defined in ckanext-datajson?

Resources, the actual data files uploaded to individual datasets, do not have a concept of private and inherit visibility settings from the dataset. Any dataset that includes resource files hosted on Inventory must be marked private: false, otherwise the resource files will not be accessible to anonymous users. This includes some of GSA's hosted datasets that are available by download or the datastore API.
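The rules above can be summarized in a small sketch. This is only an illustration of the behavior described on this page, not code from Inventory or CKAN. The `publishing_status` key is a hypothetical name, since this page does not say where the publishing status is stored.

```python
def describe_access(dataset: dict) -> dict:
    """Summarize the access rules described on this page for one dataset."""
    private = bool(dataset.get("private", False))
    # Hypothetical key: the real field name for publishing status is not given here.
    published = dataset.get("publishing_status") == "published"
    return {
        # Only publishing status decides data.json inclusion; the CKAN
        # private flag and the DCAT-US accessLevel are both ignored.
        "in_data_json": published,
        # Datasets are not shown to anonymous users, for historical reasons.
        "dataset_visible_to_anonymous": False,
        # Private datasets are limited to members of the owning organization.
        "dataset_visible_to": (
            "organization members" if private else "any authenticated user"
        ),
        # Resources inherit visibility from the dataset's private flag.
        "resources_visible_to_anonymous": not private,
    }


draft = {"private": False, "accessLevel": "non-public", "publishing_status": "draft"}
summary = describe_access(draft)
```

In this example the draft dataset is left out of data.json, yet its resources stay reachable by anonymous users because the dataset is not private.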

Why are datasets not visible to the public? 🤷 Maybe because of the confusion around Publishing Status. See GSA/data.gov#2095.

Environments

Instance      URL
Production    inventory.data.gov
Staging       inventory-datagov.dev-ocsit.bsp.gsa.gov
Sandbox       inventory.sandbox.datagov.us

Dependencies

Sub-components:

  • ckan
  • datapusher

Services:

  • apache2
  • rds
  • redis
  • s3
  • solr

Logs

  • /var/log/inventory/ckan.custom.log
  • /var/log/inventory/ckan.error.log
  • /var/log/inventory/datapusher.custom.log
  • /var/log/inventory/datapusher.error.log

Common tasks

Importing from data.json

ckanpyimport is used in onboarding new agencies to inventory.data.gov. This tool imports datasets from a data.json file.

The import script will happily create duplicates, so if there are any existing datasets in the organization, you probably should delete them all first.

Run this from the jumpbox using nohup or tmux so that disconnecting your session does not interrupt the script. The script can take a while depending on how many packages need to be imported (~2 hours for 1000 datasets). You should also test against staging before running against production.
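Before starting an import, it can help to check how many datasets the file contains and roughly how long the run will take. The sketch below is not part of ckanpyimport. It assumes the data.json follows the DCAT-US v1.1 layout with a top-level "dataset" array, and it scales the rough figure of 2 hours per 1000 datasets linearly, which is an assumption.

```python
import json

HOURS_PER_1000_DATASETS = 2  # rough figure quoted on this page


def load_catalog(path: str) -> dict:
    """Read a data.json file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_datasets(catalog: dict) -> int:
    """Count entries in the top-level "dataset" array (DCAT-US v1.1 assumed)."""
    return len(catalog.get("dataset", []))


def estimate_import_hours(dataset_count: int) -> float:
    """Scale the rough 2-hours-per-1000 figure linearly (an assumption)."""
    return dataset_count / 1000 * HOURS_PER_1000_DATASETS
```

For example, `estimate_import_hours(count_datasets(load_catalog("data.json")))` gives a ballpark for deciding whether the run needs nohup or tmux.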

Alert conditions

Health Checks

The NetScaler configuration verifies that sites are working and directs traffic only to working machines. To check that a server is responding, NetScaler sends a HEAD request to https://{host_name}/api/action/status_show and expects a 200 response code. Latest health check configuration.
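The same check can be reproduced by hand when debugging a host. This is a minimal sketch using Python's standard library, not the NetScaler monitor itself. The function names and the 5 second timeout are assumptions made for illustration.

```python
import urllib.request


def build_health_check(host_name: str) -> urllib.request.Request:
    """Build the HEAD request for the status_show endpoint."""
    url = f"https://{host_name}/api/action/status_show"
    return urllib.request.Request(url, method="HEAD")


def is_healthy(host_name: str, timeout: float = 5.0) -> bool:
    """Return True only when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(
            build_health_check(host_name), timeout=timeout
        ) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Calling `is_healthy("inventory.data.gov")` performs the request. Note that `urlopen` follows redirects, so a redirected host could still report healthy here.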

Database Initialization

To start a new system from scratch, you'll need to do some manual SQL work.

First, you'll need to manually create the 3 databases: inventory ckan, inventory datastore, and datapusher (all at the same endpoint). If a deploy has already been run, you can find the credentials in the production.ini and datapusher.wsgi files.

Once those are created, you'll need the following postgres accounts: ckan (should already exist in sandbox as the master postgres user), datastore user, datastore read_only, and datapusher (pull usernames and passwords from the same files).
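As a rough aid, the statements for the databases and accounts listed above could be generated as follows. Every database and role name in this sketch is a placeholder; take the real names from production.ini and datapusher.wsgi. Passwords are deliberately left out and would need to be set separately.

```python
# Placeholder names: replace with the values from production.ini and datapusher.wsgi.
DATABASES = ["inventory_ckan", "inventory_datastore", "datapusher"]
# The ckan account is assumed to exist already, as noted for sandbox.
ROLES = ["datastore_user", "datastore_readonly", "datapusher"]


def bootstrap_sql(databases: list[str], roles: list[str]) -> list[str]:
    """Return CREATE statements for the given databases and login roles."""
    statements = [f'CREATE DATABASE "{name}";' for name in databases]
    statements += [f'CREATE ROLE "{name}" LOGIN;' for name in roles]
    return statements


if __name__ == "__main__":
    # Print the statements for review before running them with psql.
    for statement in bootstrap_sql(DATABASES, ROLES):
        print(statement)
```

The output is meant to be reviewed and run by hand against the postgres endpoint. It does not grant any permissions.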

You should be able to set up the permissions appropriately using set_permissions.sql. You'll need to insert the correct usernames.

You can use set_permissions.sql as a template for granting the datapusher postgres account access to the datapusher database.

The playbook/deploy can now be run normally.
