Datajson Validator Documentation #123

Merged
merged 7 commits into from
Oct 7, 2022
200 changes: 29 additions & 171 deletions README.md
@@ -12,36 +12,36 @@ Plugin `datajson` provides a harvester to import datasets from other
remote /data.json files. See below for setup instructions.

And the plugin also provides a new view to validate /data.json files
at http://ckanhostname/pod/validate.
at http://ckanhostname/dcat-us/validator.


## Features

_TODO_

Three plugins are provided.

- **datajson** provides data.json export and DCAT-US metadata UI integration
- **datajson_harvest** extends [ckanext-harvest](https://github.com/ckan/ckanext-harvest/) to collect metadata from
remote data.json sources
- **cmsdatanav_harvest** _???_
- [:heavy_check_mark:] **datajson** provides data.json export and DCAT-US metadata UI integration
- Read more about [`datajson`](docs/datajson.md)
- [:heavy_check_mark:] **datajson_harvest** extends [ckanext-harvest](https://github.com/ckan/ckanext-harvest/)
to collect metadata from remote data.json sources
- Read more about [`datajson_harvest`](docs/datajson_harvest.md)
- [:warning:] **cmsdatanav_harvest** extends [ckanext-harvest](https://github.com/ckan/ckanext-harvest/)
to collect metadata from the CMS Data Navigator catalog
- [:heavy_check_mark:] **datajson_validator** provides a web form to validate data.json files for DCAT-US metadata compliance.
- Read more about [`datajson_validator`](docs/datajson_validator.md)
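
The validator route can also be called from a script instead of the web form. The following is a minimal sketch, assuming the form field is named `url` (as it is in this PR's tests in `test_datajson_validation.py`). The helper name `build_validator_request` and the base URL are hypothetical, and the request is only built here, not sent.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request


def build_validator_request(ckan_base_url, datajson_url):
    """Build (but do not send) a form POST to the /dcat-us/validator route.

    Hypothetical helper: the field name `url` mirrors the test suite in this PR.
    """
    body = urlencode({"url": datajson_url}).encode("ascii")
    return Request(
        urljoin(ckan_base_url, "/dcat-us/validator"),
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_validator_request("http://localhost:5000", "https://example.gov/data.json")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would submit it to a running CKAN instance that has `datajson_validator` enabled.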


## Usage


### Requirements

All requirements are tracked in `setup.py` when possible. Some CKAN extensions are not on PyPI, so they
(and their dependencies) must be tracked in `requirements.txt`.
- [ckanext-harvest](https://github.com/ckan/ckanext-harvest/)

This extension is compatible with these versions of CKAN.

CKAN version | Compatibility
------------ | -------------
<=2.7 | no
2.8 | yes
2.9 | [in progress](https://github.com/GSA/datagov-ckan-multi/issues/564)

CKAN version | Compatibility
-------------- | -------------
<=2.7 | :x:
2.8 | :warning:
2.9.5 | :heavy_check_mark:
2.9.6 | :heavy_check_mark:

### Installation

@@ -62,138 +62,10 @@ That's the plugin for /data.json output. To make the harvester available,
also add:

ckan.plugins = (other plugins here...) harvest datajson_harvest

If you're running CKAN via WSGI, we found a strange Python dependency
bug. It might only affect development environments. The fix was to
revise wsgi.py and add:

import ckanext

before

from paste.deploy import loadapp

Then restart your server and check out:

http://yourdomain.com/data.json
and
http://yourdomain.com/data.jsonld
and
http://yourdomain.com/pod/validate


### Caching /data.json

If you're deploying inside Apache, some caching would be a good idea
because generating the /data.json file can take a good few moments.
Enable the cache modules:

a2enmod cache
a2enmod disk_cache

And then in your Apache configuration add:

CacheEnable disk /data.json
CacheRoot /tmp/apache_cache
CacheDefaultExpire 120
CacheMaxFileSize 50000000
CacheIgnoreCacheControl On
CacheIgnoreNoLastMod On
CacheStoreNoStore On

And be sure to create /tmp/apache_cache and make it writable by the Apache process.


### Generating /data.json Off-Line

Generating this file is a little slow, so an alternative instead of caching is
to generate the file periodically (e.g. in a cron job). In that case, you'll want
to change the path that CKAN generates the file at to something *other* than /data.json.
In your CKAN .ini file, in the app:main section, add:

ckanext.datajson.path = /internal/data.json

Now create a crontab file ("mycrontab") to download this URL to a file on disk
every ten minutes:

0-59/10 * * * * wget -qO /path/to/static/data.json http://localhost/internal/data.json

And activate your crontab like so:

crontab mycrontab

In Apache, we'll want to block outside access to the "internal" URL, and also
map the URL /data.json to the static file. In your httpd.conf, add:

Alias /data.json /path/to/static/data.json

<Location /internal/>
Order deny,allow
Allow from 127.0.0.1
Deny from all
</Location>

And then restart Apache. Wait for the cron job to run once, then check if
/data.json loads (and it should be fast!). Also double check that
http://yourdomain.com/internal/data.json gives a 403 forbidden error when
accessed from some other location.


### Configuration

You can customize the URL that generates the data.json output:

ckanext.datajson.path = /data.json
ckanext.datajsonld.path = /data.jsonld
ckanext.datajsonld.id = http://www.youragency.gov/data.json

You can enable or disable the Data.json output by setting

ckanext.datajson.url_enabled = False

If ckanext.datajsonld.path is omitted, it defaults to replacing ".json" in your
ckanext.datajson.path path with ".jsonld", so it probably won't need to be
specified.

The option ckanext.datajsonld.id is the @id value used to identify the data
catalog itself. If not given, it defaults to ckan.site_url.

You can specify which export map file to use to generate the data.json output:

ckanext.datajson.export_map_filename = export.map.json

There are three map files available in folder [export_map](https://github.com/GSA/ckanext-datajson/tree/main/ckanext/datajson/export_map)
to choose from, or you can add your own in the same folder. By default, it looks
for file `export.map.json`, if not found, it defaults to
`export.catalog.map.sample.json`.

### Harvesting
To make the datajson validator route and web form available, also add:

To use the data.json harvester, you'll also need to set up the CKAN harvester
extension. See the CKAN harvester README at https://github.com/okfn/ckanext-harvest
for how to do that. You'll set some configuration variables and then initialize the
CKAN harvester plugin using:

paster --plugin=ckanext-harvest harvester initdb --config=/path/to/ckan.ini

Now you can set up a new DataJson harvester by visiting:

http://yourdomain.com/harvest

And when configuring the data source, just choose "/data.json" as the source type.

**The next paragraph assumes you're using my fork of the CKAN harvest extension
at https://github.com/JoshData/ckanext-harvest**

In the configuration field, you can put a YAML string containing defaults for fields
that may not be set in the source data.json files, e.g. enter something like this:

defaults:
Agency: Department of Health & Human Services
Author: Substance Abuse & Mental Health Services Administration
author_id: http://healthdata.gov/id/agency/samhsa

This again is tied to the HealthData.gov metadata schema.
ckan.plugins = (other plugins here...) datajson_validator


## Development
@@ -212,11 +84,11 @@ CKAN will start at [localhost:5000](http://localhost:5000/).

Clean up any containers and volumes.

$ make down
$ make clean

Open a shell to run commands in the container.

$ docker-compose exec ckan bash
$ docker-compose exec app /bin/bash

If you're unfamiliar with docker-compose, see our
[cheatsheet](https://github.com/GSA/datagov-deploy/wiki/Docker-Best-Practices#cheatsheet)
@@ -230,7 +102,7 @@ For additional make targets, see the help.
### Testing

They follow the guidelines for [testing CKAN
extensions](https://docs.ckan.org/en/2.8/extensions/testing-extensions.html#testing-extensions).
extensions](https://docs.ckan.org/en/2.9/extensions/testing-extensions.html#testing-extensions).

To run the extension tests, start the containers with `make up`, then:

@@ -243,13 +115,7 @@ Lint the code.

### Matrix builds

The existing development environment assumes a full catalog.data.gov test setup. This makes
it difficult to develop and test against new versions of CKAN (or really any
dependency) because everything is tightly coupled and would require us to
upgrade everything at once which doesn't really work. A new make target
`test-new` is introduced with a new docker-compose file.

The "new" development environment drops as many dependencies as possible. It is
The test development environment drops as many dependencies as possible. It is
not meant to have feature parity with
[GSA/catalog.data.gov](https://github.com/GSA/catalog.data.gov/). Tests should
mock external dependencies where possible.
@@ -258,20 +124,12 @@ In order to support multiple versions of CKAN, or even upgrade to new versions
of CKAN, we support development and testing through the `CKAN_VERSION`
environment variable.

$ make CKAN_VERSION=2.8 test
$ make CKAN_VERSION=2.9.5 test
$ make CKAN_VERSION=2.9 test


Legacy nose tests are still supported. You must specify `COMPOSE_FILE=docker-compose.legacy.yml`
when interacting with this environment.

$ make COMPOSE_FILE=docker-compose.legacy.yml up
$ make COMPOSE_FILE=docker-compose.legacy.yml test-legacy

Variable | Description | Default
-------- | ----------- | -------
CKAN_VERSION | Version of CKAN to use. | 2.8
COMPOSE_FILE | docker-compose service description file. | docker-compose.yml

Note: When testing a patch version of CKAN, the supporting services may not publish a matching
patch release. The `SERVICES_VERSION` variable therefore tracks the minor release to
pull for the `db` and `solr` images.
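
As an illustration of the relationship between the two variables, a minor release can be derived from a patch version by keeping the first two components. This is only a sketch of the idea; the function name `services_version` is hypothetical, and the repository's Makefile may set `SERVICES_VERSION` differently.

```python
def services_version(ckan_version):
    """Reduce a CKAN version such as '2.9.5' to its minor release, '2.9'.

    Hypothetical helper: shows the idea behind SERVICES_VERSION, not the
    project's actual Makefile logic.
    """
    return ".".join(ckan_version.split(".")[:2])


print(services_version("2.9.5"))  # 2.9
print(services_version("2.9"))    # 2.9
```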


## Credit / Copying
7 changes: 0 additions & 7 deletions ckanext/datajson/blueprint.py
@@ -360,11 +360,6 @@ def validator():

grouped_errors = {}
for error in errors:
# print(error)
print("....................................")
print(error.absolute_path)
# print(error.instance)
print(error.message)
if error.absolute_path == deque([]):
key = "The root of data.json"
else:
@@ -377,8 +372,6 @@ def validator():
c.errors.append((
'%s has a problem' % path,
['%s.' % e.message for e in errors]))
for suberror in sorted(error.context, key=lambda e: e.schema_path):
print(list(suberror.schema_path), suberror.message, sep=", ")

except Exception as e:
c.errors.append(("Internal Error", ["Something bad happened: " + str(e)]))
4 changes: 2 additions & 2 deletions ckanext/datajson/tests/test_datajson_validation.py
@@ -75,7 +75,7 @@ def test_data_json_missing_dataset_fields(self, app):
''' Test that an invalid data.json that is missing dataset fields fails '''

res = app.post('/dcat-us/validator', data={
'url': ('https://raw.githubusercontent.com/GSA/ckanext-datajson/datajson-validator/ckanext/datajson/'
'url': ('https://raw.githubusercontent.com/GSA/ckanext-datajson/main/ckanext/datajson/'
'tests/datajson-samples/missing-dataset-fields.data.json')
})

@@ -98,7 +98,7 @@ def test_data_json_missing_catalog_fields(self, app):
''' Test that an invalid data.json that is missing catalog fields fails '''

res = app.post('/dcat-us/validator', data={
'url': ('https://raw.githubusercontent.com/GSA/ckanext-datajson/datajson-validator/ckanext/datajson/'
'url': ('https://raw.githubusercontent.com/GSA/ckanext-datajson/main/ckanext/datajson/'
'tests/datajson-samples/missing-catalog.data.json')
})

90 changes: 90 additions & 0 deletions docs/datajson.md
@@ -0,0 +1,90 @@
# Data.json

This page collects tips and useful details for enabling the
`datajson` plugin.


## Configuration

You can customize the URL that generates the data.json output:

ckanext.datajson.path = /data.json
ckanext.datajsonld.path = /data.jsonld
ckanext.datajsonld.id = http://www.youragency.gov/data.json

You can enable or disable the Data.json output by setting

ckanext.datajson.url_enabled = False

If ckanext.datajsonld.path is omitted, it defaults to replacing ".json" in your
ckanext.datajson.path path with ".jsonld", so it probably won't need to be
specified.

The option ckanext.datajsonld.id is the @id value used to identify the data
catalog itself. If not given, it defaults to ckan.site_url.
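
The fallbacks described above can be summarized in a short sketch. It illustrates the documented behavior and is not the plugin's actual code: the function name `resolve_datajson_settings`, the plain-dict `config`, and the assumed default of `/data.json` for `ckanext.datajson.path` are all hypothetical.

```python
import re


def resolve_datajson_settings(config):
    """Apply the documented fallbacks to a dict of CKAN config options."""
    # Assumption: /data.json is the export path when none is configured.
    path = config.get("ckanext.datajson.path", "/data.json")
    # Without an explicit JSON-LD path, swap the trailing ".json" for ".jsonld".
    ld_path = config.get("ckanext.datajsonld.path") or re.sub(r"\.json$", ".jsonld", path)
    # Without an explicit catalog @id, fall back to the site URL.
    ld_id = config.get("ckanext.datajsonld.id") or config.get("ckan.site_url")
    return {"path": path, "ld_path": ld_path, "ld_id": ld_id}


settings = resolve_datajson_settings({
    "ckanext.datajson.path": "/internal/data.json",
    "ckan.site_url": "http://www.youragency.gov",
})
print(settings["ld_path"])  # /internal/data.jsonld
print(settings["ld_id"])    # http://www.youragency.gov
```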

You can specify which export map file to use to generate the data.json output:

ckanext.datajson.export_map_filename = export.map.json

There are three map files available in the [export_map](https://github.com/GSA/ckanext-datajson/tree/main/ckanext/datajson/export_map)
folder to choose from, or you can add your own to the same folder. By default, the plugin looks
for the file `export.map.json`; if that file is not found, it falls back to
`export.catalog.map.sample.json`.
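
The lookup order for the export map can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: the function name `pick_export_map` and its parameters are hypothetical, and only the two file names come from the text above.

```python
from pathlib import Path


def pick_export_map(map_dir, configured="export.map.json",
                    fallback="export.catalog.map.sample.json"):
    """Return the configured map file if it exists, otherwise the sample map.

    Hypothetical helper: `map_dir` stands in for the export_map folder.
    """
    candidate = Path(map_dir) / configured
    if candidate.is_file():
        return candidate
    return Path(map_dir) / fallback


# Example (path is a placeholder):
# pick_export_map("ckanext/datajson/export_map")
```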


## Caching /data.json

If you're deploying inside Apache, some caching would be a good idea
because generating the /data.json file can take a good few moments.
Enable the cache modules:

a2enmod cache
a2enmod disk_cache

And then in your Apache configuration add:

CacheEnable disk /data.json
CacheRoot /tmp/apache_cache
CacheDefaultExpire 120
CacheMaxFileSize 50000000
CacheIgnoreCacheControl On
CacheIgnoreNoLastMod On
CacheStoreNoStore On

And be sure to create /tmp/apache_cache and make it writable by the Apache process.


## Generating /data.json Off-Line

Generating this file is slow, so an alternative to caching is
to generate the file periodically (e.g. in a cron job). In that case, change
the path at which CKAN serves the generated file to something *other* than /data.json.
In your CKAN .ini file, in the app:main section, add:

ckanext.datajson.path = /internal/data.json

Now create a crontab file ("mycrontab") to download this URL to a file on disk
every ten minutes:

0-59/10 * * * * wget -qO /path/to/static/data.json http://localhost/internal/data.json

And activate your crontab like so:

crontab mycrontab

In Apache, we'll want to block outside access to the "internal" URL, and also
map the URL /data.json to the static file. In your httpd.conf, add:

Alias /data.json /path/to/static/data.json

<Location /internal/>
Order deny,allow
Allow from 127.0.0.1
Deny from all
</Location>

And then restart Apache. Wait for the cron job to run once, then check if
/data.json loads (and it should be fast!). Also double check that
http://yourdomain.com/internal/data.json gives a 403 forbidden error when
accessed from some other location.
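
If a script is preferred over `wget` for the cron job, the download can be written to a temporary file and then renamed into place, so that Apache never serves a half-written file. This is a sketch under stated assumptions: the function name `refresh_static_copy` and the injectable `fetch` parameter are hypothetical, and the URL and path in the usage comment are the placeholders used above.

```python
import os
import tempfile
from urllib.request import urlopen


def refresh_static_copy(source_url, dest_path, fetch=None):
    """Download source_url and atomically replace dest_path with the result.

    Hypothetical helper; `fetch` can be swapped out for testing.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=60) as response:
                return response.read()
    payload = fetch(source_url)
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Write next to the destination so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        # mkstemp creates the file as 0600; make it readable by the web server.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Example, using the placeholder values from this page:
# refresh_static_copy("http://localhost/internal/data.json", "/path/to/static/data.json")
```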