In Feb/Mar 2017 the data.govt.nz dataset portal was moved to a new CKAN data portal platform. To make sure the migration was repeatable and not just a copy-and-paste job (prone to data loss and error), we used the built-in CKAN API and a number of pragmatic scripts to wrangle the data loading. As a result, we were able to migrate faster and make use of other existing government services (such as the govt.nz organisation API, which populated our initial list of data-releasing organisations).
Install the Python-based CKANAPI Command Line Interface (CLI); see https://github.com/ckan/ckanapi for instructions. It lets you use the CKAN API to push data into and pull data out of any CKAN installation, which makes it very useful for migration work.
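CKANAPI exports are stored with one JSON object per line. This guide names those files with a .jsonld extension after JSON Linked Data (see http://json-ld.org/), although the ckanapi documentation describes the format as JSON Lines.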
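To get a feel for the export layout, here is a minimal Python sketch that writes and reads records stored one JSON object per line, which is how the ckanapi documentation describes its dump files. The function names and the sample record are made up for illustration and are not part of ckanapi.

```python
import io
import json


def write_records(records, stream):
    """Write each record as one JSON object per line."""
    for record in records:
        stream.write(json.dumps(record, sort_keys=True) + "\n")


def read_records(stream):
    """Yield one dict per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical sample record, not real portal data.
    sample = [{"name": "marlborough-district-council", "state": "active"}]
    buffer = io.StringIO()
    write_records(sample, buffer)
    buffer.seek(0)
    print(list(read_records(buffer)))
```

Because each line is a complete record, a large export can be inspected or filtered line by line without loading the whole file into memory.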
$ cd /path/to/script
$ php ./organisations.php
ckanapi load organizations -I ./organisations.jsonld -r CKAN_URL -a API_KEY
- Log in as sysadmin and find the non-active organisation (most NZ govt orgs exist already as they were loaded from a full list of all govt orgs).
- Copy the id of the org, e.g. marlborough-district-council.
- Run the following ckanapi command:
ckanapi action organization_update id=marlborough-district-council state=active -r CKAN_URL -a API_KEY
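If several organisations need activating, the same call can be looped from Python. This is only a sketch: `activate_organisations` is a hypothetical helper, and `ckan` is assumed to be any object with a `call_action(action, data_dict)` method, such as the `RemoteCKAN` class shipped with ckanapi. The block uses a small stand-in client so it runs without a CKAN server.

```python
def activate_organisations(ckan, org_ids):
    """Set state=active on each organisation and return the ids processed."""
    updated = []
    for org_id in org_ids:
        # Same request as:
        #   ckanapi action organization_update id=<org_id> state=active
        ckan.call_action("organization_update", {"id": org_id, "state": "active"})
        updated.append(org_id)
    return updated


class DryRunClient:
    """Stand-in client (an assumption for this sketch) that only prints calls."""

    def call_action(self, action, data_dict=None):
        print(action, data_dict)
        return {}


if __name__ == "__main__":
    activate_organisations(DryRunClient(), ["marlborough-district-council"])
```

Against a real portal you would pass something like `RemoteCKAN(CKAN_URL, apikey=API_KEY)` in place of the dry-run client.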
ckanapi load groups -I ./groups.jsonld -r CKAN_URL -a API_KEY
Note: this does not purge the datasets; it only sets their state to deleted.
ckanapi action package_list -j -r CKAN_URL | ckanapi delete datasets -r CKAN_URL -a API_KEY
This pipes a list of existing datasets into the ckanapi delete command.
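The same list-then-delete pipeline can be sketched in Python. Treat this as an illustration under assumptions: `delete_all_datasets` is a hypothetical helper, `ckan` is any object with a `call_action(action, data_dict)` method (for example ckanapi's `RemoteCKAN`), and the optional `purge` switch relies on CKAN's `dataset_purge` action, which normally requires sysadmin rights.

```python
def delete_all_datasets(ckan, purge=False):
    """Delete (or purge) every dataset returned by package_list.

    Returns the list of dataset names that were processed.
    """
    # package_list returns the names of the datasets on the portal.
    names = ckan.call_action("package_list", {})
    # package_delete only marks a dataset as deleted;
    # dataset_purge removes it from the database entirely.
    action = "dataset_purge" if purge else "package_delete"
    for name in names:
        ckan.call_action(action, {"id": name})
    return names


class DryRunClient:
    """Stand-in client (an assumption for this sketch) with two fake datasets."""

    def call_action(self, action, data_dict=None):
        print(action, data_dict)
        return ["dataset-one", "dataset-two"] if action == "package_list" else {}


if __name__ == "__main__":
    delete_all_datasets(DryRunClient())
```

Keeping deletion and purging behind one flag makes it easy to do a reversible pass first and only purge once the result looks right.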
ckanapi dump datasets -O /path/to/datasets.jsonld --all -r CKAN_URL
Dumps all datasets and harvester configs into a single .jsonld file (one JSON object per line).
ckanapi load datasets -I /path/to/datasets.jsonld -r CKAN_URL -a API_KEY
Loads datasets into CKAN.
- You'll need to have already loaded in the groups and organisations before loading in datasets (due to the data relations).
- Harvesters can sometimes run during the loading process which can cause duplicate datasets (harvester runs before a dataset is imported).
- Resources imported with CKANAPI sometimes do not trigger the datapusher, so they are never added to the datastore for data exploration. In this case you can make use of the datastore_create API call with a list of the resource ids known to be failing. This may happen because the imported resource property datastore_active is true while, in a new CKAN instance, the datastore is still empty for that resource (we have yet to test this assumption).
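To build the list of candidate resource ids mentioned in the last point, one option is to scan the dataset dump before loading it. The sketch below assumes the dump holds one dataset JSON object per line with a `resources` list; `resources_flagged_datastore_active` is a hypothetical helper name, and whether these resources are really the failing ones is the untested assumption described above.

```python
import json


def resources_flagged_datastore_active(lines):
    """Return ids of resources whose datastore_active property is true."""
    flagged = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        dataset = json.loads(line)
        for resource in dataset.get("resources", []):
            # Accept both a JSON boolean and the string "true".
            value = resource.get("datastore_active")
            if str(value).lower() == "true":
                flagged.append(resource["id"])
    return flagged


if __name__ == "__main__":
    # Hypothetical one-line dump, not real portal data.
    sample = ['{"name": "demo", "resources": [{"id": "r1", "datastore_active": true}]}']
    print(resources_flagged_datastore_active(sample))
```

Any iterable of lines works, so an open file handle on /path/to/datasets.jsonld can be passed in directly.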
The old URL pattern uses the format dataset/show/1234, which does not directly map to the new URLs on the CKAN portal. CKAN uses SEO-friendly URLs, e.g. dataset/my-extremely-interesting-dataset-about-stuff. The final segment of the URL is based on the dataset title.
To build a mapping, we used an export from our web analytics that includes the old dataset id number and the page title. We then ran it through a simple spreadsheet formula that extracts the title and builds a best guess for the new URL.
=concat(lookup!$A$1,lower(ArrayFormula(REGEXREPLACE(SUBSTITUTE(data!B1," » Data.govt.nz",""), "[^a-zA-Z0-9]+", "-"))))
This:
- Looks up the base domain in a separate worksheet, i.e. catalogue.data.govt.nz
- Removes the unneeded text from the title string, i.e. » Data.govt.nz
- Converts the remaining string into a lower-case URL segment by replacing each run of non-alphanumeric characters with a hyphen
See included open spreadsheet tool.
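For anyone who prefers a script to a spreadsheet, the formula's logic can be approximated in Python. This is a sketch that mirrors the formula, not a replacement for the included tool: `guess_new_url` is a hypothetical name, and the base value in the example (including the dataset/ path) is an assumption about what the lookup worksheet holds.

```python
import re

# Text the old site appended to every page title.
TITLE_SUFFIX = " » Data.govt.nz"


def guess_new_url(page_title, base):
    """Build a best-guess new URL from an old page title."""
    # SUBSTITUTE(title, " » Data.govt.nz", "")
    title = page_title.replace(TITLE_SUFFIX, "")
    # REGEXREPLACE(..., "[^a-zA-Z0-9]+", "-") followed by lower()
    slug = re.sub(r"[^a-zA-Z0-9]+", "-", title).lower()
    # Like the formula, this does not trim leading or trailing hyphens.
    return base + slug


if __name__ == "__main__":
    # Assumed base value; the spreadsheet reads it from a lookup worksheet.
    base = "https://catalogue.data.govt.nz/dataset/"
    print(guess_new_url("My Extremely Interesting Dataset About Stuff » Data.govt.nz", base))
```

As with the spreadsheet, the result is only a best guess, so titles that end in punctuation or that were renamed during migration still need a manual check.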
See LICENSE.md
See CONTRIBUTING.md
- Cam Findlay [email protected]
- Data.govt.nz team [email protected]