
# Comparing OAI-PMH and Backbone Importer for the DAAN RDF pipeline


OAI-PMH and the Backbone importer are two potential solutions for the first step in the LOD pipeline, which makes NISV data available as linked data. This first step requires RDF triples to be produced from DAAN data.

The OAI-PMH method for producing RDF triples from DAAN is implemented in a single project, the beng-lod-server (daan-oai-to-rdf branch). The Backbone method is implemented in two projects: the x-omgeving-backbone-rdf project (in GitLab) for creating RDF from an initial import and from updates, and the beng-lod-server project (daan-storage-api-to-rdf branch) for ad-hoc access to specific resources via the Storage API. (Note: should there be issues with using the Storage API, an alternative way to get ad-hoc access to the data would be via a triple store.) For some actions an additional script would be required, e.g. to retrieve triples for a subset of resources. These scripts have not been implemented.
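As an illustration of what such an ID-retrieval script involves, here is a minimal sketch using the standard OAI-PMH ListIdentifiers verb with resumption tokens. The endpoint URL is a placeholder, and the metadata prefix and date format are assumptions; the actual DAAN endpoint may differ.

```python
"""Sketch: harvest item IDs via OAI-PMH ListIdentifiers (with resumption tokens)."""
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ENDPOINT = "https://example.org/oai"  # placeholder, not the actual DAAN endpoint


def list_identifiers(from_date=None):
    params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
    if from_date:
        params["from"] = from_date  # incremental updates: only changed records
    while True:
        root = ET.fromstring(requests.get(ENDPOINT, params=params).content)
        for header in root.iter(f"{OAI}header"):
            yield header.findtext(f"{OAI}identifier")
        token = root.findtext(f"{OAI}ListIdentifiers/{OAI}resumptionToken")
        if not token:
            break
        # subsequent requests carry only the resumption token
        params = {"verb": "ListIdentifiers", "resumptionToken": token}


for oai_id in list_identifiers(from_date="2022-01-01"):
    print(oai_id)  # feed these IDs into the RDF conversion step
```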

Both methods currently use the RDF schema as the basis for converting the data. In the future we will consider using RML for this conversion instead.
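To give an idea of what the schema-driven conversion amounts to, here is a sketch using rdflib. The namespaces, class and property names, and DAAN field names below are placeholders, not the actual schema used in beng-lod-server.

```python
"""Sketch: schema-driven conversion of a DAAN record (as a dict) to RDF."""
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("https://example.org/schema/")  # placeholder namespace
RESOURCE = Namespace("https://example.org/id/")    # placeholder namespace


def daan_record_to_graph(record: dict) -> Graph:
    g = Graph()
    g.bind("sdo", SCHEMA)
    subject = URIRef(RESOURCE["program/" + record["id"]])
    g.add((subject, RDF.type, SCHEMA.Program))
    # one triple per mapped DAAN field; the real mapping is defined by the schema
    if "title" in record:
        g.add((subject, SCHEMA.hasTitle, Literal(record["title"])))
    if "sortdate" in record:
        g.add((subject, SCHEMA.hasSortDate, Literal(record["sortdate"])))
    return g


print(daan_record_to_graph({"id": "123", "title": "Polygoon journaal"})
      .serialize(format="turtle"))
```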

The triples produced by both methods can then be delivered to users, but that is a job for later steps in the pipeline and is not considered here.

Key concepts to include in the linked data (a sketch of what these could look like as triples follows the list):

- Hierarchy: series/season/programme/scene description
- Title
- Genre
- Catalog
- Distribution channel
- Carrier type? (excluded for the present due to poor performance when including all carriers)
- Sort date
- Broadcaster
- Summary/description
- Network
- Link to item online
- People (executive/creator/personname…)
- Rights
- Locations (recording, museum, and plain location)
- Keywords
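For illustration, the triples for a single programme could take roughly the following shape, covering a few of the fields above. All URIs and property names are placeholders; the actual vocabulary comes from the schema.

```python
"""Sketch: the rough shape of the produced triples for one programme."""
from rdflib import Graph

TURTLE = """
@prefix sdo: <https://example.org/schema/> .

<https://example.org/id/program/123> a sdo:Program ;
    sdo:hasTitle "Polygoon journaal" ;
    sdo:hasGenre "nieuws" ;
    sdo:hasBroadcaster "NTS" ;
    sdo:hasSortDate "1960-01-01" ;
    sdo:isPartOfSeason <https://example.org/id/season/45> .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")  # parsing validates the snippet
print(len(g), "triples")
```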
| Factor | OAI-PMH | Backbone | Comment |
| --- | --- | --- | --- |
| Ad-hoc access | Yes | Yes | |
| Initial load | Yes, requires additional script to get IDs for retrieval (via OAI) | Yes | |
| Retrieval of a specific set of data (e.g. the Polygoon collection) | Yes, requires additional script to get IDs for retrieval | Yes, requires additional script to get IDs for retrieval | IDs can be retrieved from the OAI for a date range; for other subsets of the data an ES index or a triple store would need to be queried |
| Incremental updates | Yes, requires additional script to get IDs for retrieval (via OAI) | Yes | |
| Upload to triple store | Yes | Yes | Both produce triples; the methods for uploading these triples are then the same for either option (either updating the triple store directly or saving resources to Turtle files that are then uploaded) |
| Reconciliation of GTAA concepts | Yes | Yes | Reconciliation is done as an integral part of the conversion to RDF and is common to both methods |
| Availability of core fields (see above) | All except link to item online | All except link to item online | Not sure whether this field exists in DAAN |
| Availability of subtitles | No | Yes | Questionable whether we are interested in subtitles in linked data |
| Availability of rights information | Yes, e.g. this program | Yes | I wonder if some or all fields may be removed when the OAI-PMH goes into production, as there are e.g. email addresses and telephone numbers in the rights note. I also wonder if we should even show such fields in the Media Suite |
| Suitability for Media Suite LD | No, limited to publicly available metadata | Yes, same data as the Media Suite | |
| Suitability for public LOD | Yes | Yes, but the data would have to be filtered | |
| Performance | Ad-hoc access is fast enough for easy browsing | Ad-hoc access is fast enough for easy browsing. Processing of backbone messages takes about a second per message (calculated as the total time for a number of items divided by the number of items processed) | Loading to the triple store is slow, but this is common to both |
| Fit with x-omgeving | Stand-alone | beng-lod-server is stand-alone. The backbone import could be combined with the backbone import for Elasticsearch, but is this desirable? | |
| Ease of maintaining | Adds an additional data source to the x-omgeving. Core code is a single project. Requires additional scripts for retrieving subsets, initial loading and updates | Uses a data source that is already in use in the x-omgeving. Core code is split over two projects. Requires an additional script for retrieving subsets | |
| Dependencies | Dependent on OAI-PMH | Dependent on the Storage API and the backbone pusher | Do we have any idea how well these will be supported/maintained over the coming years? What sort of capacity do they have? Are there any restrictions on the load we are allowed to place on them? |
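On uploading to the triple store: for either option this could, for instance, use the SPARQL 1.1 Graph Store HTTP protocol, which stores such as Fuseki support. A minimal sketch, with placeholder endpoint and graph URIs:

```python
"""Sketch: upload a Turtle file to a triple store via the Graph Store protocol."""
import requests

STORE = "https://example.org/triplestore/data"  # placeholder endpoint
GRAPH = "https://example.org/graph/daan"        # placeholder named graph


def upload_turtle(path: str) -> None:
    with open(path, "rb") as f:
        resp = requests.post(
            STORE,
            params={"graph": GRAPH},
            data=f.read(),
            headers={"Content-Type": "text/turtle"},
        )
    resp.raise_for_status()


upload_turtle("program_123.ttl")
```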