
OAPIR Harvesting #48

Open
pvretano opened this issue Aug 10, 2020 · 15 comments
@pvretano (Contributor)

@tomkralidis asked me about harvesting and this is what I sent him. I don't think harvesting would be part of the core but I am creating an issue anyway to stimulate discussion.

HARVEST

This is an example of harvesting one or more resources. You basically do a POST on a harvest endpoint. The harvest operation will result in N catalogue records being created.

   CLIENT                                                     SERVER
     |                                                           |
     |   POST /collections/{catalogueId}/harvest HTTP/1.1        |
     |   Host: www.someserver.com                                |
     |   Accept: application/json                                |
     |                                                           |
     |   {                                                       |
     |      "links": [                                           |
     |         {                                                 |
     |            "href": "http://...resource URL...",           |
     |            "type": "... MIME type for resource..."        |
     |         },                                                |
     |         {                                                 |
     |            "href": "http://...resource URL...",           |
     |            "type": "... MIME type for resource..."        |
     |         }                                                 |
     |      ]                                                    |
     |   }                                                       |
     |---------------------------------------------------------->|
     |                                                           |
     |   HTTP/1.1 201 Created                                    |
     |   Location: /collections/{catalogueId}/harvest/{hId}      |
     |<----------------------------------------------------------|
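
As a sketch, building that POST body client-side might look like the following (Python; the endpoint path and the `links`/`href`/`type` field names are taken from the example above, which is not yet standardized, and the URLs are made up):

```python
import json

def build_harvest_payload(resources):
    """Build the JSON body for a POST to the harvest endpoint.

    `resources` is a list of (url, mime_type) pairs; the field names
    follow the example in this issue, not a standardized schema.
    """
    return {
        "links": [{"href": href, "type": mime} for href, mime in resources]
    }

payload = build_harvest_payload([
    ("https://example.org/wms?request=GetCapabilities", "text/xml"),
    ("https://example.org/collections/obs", "application/json"),
])
body = json.dumps(payload)
# POST `body` to /collections/{catalogueId}/harvest; on a 201 response,
# the Location header carries the server-assigned harvest id ({hId}).
```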

Generally, the kinds of resources that I harvest into my catalogue (OGC landing pages, OGC capabilities documents, ISO metadata documents, etc.) do not have specific MIME types other than the general MIME type of the representation (e.g. text/xml, application/json, etc.). So, my server sniffs the resource to see if it can recognize it. It would be a bit easier if OGC had specific MIME types defined for some of these resources instead of just generic MIME types (old discussion!).

To trigger periodic re-harvesting of the resources, the client can append the "harvestInterval" parameter to the harvest URL. Its value is an ISO 8601 period (e.g. ...&harvestInterval=P2W&...).
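
A server handling `harvestInterval` needs to parse the ISO 8601 period; as a minimal sketch, here is a parser covering only the week/day subset (values like `P2W`); a full implementation would also handle years, months, and time components:

```python
import re
from datetime import timedelta

def parse_simple_duration(value):
    """Parse a small subset of ISO 8601 durations (weeks and days only),
    e.g. the "P2W" used for harvestInterval in this issue."""
    m = re.fullmatch(r"P(?:(\d+)W)?(?:(\d+)D)?", value)
    if not m or not any(m.groups()):
        raise ValueError(f"unsupported duration: {value}")
    weeks, days = (int(g) if g else 0 for g in m.groups())
    return timedelta(weeks=weeks, days=days)

parse_simple_duration("P2W")  # timedelta of 14 days
```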

To trigger asynchronous processing, the client can append a "responseHandler" parameter to the harvest endpoint URL. See: https://docs.opengeospatial.org/per/18-045.html#async_extension.

The harvest identifier (hId) is an identifier assigned by the server so that a subsequent DELETE can be used to do an unharvest.

GET THE LIST OF HARVEST IDENTIFIERS

To get the list of harvest identifiers, you can do a GET on the harvest endpoint. A list of links to each harvest resource is returned.

   CLIENT                                                     SERVER
     |                                                           |
     |   GET /collections/{catalogueId}/harvest   HTTP/1.1       |
     |   Host: www.someserver.com                                |
     |   Accept: application/json                                |
     |---------------------------------------------------------->|
     |                                                           |
     |    HTTP/1.1 200 OK                                        |
     |    Content-Type: application/json                         |
     |                                                           |
     |   {                                                       |
     |      "links": [                                           |
     |         {                                                 |
     |            "href": "http://www.server.com/collections/    |
     |                     {catalogueId}/harvest/hid01",         |
     |            "rel": "ogc:harvest"                           |
     |         },                                                |
     |         {                                                 |
     |            "href": "http://www.server.com/collections/    |
     |                     {catalogueId}/harvest/hid57",         |
     |            "rel": "ogc:harvest"                           |
     |         },                                                |
     |         {                                                 |
     |            "href": "http://www.server.com/collections/    |
     |                     {catalogueId}/harvest/hid9",          |
     |            "rel": "ogc:harvest"                           |
     |         },                                                |
     |         .                                                 |
     |         .                                                 |
     |         .                                                 |
     |      ]                                                    |
     |   }                                                       |
     |                                                           |
     |<----------------------------------------------------------|

I am not sure what the rel should be in this case. Perhaps something like "ogc:harvest". Dunno. Maybe we don't need one since this is just a list of harvested resources.

Doing a GET on a specific harvest resource URL will return details about that resource including links to the catalogue records created as a result of the harvest.

   CLIENT                                                     SERVER
     |                                                           |
     |   GET /collections/{catalogueId}/harvest/hid57 HTTP/1.1   |
     |   Host: www.someserver.com                                |
     |   Accept: application/json                                |
     |---------------------------------------------------------->|
     |                                                           |
     |    HTTP/1.1 200 OK                                        |
     |    Content-Type: application/json                         |
     |                                                           |
     |   {                                                       |
     |      "harvestInterval": "P2W",                            |
     |      "lastHarvest": "2020-08-01T13:41:45",                |
     |      "records": [                                         |
     |         {                                                 |
     |            "href": "http://www.server.com/collections/    |
     |                     {catalogueId}/harvest/r1013",         |
     |            "rel": "related"                               |
     |         },                                                |
     |         {                                                 |
     |            "href": "http://www.server.com/collections/    |
     |                     {catalogueId}/harvest/r0087",         |
     |            "rel": "related"                               |
     |         },                                                |
     |         {                                                 |
     |            "href": "http://www.server.com/collections/    |
     |                     {catalogueId}/harvest/r7373",         |
     |            "rel": "related"                               |
     |         },                                                |
     |         .                                                 |
     |         .                                                 |
     |         .                                                 |
     |      ]                                                    |
     |   }                                                       |
     |                                                           |
     |<----------------------------------------------------------|
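
A harvesting client could combine the lastHarvest and harvestInterval values from that response to decide when the next run is due; a minimal sketch (the property names follow the example above and are not a standardized schema):

```python
from datetime import datetime, timedelta

# Fields copied from the example response above.
harvest_info = {"harvestInterval": "P2W", "lastHarvest": "2020-08-01T13:41:45"}

last = datetime.fromisoformat(harvest_info["lastHarvest"])
interval = timedelta(weeks=2)  # "P2W", parsed by hand for brevity
next_due = last + interval     # when the server would re-harvest
```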

Again, I am not sure about the appropriate rel. I show "related" but maybe something like "ogc:record" would be more appropriate or maybe a rel is not required at all.

UNHARVEST

To unharvest a previous harvest, you simply DELETE the harvest resource URL. The server would delete all the catalogue records created by the harvest along with the harvest resource itself, so that a subsequent GET on the harvest endpoint would no longer list hid57 (in my example).

   CLIENT                                                     SERVER
     |                                                           |
     |   DELETE /collections/{catalogueId}/harvest/hid57 HTTP/1.1|
     |   Host: www.someserver.com                                |
     |---------------------------------------------------------->|
     |                                                           |
     |   HTTP/1.1 200 OK                                         |
     |<----------------------------------------------------------|
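
A minimal client-side sketch of that unharvest call (hypothetical host and catalogue id; stdlib urllib shown rather than an HTTP library):

```python
from urllib import request

# hid57 is the server-assigned harvest id returned in the Location
# header of the original POST; host and catalogue id are made up.
req = request.Request(
    "https://www.someserver.com/collections/mycat/harvest/hid57",
    method="DELETE",
)
# urllib.request.urlopen(req) would then remove the harvest resource and
# all catalogue records it created, per the behaviour described above.
```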

So, this is my thinking about harvesting so far. I have started implementing some of this to see how it flies!

Comments welcome!

@tomkralidis (Contributor)

We will want to consider a resourcetype-style parameter that provides additional information on what is being harvested, or a link relation type, to aid in identifying a given resource.

For an example, here's the approach we have used in pycsw: https://github.com/geopython/pycsw/blob/master/pycsw/core/metadata.py#L48.

@ghobona (Contributor) commented Sep 15, 2020

2020-09-15 OGC Member Meeting.

There are some use cases that are missing. Happy to provide information about the use cases. @uvoges

OAI-PMH is another approach that has been successfully used. @chris-little

OAI-PMH is outdated but was easy to implement. We should align these use cases. @uvoges

Worth having an extension (further Part) on harvesting. Jari

Harvesting and transactions should be two separate extensions of OGC API - Records. @uvoges

@pvretano (Contributor, Author)

Related: Issue #50

@mhogeweg (Contributor)

A couple of thoughts on this:

HARVEST

'The harvest operation will result in N catalogue records being created' - the harvest operation only returns existing records, right? It doesn't 'create' anything.

In general we approach harvesting as a combination of a list of items and a way to iterate through that list. for example:

  • A site provides a complete listing of the catalog content in some form: for example, the DCAT variant used by Data.gov, STAC, sitemaps, ArcGIS Servers, file shares, etc. There is no 'search' involved here, just 'give me the list'.
  • Query for updates/additions since some date (the previous harvest date, or the beginning of time, etc.) and then iterate over the results to get the individual records. This is how we have done CSW harvesting for years: GetRecords + n*GetRecordById, but also THREDDS.
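
The second pattern (query for changes, then iterate) can be sketched as a small paging loop; `fetch_page` here is an assumed callable wrapping whatever paged search the catalogue offers (e.g. CSW GetRecords, or /items with a datetime filter), and the toy stand-in below exists only so the sketch is runnable:

```python
def harvest_updates(fetch_page, since):
    """Yield records modified after `since`, page by page.

    `fetch_page(since, start)` returns (records, next_start), where
    next_start is None on the last page.
    """
    start = 0
    while start is not None:
        records, start = fetch_page(since, start)
        yield from records

# Toy stand-in for a remote catalogue, for illustration only.
def fake_fetch(since, start):
    pages = {0: (["rec-1", "rec-2"], 2), 2: (["rec-3"], None)}
    return pages[start]

updated = list(harvest_updates(fake_fetch, "2020-08-01"))
```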

I agree on the MIME type issue. We also parse links we find to understand them. Many times the metadata cannot be trusted: something is claimed to be a WMS while the link is a GetMap request, and the resource is thus no more than a fancy link to a PNG image.

I'm not sure I understand the discussion on harvest identifiers or periodic reharvest. Our approach has always been that the server does not maintain knowledge of who harvested it or when. It is up to the client to maintain that information, and we do this in Geoportal Server. There we do know that records were harvested during some harvest job with an id. Perhaps the server mentioned in this section is not the same as the server that was harvested?

In Geoportal Server Harvester we support both intervals and specific times for reharvest. I would not call this 'triggering' a harvest, but 'scheduling' one. The individual harvest jobs get triggered based on the schedule.

The DELETE of documents related to a harvest id seems OK. However, what if the same catalog is reharvested periodically? Do the individual harvests get their own id? If so, then how do you delete all items harvested from a specific catalog across multiple jobs? Perhaps a way to delete everything harvested from a specified source?

@pvgenuchten (Contributor) commented Nov 20, 2020

An important aspect of harvest/imports is deduplication and referencing the canonical (point-of-truth) URL of the harvested item. The same resource can potentially be harvested via various routes (local-regional-national-global). It would be interesting to have, on the /collections/{cat}/items and /collections/{cat}/items/{id} endpoints, some harvest-facilitating properties in case a record has been imported: the canonical URL of the item, the (last) harvested date, the encoding/schema in which it was harvested, and maybe a hash of the original document.

GeoNetwork's approach to harvesting seems similar to that of Geoportal: we schedule harvests to run periodically, and only remotely updated resources are imported again.

To retrieve the list of record ids to harvest, we usually allow people to filter using the usual item filter parameters. After retrieving the initial list, it would be helpful to retrieve records in batches of 50/100 with a /records/{cat}/items?id=[12,17,23,45] operation (or CQL).
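
The batching idea could be sketched like this; the `?id=[...]` query shape is the hypothetical one from this comment, not a standardized parameter, and the base URL is made up:

```python
def batch_item_urls(base, ids, size=100):
    """Chunk record ids into batched /items requests."""
    for i in range(0, len(ids), size):
        chunk = ids[i:i + size]
        yield f"{base}/items?id=[{','.join(str(x) for x in chunk)}]"

urls = list(batch_item_urls("https://example.org/records/cat",
                            [12, 17, 23, 45], size=2))
```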

For OAPIR-to-OAPIR harvesting, consider recommending the sitemap specification to facilitate the 'initial list of record ids' case. Many catalogue implementations already support it. A sitemap holds the URL and last update date for each record and supports pagination for larger catalogues.

@pvgenuchten (Contributor) commented Sep 15, 2022

From the OGC sprint today: a typical harvesting use case is to request all records that have changed since the last harvest. This is easy by filtering on last modification date, but it is impossible for resources that have been removed, unless the API provides a mechanism to retrieve removed items.
One way to 'solve' the above is to provide a sitemap.xml with a listing of all the record URLs, without providing the full records, just to evaluate which ones are removed.
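
The sitemap-diff idea can be sketched directly: compare the URLs the sitemap currently advertises against the record URLs harvested earlier (the example URLs below are made up):

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def removed_records(sitemap_xml, local_urls):
    """Anything we hold locally that the sitemap no longer lists
    was removed upstream."""
    root = ET.fromstring(sitemap_xml)
    remote = {loc.text.strip() for loc in root.iter(f"{NS}loc")}
    return set(local_urls) - remote

sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/items/r1</loc></url>
  <url><loc>https://example.org/items/r2</loc></url>
</urlset>"""

gone = removed_records(sitemap, [
    "https://example.org/items/r1",
    "https://example.org/items/r2",
    "https://example.org/items/r3",
])
```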

The question we had yesterday is which of /collections/xxx or /collections/cat/items/xxx would be the canonical URL within the server (typically harvesters only harvest canonical things). In this scenario a dataset, xxx, disseminated as a collection via OGC API - Features could also be available as a record item in an OGC API - Records collection.

@mhogeweg (Contributor)

This is the role the DCAT file plays in the open data space.

@pvgenuchten (Contributor)

Thanks @mhogeweg. Do you have an example of such a DCAT file? I know DCAT only as a metadata model.

@mhogeweg (Contributor)

Here is one from the Africa Geoportal (ArcGIS Hub): https://www.africageoportal.com/data.json
and one from US EPA: https://edg.epa.gov/data.json

@mhogeweg (Contributor)

In our Geoportal Server metadata catalog, we generate such a file automatically (as a cached version, since the catalogs can become quite large) as well as in one of the available output formats of the search API.

for example: https://gpt.geocloud.com/geoportal2/opensearch?f=dcat&from=1&size=10&sort=title.sort%3Aasc&esdsl=%7B%7D

@pvgenuchten (Contributor)

It seems to be a full dump of the database in RDF (JSON-LD), also an interesting feature considering harvesting. It is indeed important to cache it at intervals; some products will take minutes to generate such a file.

For the use case of identifying removed records since last harvest, I only need a list of record identifiers. Sitemap.xml would be an interesting candidate, also because of its wide adoption, but a json index file at the root of a collection would also be fine…

Search engine crawlers expect only a single sitemap on a domain, so not in subfolders, but you can link to multiple decentralized sitemaps from a central sitemap index; see https://www.sitemaps.org/protocol.html#index

@mhogeweg (Contributor) commented Sep 15, 2022

We have had a sitemap in Geoportal Server for many years. It works like you describe:
https://gpt.geocloud.com/geoportal/sitemap?f=sitemap

@pvretano (Contributor, Author)

@mhogeweg et al., I was using the term "harvest" in the way it was used in CSW 3.0. Let me try and explain the use case ... Say you have a web accessible resource somewhere (e.g. a satellite imagery product) that you want to make discoverable. One way to do that is to create one or more JSON documents (i.e. records) describing the resource and then POST those documents to the /collections/{catalogId}/items endpoint of the catalog to create one or more records in the catalog. This approach, however, puts the burden on the client to know how to create the records and POST them to the catalog. Another approach is that used in CSW 3.0, which is to simply pass the URL of the resource to the catalog and -- if the catalog recognizes the resource type -- let it do whatever magic needs to be done to create record(s) in the catalog to make that resource discoverable. In CSW 3.0 this operation was called "harvest". Based on the discussion in this issue, however, I am thinking that perhaps "harvest" is not the right term ... "register" maybe?

@mhogeweg (Contributor)

This is what we do in Geoportal Server when harvesting an ArcGIS Server site. Starting from the top-level endpoint of the ArcGIS REST Services Directory web page, such as this one (services.arcgisonline.com), we crawl every folder, every service, every endpoint (ArcGIS REST, W*S, SOAP) and, if desired, every layer, and generate individual metadata records for these that can then be ingested into a catalog.

We have selected small Dublin Core style records for this (an example), as we find that most of these services/layers have very little descriptive metadata even when one can have full metadata for them.

I'm not hung up on the terminology. Using Geoportal Server Harvester we harvest services from the ArcGIS Server (and from various other types of sources) and register them in the Catalog (or in ArcGIS Online/Portal).

@pvretano (Contributor, Author)

Yes, exactly ... we do that with the old W*S services and the new OGC APIs. We get the URL of the capabilities document OR the root URL and the catalog crawls the resources and registers everything it finds ... feature types, coverages, services, processes, etc.
