
OAPIR Harvesting #48

Open
pvretano opened this issue Aug 10, 2020 · 15 comments
@pvretano (Contributor)

@tomkralidis asked me about harvesting and this is what I sent him. I don't think harvesting would be part of the core but I am creating an issue anyway to stimulate discussion.

HARVEST

This is an example of harvesting one or more resources. You basically do a POST on a harvest endpoint. The harvest operation will result in N catalogue records being created.

   CLIENT                                                     SERVER
     |                                                           |
     |   POST /collections/{catalogueId}/harvest HTTP/1.1        |
     |   Host: www.someserver.com                                |
     |   Accept: application/json                                |
     |                                                           |
     |   {                                                       |
     |      "links": [                                           |
     |         {                                                 |
     |            "href": "http://...resource URL...",           |
     |            "type": "... MIME type for resource..."        |
     |         },                                                |
     |         {                                                 |
     |            "href": "http://...resource URL...",           |
     |            "type": "... MIME type for resource..."        |
     |         }                                                 |
     |      ]                                                    |
     |   }                                                       |
     |---------------------------------------------------------->|
     |                                                           |
     |   HTTP/1.1 201 Created                                    |
     |   Location: /collections/{catalogueId}/harvest/{hId}      |
     |<----------------------------------------------------------|
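
As a sketch, building that POST body client-side might look like the following (Python; the endpoint path and the `links`/`href`/`type` field names are taken from the example above, which is not yet standardized, and the URLs are made up):

```python
import json

def build_harvest_payload(resources):
    """Build the JSON body for a POST to the harvest endpoint.

    `resources` is a list of (url, mime_type) pairs; the field names
    follow the example in this issue, not a standardized schema.
    """
    return {
        "links": [{"href": href, "type": mime} for href, mime in resources]
    }

payload = build_harvest_payload([
    ("https://example.org/wms?request=GetCapabilities", "text/xml"),
    ("https://example.org/collections/obs", "application/json"),
])
body = json.dumps(payload)
# POST `body` to /collections/{catalogueId}/harvest; on a 201 response,
# the Location header carries the server-assigned harvest id ({hId}).
```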

Generally, the kinds of resources that I harvest into my catalogue (OGC landing pages, OGC capabilities documents, ISO metadata documents, etc.) do not have specific MIME types other than the general MIME type of the representation (e.g. text/xml, application/json, etc.). So, my server sniffs the resource to see if it can recognize it. It would be a bit easier if OGC had specific MIME types defined for some of these resources instead of just generic MIME types (old discussion!).

To trigger periodic re-harvesting of the resources, the client can append the "harvestInterval" parameter to the harvest URL. Its value is an ISO 8601 period (e.g. ...&harvestInterval=P2W&...).
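
A server handling `harvestInterval` needs to parse the ISO 8601 period; as a minimal sketch, here is a parser covering only the week/day subset (values like `P2W`); a full implementation would also handle years, months, and time components:

```python
import re
from datetime import timedelta

def parse_simple_duration(value):
    """Parse a small subset of ISO 8601 durations (weeks and days only),
    e.g. the "P2W" used for harvestInterval in this issue."""
    m = re.fullmatch(r"P(?:(\d+)W)?(?:(\d+)D)?", value)
    if not m or not any(m.groups()):
        raise ValueError(f"unsupported duration: {value}")
    weeks, days = (int(g) if g else 0 for g in m.groups())
    return timedelta(weeks=weeks, days=days)

parse_simple_duration("P2W")  # timedelta of 14 days
```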

To trigger asynchronous processing, the client can append a "responseHandler" parameter to the harvest endpoint URL. See: https://docs.opengeospatial.org/per/18-045.html#async_extension.

The harvest identifier (hId) is an identifier assigned by the server so that a subsequent DELETE can be used to do an unharvest.

GET THE LIST OF HARVEST IDENTIFIERS

To get the list of harvest identifiers, you can do a GET on the harvest endpoint. A list of links to each harvest resource is returned.

   CLIENT                                                     SERVER
     |                                                           |
     |   GET /collections/{catalogueId}/harvest   HTTP/1.1       |
     |   Host: www.someserver.com                                |
     |   Accept: application/json                                |
     |---------------------------------------------------------->|
     |                                                           |
     |    HTTP/1.1 200 OK                                        |
     |    Content-Type: application/json                         |
     |                                                           |
     |   {                                                       |
     |      "links": [                                           |
     |         {                                                 |
     |            "href": "http://www.server.com/collections/    |
     |                     {catalogueId}/harvest/hid01",         |
     |            "rel": "ogc:harvest"                           |
     |         },                                                |
     |         {                                                 |
     |            "href": "http://www.server.com/collections/    |
     |                     {catalogueId}/harvest/hid57",         |
     |            "rel": "ogc:harvest"                           |
     |         },                                                |
     |         {                                                 |
     |            "href": "http://www.server.com/collections/    |
     |                     {catalogueId}/harvest/hid9",          |
     |            "rel": "ogc:harvest"                           |
     |         },                                                |
     |         .                                                 |
     |         .                                                 |
     |         .                                                 |
     |      ]                                                    |
     |   }                                                       |
     |                                                           |
     |<----------------------------------------------------------|

I am not sure what the rel should be in this case. Perhaps something like "ogc:harvest". Dunno. Maybe we don't need one since this is just a list of harvested resources.

Doing a GET on a specific harvest resource URL will return details about that resource including links to the catalogue records created as a result of the harvest.

   CLIENT                                                     SERVER
     |                                                           |
     |   GET /collections/{catalogueId}/harvest/hid57 HTTP/1.1   |
     |   Host: www.someserver.com                                |
     |   Accept: application/json                                |
     |---------------------------------------------------------->|
     |                                                           |
     |    HTTP/1.1 200 OK                                        |
     |    Content-Type: application/json                         |
     |                                                           |
     |   {                                                       |
     |      "harvestInterval": "P2W",                            |
     |      "lastHarvest": "2020-08-01T13:41:45",                |
     |      "records": [                                         |
     |         {                                                 |
     |            "href": "http://www.server.com/collections/    |
     |                     {catalogueId}/harvest/r1013",         |
     |            "rel": "related"                               |
     |         },                                                |
     |         {                                                 |
     |            "href": "http://www.server.com/collections/    |
     |                     {catalogueId}/harvest/r0087",         |
     |            "rel": "related"                               |
     |         },                                                |
     |         {                                                 |
     |            "href": "http://www.server.com/collections/    |
     |                     {catalogueId}/harvest/r7373",         |
     |            "rel": "related"                               |
     |         },                                                |
     |         .                                                 |
     |         .                                                 |
     |         .                                                 |
     |      ]                                                    |
     |   }                                                       |
     |                                                           |
     |<----------------------------------------------------------|
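
A harvesting client could combine the lastHarvest and harvestInterval values from that response to decide when the next run is due; a minimal sketch (the property names follow the example above and are not a standardized schema):

```python
from datetime import datetime, timedelta

# Fields copied from the example response above.
harvest_info = {"harvestInterval": "P2W", "lastHarvest": "2020-08-01T13:41:45"}

last = datetime.fromisoformat(harvest_info["lastHarvest"])
interval = timedelta(weeks=2)  # "P2W", parsed by hand for brevity
next_due = last + interval     # when the server would re-harvest
```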

Again, I am not sure about the appropriate rel. I show "related" but maybe something like "ogc:record" would be more appropriate or maybe a rel is not required at all.

UNHARVEST

To unharvest a previous harvest, you simply DELETE the harvest resource URL. The server would delete all the catalogue records created by the harvest along with the harvest resource itself, so that a subsequent GET on the harvest endpoint would no longer list hid57 (in my example).

   CLIENT                                                     SERVER
     |                                                           |
     |   DELETE /collections/{catalogueId}/harvest/hid57 HTTP/1.1|
     |   Host: www.someserver.com                                |
     |---------------------------------------------------------->|
     |                                                           |
     |   HTTP/1.1 200 OK                                         |
     |<----------------------------------------------------------|
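
A minimal client-side sketch of that unharvest call (hypothetical host and catalogue id; stdlib urllib shown rather than an HTTP library):

```python
from urllib import request

# hid57 is the server-assigned harvest id returned in the Location
# header of the original POST; host and catalogue id are made up.
req = request.Request(
    "https://www.someserver.com/collections/mycat/harvest/hid57",
    method="DELETE",
)
# urllib.request.urlopen(req) would then remove the harvest resource and
# all catalogue records it created, per the behaviour described above.
```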

So, this is my thinking about harvesting so far. I have started implementing some of this to see how it flies!

Comments welcome!

@tomkralidis (Contributor)

We will want to consider a resourcetype-style parameter that provides additional information on what is being harvested, or a link relation type, to aid in identifying a given resource.

For an example, here's the approach we have used in pycsw: https://github.com/geopython/pycsw/blob/master/pycsw/core/metadata.py#L48.

@ghobona (Contributor) commented Sep 15, 2020

2020-09-15 OGC Member Meeting.

There are some use cases that are missing. Happy to provide information about the use cases. @uvoges

OAI-PMH is another approach that has been successfully used. @chris-little

OAI-PMH is outdated but was easy to implement. We should align these use cases. @uvoges

Worth having an extension (further Part) on harvesting. Jari

Harvesting and transactions should be two separate extensions of OGC API - Records. @uvoges

@pvretano (Contributor, Author)

Related: Issue #50

@mhogeweg (Contributor)

A couple of thoughts on this:

HARVEST

'The harvest operation will result in N catalogue records being created' - the harvest operation only returns existing records, right? It doesn't 'create' anything.

In general we approach harvesting as a combination of a list of items and a way to iterate through that list. for example:

  • A site provides a complete listing of the catalog content in some form: for example, the DCAT variant used by Data.gov, STAC, sitemaps, ArcGIS Servers, file shares, etc. There is no 'search' involved here, just 'give me the list'.
  • Query for updates/additions since some date (the previous harvest date, or the beginning of time, etc.) and then iterate over the results to get the individual records. This is how we have done CSW harvesting for years: GetRecords + n*GetRecordById, but also THREDDS.
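
The second pattern (query for changes, then iterate) can be sketched as a small paging loop; `fetch_page` here is an assumed callable wrapping whatever paged search the catalogue offers (e.g. CSW GetRecords, or /items with a datetime filter), and the toy stand-in below exists only so the sketch is runnable:

```python
def harvest_updates(fetch_page, since):
    """Yield records modified after `since`, page by page.

    `fetch_page(since, start)` returns (records, next_start), where
    next_start is None on the last page.
    """
    start = 0
    while start is not None:
        records, start = fetch_page(since, start)
        yield from records

# Toy stand-in for a remote catalogue, for illustration only.
def fake_fetch(since, start):
    pages = {0: (["rec-1", "rec-2"], 2), 2: (["rec-3"], None)}
    return pages[start]

updated = list(harvest_updates(fake_fetch, "2020-08-01"))
```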

I agree on the MIME type issue. We also parse links we find to understand them. Many times the metadata cannot be trusted: something is claimed to be a WMS while the link is a GetMap request, and the resource is thus no more than a fancy link to a PNG image.

I'm not sure I understand the discussion on harvest identifiers or periodic reharvest. Our approach has always been that the server does not maintain knowledge of who harvested it or when. It is up to the client to maintain that information, and we do this in Geoportal Server. There we do know that records were harvested during some harvest job with an id. Perhaps the server mentioned in this section is not the same as the server that was harvested?

In Geoportal Server Harvester we support both intervals and specific times for reharvest. I would not call this 'triggering' a harvest, but 'scheduling' one. The individual harvest jobs get triggered based on the schedule.

The DELETE of documents related to a harvest id seems OK. However, what if the same catalog is reharvested periodically? Do the individual harvests get their own id? If so, then how do you delete all items harvested from a specific catalog across multiple jobs? Perhaps a way to delete everything harvested from a specified source?

@pvgenuchten (Contributor) commented Nov 20, 2020

An important aspect of harvest/imports is deduplication and referencing the canonical (point-of-truth) URL of the harvested item. The same resource can potentially be harvested via various routes (local-regional-national-global). It would be interesting to have, on the /collections/{cat}/items and /collections/{cat}/items/{id} endpoints, some harvest-facilitating properties in case a record has been imported: the canonical URL of the item, the (last) harvested date, the encoding/schema in which it was harvested, and maybe a hash of the original document.

GeoNetwork's approach to harvesting seems similar to that of Geoportal: we schedule harvests to run periodically, and only remotely updated resources are imported again.

To retrieve the list of record ids to harvest, we usually allow people to filter using the usual item filter parameters. After retrieving the initial list, it would be helpful to retrieve records in batches of 50/100 with a /records/{cat}/items?id=[12,17,23,45] operation (or CQL).
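
The batching idea could be sketched like this; the `?id=[...]` query shape is the hypothetical one from this comment, not a standardized parameter, and the base URL is made up:

```python
def batch_item_urls(base, ids, size=100):
    """Chunk record ids into batched /items requests."""
    for i in range(0, len(ids), size):
        chunk = ids[i:i + size]
        yield f"{base}/items?id=[{','.join(str(x) for x in chunk)}]"

urls = list(batch_item_urls("https://example.org/records/cat",
                            [12, 17, 23, 45], size=2))
```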

For OAPIR-to-OAPIR harvesting, consider recommending the sitemap specification to facilitate the 'initial list of record ids' case. Many catalogue implementations already support it. A sitemap holds the URL and last update date for each record and supports pagination for larger catalogues.

@pvgenuchten (Contributor) commented Sep 15, 2022

From the OGC sprint today: a typical harvesting use case is to request all records that have changed since the last harvest. This is easy by filtering on last modification date, but it is impossible for resources that have been removed, unless the API provides a mechanism to retrieve removed items.
One way to 'solve' the above is to provide a sitemap.xml with a listing of all the record URLs, without providing the full records, just to evaluate which ones are removed.
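
The sitemap-diff idea can be sketched directly: compare the URLs the sitemap currently advertises against the record URLs harvested earlier (the example URLs below are made up):

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def removed_records(sitemap_xml, local_urls):
    """Anything we hold locally that the sitemap no longer lists
    was removed upstream."""
    root = ET.fromstring(sitemap_xml)
    remote = {loc.text.strip() for loc in root.iter(f"{NS}loc")}
    return set(local_urls) - remote

sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/items/r1</loc></url>
  <url><loc>https://example.org/items/r2</loc></url>
</urlset>"""

gone = removed_records(sitemap, [
    "https://example.org/items/r1",
    "https://example.org/items/r2",
    "https://example.org/items/r3",
])
```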

The question we had yesterday is which of /collections/xxx or /collections/cat/items/xxx would be the canonical URL within the server (typically harvesters only harvest canonical things). In this scenario a dataset, xxx, disseminated as a collection via OGC API - Features could also be available as a record item in an OGC API - Records collection.

@mhogeweg (Contributor)

This is the role the DCAT file plays in the open data space.

@pvgenuchten (Contributor)

Thanks @mhogeweg. Do you have an example of such a DCAT file? I know DCAT only as a metadata model.

@mhogeweg (Contributor)

Here is one from the Africa Geoportal (ArcGIS Hub): https://www.africageoportal.com/data.json
and one from US EPA: https://edg.epa.gov/data.json

@mhogeweg (Contributor)

In our Geoportal Server metadata catalog, we generate such a file automatically (as a cached version, since the catalogs can become quite large) as well as in one of the available output formats of the search API.

for example: https://gpt.geocloud.com/geoportal2/opensearch?f=dcat&from=1&size=10&sort=title.sort%3Aasc&esdsl=%7B%7D

@pvgenuchten (Contributor)

It seems to be a full dump of the database in RDF (JSON-LD), also an interesting feature considering harvesting. It is indeed important to cache it at intervals; some products will take minutes to generate such a file.

For the use case of identifying removed records since last harvest, I only need a list of record identifiers. Sitemap.xml would be an interesting candidate, also because of its wide adoption, but a json index file at the root of a collection would also be fine…

Search engine crawlers expect only a single sitemap on a domain, so not in subfolders, but you can link to multiple decentralized sitemaps from a central sitemap index; see https://www.sitemaps.org/protocol.html#index

@mhogeweg (Contributor) commented Sep 15, 2022

We have had a sitemap in Geoportal Server for many years. It works like you describe:
https://gpt.geocloud.com/geoportal/sitemap?f=sitemap

@pvretano (Contributor, Author)

@mhogeweg et al., I was using the term "harvest" in the way it was used in CSW 3.0. Let me try and explain the use case ... Say you have a web accessible resource somewhere (e.g. a satellite imagery product) that you want to make discoverable. One way to do that is to create one or more JSON documents (i.e. records) describing the resource and then POST those documents to the /collections/{catalogId}/items endpoint of the catalog to create one or more records in the catalog. This approach, however, puts the burden on the client to know how to create the records and POST them to the catalog. Another approach is that used in CSW 3.0, which is to simply pass the URL of the resource to the catalog and -- if the catalog recognizes the resource type -- let it do whatever magic needs to be done to create record(s) in the catalog to make that resource discoverable. In CSW 3.0 this operation was called "harvest". Based on the discussion in this issue, however, I am thinking that perhaps "harvest" is not the right term ... "register" maybe?

@mhogeweg (Contributor)

This is what we do in Geoportal Server when harvesting an ArcGIS Server site. Starting from the top-level endpoint of the ArcGIS REST Services Directory web page, such as this one (services.arcgisonline.com), we crawl every folder, every service, every endpoint (ArcGIS REST, W*S, SOAP) and, if desired, every layer, and generate individual metadata records for these that can then be ingested into a catalog.

We have selected small Dublin Core style records for this (an example), as we find that most of these services/layers have very little descriptive metadata even when one can have full metadata for them.

I'm not hung up on the terminology. Using Geoportal Server Harvester we harvest services from the ArcGIS Server (and from various other types of sources) and register them in the Catalog (or in ArcGIS Online/Portal).

@pvretano (Contributor, Author)

Yes, exactly ... we do that with the old W*S services and the new OGC APIs. We get the URL of the capabilities document OR the root URL and the catalog crawls the resources and registers everything it finds ... feature types, coverages, services, processes, etc.
