OAPIR Harvesting #48
Comments
We will want to consider a [...]. For an example, here's the approach we have used in pycsw: https://github.com/geopython/pycsw/blob/master/pycsw/core/metadata.py#L48.
2020-09-15 OGC Member Meeting.
There are some use cases that are missing. Happy to provide information about the use cases. @uvoges
OAI-PMH is another approach that has been successfully used. @chris-little
OAI-PMH is outdated but was easy to implement. We should align these use cases. @uvoges
Worth having an extension (further Part) on harvesting. Jari
Harvesting and transactions should be two separate extensions of OGC API - Records. @uvoges
Related: Issue #50
A couple of thoughts on this: under HARVEST, 'The harvest operation will result in N catalogue records being created'. The harvest operation only returns existing records, right? It doesn't 'create' anything. In general we approach harvesting as a combination of a list of items and a way to iterate through that list. For example:
I agree on the MIME type issue. We also parse links we find to understand them. Many times the metadata cannot be trusted: something is claimed to be a WMS, while the link is a GetMap request and the resource is thus no more than a fancy link to a PNG image.
I'm not sure I understand the discussion on harvest identifiers or periodic reharvest. Our approach has always been that the server does not maintain knowledge of who harvested it or when. It is up to the client to maintain that information, and we do this in Geoportal Server. There we do know that records were harvested during some harvest job with an Id. Perhaps the server mentioned in this section is not the same as the server that was harvested?
In Geoportal Server Harvester we support both intervals and specific times for reharvest. I would not call this 'trigger' a harvest, but 'schedule' a harvest. The individual harvest jobs get triggered based on the schedule.
The DELETE of documents related to a harvest id seems ok. However, what if the same catalog is reharvested periodically? Do the individual harvests get their own id? If so, then how do you delete all items harvested from a specific catalog across multiple jobs? Perhaps a way to delete everything harvested from a specified source?
An important aspect of harvests/imports is deduplication and referencing the canonical (point-of-truth) URL of the harvested item. The same resource is potentially harvested via various routes (local-regional-national-global). It would be interesting to have some harvest-facilitating properties on the /collections/{cat}/items and /collections/{cat}/items/{id} endpoints in case a record has been imported: the canonical URL of the item, the (last) harvested date, the encoding/schema in which it was harvested and maybe a hash of the original document.
GeoNetwork's approach to harvesting seems similar to that of Geoportal: we schedule harvests to run periodically, and only remotely updated resources are imported again.
To retrieve the list of record ids to harvest, we usually allow people to narrow the selection with filters; the usual item filter parameter could apply. After retrieving the initial list, it would be helpful to retrieve records in batches of 50/100 with a /records/{cat}/items?id=[12,17,23,45] operation (or CQL).
For oapir to oapir harvesting, consider recommending the sitemap specification to facilitate the 'initial list of record ids' case. Many catalogue implementations already support the sitemap specification. It holds the URL and last update date for each record and supports pagination for larger catalogues.
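A minimal sketch of the batching idea, in Python. The `id` query parameter with comma-separated values and the example base URL are assumptions for illustration only; nothing in OGC API - Records defines them in this thread.

```python
from urllib.parse import urlencode


def batch_urls(base_url, record_ids, batch_size=50):
    """Yield one items URL per batch of record ids.

    The "id" parameter name and comma-separated encoding are hypothetical.
    """
    for start in range(0, len(record_ids), batch_size):
        batch = record_ids[start:start + batch_size]
        # Keep commas unescaped so the URL stays readable.
        query = urlencode({"id": ",".join(str(i) for i in batch)}, safe=",")
        yield f"{base_url}?{query}"


# Hypothetical catalogue endpoint, batches of two for demonstration.
urls = list(batch_urls("https://example.org/collections/cat/items",
                       [12, 17, 23, 45, 50], batch_size=2))
```

Each URL can then be fetched independently, which also makes it easy to retry a single failed batch.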
From the OGC sprint today: a typical harvesting use case is to request all records which have changed since the last harvest. This is easy by filtering on last modification date, but it is impossible for resources which have been removed, unless the API provides a mechanism to retrieve removed items.
The question we had yesterday is also of interest: which of /collections/xxx or /collections/cat/items/xxx would be the canonical URL within the server (typically harvesters only harvest canonical things)? In this scenario a dataset, xxx, disseminated as a collection via ogc-api-features, could also be available as a record item in an ogc-api-records collection.
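To illustrate why a plain identifier list is enough to detect removals, here is a rough Python sketch. It assumes the harvester keeps a map of record id to last-modified time from the previous run and can obtain the same map from the remote catalogue; `plan_reharvest` and the sample ids are hypothetical.

```python
from datetime import datetime, timezone


def plan_reharvest(previous, current):
    """Compare two {record id: last-modified datetime} maps.

    Returns (to_fetch, to_delete): ids that are new or modified remotely,
    and ids that were harvested before but no longer exist remotely.
    """
    to_fetch = sorted(
        rid for rid, modified in current.items()
        if rid not in previous or modified > previous[rid]
    )
    to_delete = sorted(previous.keys() - current.keys())
    return to_fetch, to_delete


t1 = datetime(2020, 9, 1, tzinfo=timezone.utc)
t2 = datetime(2020, 9, 15, tzinfo=timezone.utc)
previous = {"a": t1, "b": t1, "c": t1}
current = {"a": t1, "b": t2, "d": t2}
to_fetch, to_delete = plan_reharvest(previous, current)
```

Filtering on modification date alone would find "b" and "d" but never "c", which is the gap described above.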
This is the role the dcat file plays in the open data space.
Thanx @mhogeweg. Do you have an example of such a dcat file? I know dcat only as a metadata model.
Here is one from the Africa Geoportal (ArcGIS Hub): https://www.africageoportal.com/data.json
In our Geoportal Server metadata catalog, we generate such a file automatically (as a cached version, since the catalogs can become quite large) and also offer it as one of the available output formats of the search API, for example: https://gpt.geocloud.com/geoportal2/opensearch?f=dcat&from=1&size=10&sort=title.sort%3Aasc&esdsl=%7B%7D
It seems to be a full dump of the database in RDF (JSON-LD), which is also an interesting feature for harvesting. It is indeed important to cache it at intervals; some products will take minutes to generate such a file.
For the use case of identifying removed records since the last harvest, I only need a list of record identifiers. Sitemap.xml would be an interesting candidate, also because of its wide adoption, but a JSON index file at the root of a collection would also be fine…
Search engine crawlers expect only a single sitemap on a domain, so not in subfolders, but you can link to multiple decentralized sitemaps from a central sitemap, see https://www.sitemaps.org/protocol.html#index
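As a sketch of how a harvester could read such a list, the following Python uses only the standard library to extract each URL and its optional last-modified date from a sitemap document. The sample record URLs are made up; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return {url: lastmod string or None} for every <url> entry."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc:
            entries[loc.strip()] = lastmod.strip() if lastmod else None
    return entries


# Hypothetical sitemap for a records collection.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.org/collections/cat/items/12</loc>
    <lastmod>2020-09-15</lastmod>
  </url>
  <url>
    <loc>https://example.org/collections/cat/items/17</loc>
  </url>
</urlset>"""
entries = parse_sitemap(sample)
```

A sitemap index file would need one more level of iteration over its `<sitemap>` entries, which is not shown here.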
We have had a sitemap in Geoportal Server for many years. It does what you describe:
@mhogeweg et al., I was using the term "harvest" in the way it was used in CSW 3.0. Let me try and explain the use case ... Say you have a web-accessible resource somewhere (e.g. a satellite imagery product) that you want to make discoverable. One way to do that is to create one or more JSON documents (i.e. records) describing the resource and then POST those documents to the [...]
This is what we do in Geoportal Server when harvesting an ArcGIS Server site. Starting from the top-level endpoint of the ArcGIS REST Services Directory web page, such as this one: services.arcgisonline.com, we crawl every folder, every service, every endpoint (ArcGIS REST, W*S, SOAP) and, if desired, every layer, and generate individual metadata records for these that can then be ingested into a catalog. We have selected small Dublin Core style records for this (an example), as we find that most of these services/layers have very little descriptive metadata even when one can have full metadata for them. I'm not hung up on the terminology. Using Geoportal Server Harvester we harvest services from the ArcGIS Server (and from various other types of sources) and register them in the Catalog (or in ArcGIS Online/Portal).
Yes, exactly ... we do that with the old W*S services and the new OGC APIs. We get the URL of the capabilities document OR the root URL and the catalog crawls the resources and registers everything it finds ... feature types, coverages, services, processes, etc.
@tomkralidis asked me about Harvesting and this is what I sent him. I don't think harvesting would be part of the core but I am creating an issue anyway to stimulate discussion.
HARVEST
This is an example of harvesting one or more resources. You basically do a POST on a harvest endpoint. The harvest operation will result in N catalogue records being created.
Generally, the kinds of resource that I harvest in my catalogue (OGC landing pages, OGC capabilities documents, ISO metadata documents, etc.) do not have specific MIME types other than the general MIME type of the representation (e.g. text/xml, application/json, etc.). So, my server sniffs the resource to see if it can recognize it. It would be a bit easier if OGC had specific MIME types defined for some of these resources instead of just generic MIME types (old discussion!).
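A rough Python sketch of what such sniffing might look like. The mapping of XML root elements to resource kinds, the returned labels, and the JSON heuristic are all assumptions for illustration and are not how any particular server does it.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical mapping from XML root element local names to resource kinds.
XML_ROOTS = {
    "WMS_Capabilities": "wms-capabilities",
    "WFS_Capabilities": "wfs-capabilities",
    "MD_Metadata": "iso-metadata",
}


def sniff(content_type, body):
    """Guess the kind of resource from a generic MIME type plus its content."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in ("text/xml", "application/xml"):
        try:
            root = ET.fromstring(body)
        except ET.ParseError:
            return "unknown"
        # Drop the "{namespace}" prefix ElementTree puts on qualified tags.
        local_name = root.tag.rsplit("}", 1)[-1]
        return XML_ROOTS.get(local_name, "unknown")
    if media_type == "application/json":
        try:
            doc = json.loads(body)
        except json.JSONDecodeError:
            return "unknown"
        # Very weak heuristic: a JSON object with a "links" array.
        if isinstance(doc, dict) and isinstance(doc.get("links"), list):
            return "ogc-api-landing-page"
    return "unknown"


kind = sniff("text/xml; charset=utf-8",
             '<WMS_Capabilities xmlns="http://www.opengis.net/wms" version="1.3.0"/>')
```

Dedicated MIME types would let the server skip this step and dispatch on the Content-Type header alone.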
To trigger periodic re-harvesting of the resources, the client can append the "harvestInterval" parameter to the harvest URL. Its value is an ISO 8601 period (e.g. ...&harvestInterval=P2W&...).
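For illustration, a server could turn the "harvestInterval" value into a concrete interval along these lines. This Python sketch handles only a subset of ISO 8601 periods (weeks, days, hours, minutes) and rejects months and years because their length is calendar-dependent; `parse_interval` is a hypothetical helper.

```python
import re
from datetime import timedelta

# Subset of ISO 8601 durations: PnW, or PnD with an optional TnHnM part.
_PERIOD = re.compile(
    r"^P(?:(?P<weeks>\d+)W"
    r"|(?:(?P<days>\d+)D)?(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?)?)$"
)


def parse_interval(value):
    """Convert a period such as 'P2W' or 'P1DT12H' to a timedelta."""
    match = _PERIOD.match(value)
    parts = {} if match is None else {
        name: int(number)
        for name, number in match.groupdict().items()
        if number is not None
    }
    if not parts:
        raise ValueError(f"unsupported ISO 8601 period: {value!r}")
    return timedelta(**parts)


two_weeks = parse_interval("P2W")
```

The next harvest time is then simply the time of the last harvest plus the returned interval.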
To trigger asynchronous processing, the client can append a "responseHandler" parameter to the harvest endpoint URL. See: https://docs.opengeospatial.org/per/18-045.html#async_extension.
The harvest identifier (hId) is an identifier assigned by the server so that a subsequent DELETE can be used to do an unharvest.
GET THE LIST OF HARVEST IDENTIFIERS
To get the list of harvest identifiers, you can do a GET on the harvest endpoint. A list of links to each harvest resource is returned.
I am not sure what the rel should be in this case. Perhaps something like "ogc:harvest". Dunno. Maybe we don't need one since this is just a list of harvested resources.
Doing a GET on a specific harvest resource URL will return details about that resource including links to the catalogue records created as a result of the harvest.
Again, I am not sure about the appropriate rel. I show "related" but maybe something like "ogc:record" would be more appropriate or maybe a rel is not required at all.
UNHARVEST
To unharvest a previous harvest, you simply use the DELETE on the harvest resource URL. The server would delete all the catalogue records created by the harvest along with the harvest resource itself so that doing a subsequent GET on the harvest endpoint would no longer list hid57 (in my example).
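The harvest, list, and unharvest behaviour described above can be modelled in memory as follows. This is only a sketch of the bookkeeping, not of the HTTP interface; the class, method names, and the "hidN" identifier format are hypothetical.

```python
import itertools


class HarvestRegistry:
    """Tracks which catalogue records each harvest created."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.records = {}   # record id -> record body
        self.harvests = {}  # harvest id -> {"source": url, "records": [ids]}

    def harvest(self, source_url, discovered):
        """POST equivalent: store the discovered records, return the hId."""
        hid = f"hid{next(self._counter)}"
        self.records.update(discovered)
        self.harvests[hid] = {"source": source_url, "records": list(discovered)}
        return hid

    def list_harvests(self):
        """GET on the harvest endpoint: all known harvest identifiers."""
        return sorted(self.harvests)

    def unharvest(self, hid):
        """DELETE equivalent: drop the harvest and every record it created."""
        harvest = self.harvests.pop(hid)
        for record_id in harvest["records"]:
            self.records.pop(record_id, None)


registry = HarvestRegistry()
hid = registry.harvest("https://example.org/wms", {"r1": {}, "r2": {}})
before = registry.list_harvests()
registry.unharvest(hid)
after = registry.list_harvests()
```

Note that this simple model deletes a record even if a second harvest also produced it; a real implementation would need to decide how to handle records shared between harvests or re-created by periodic reharvests.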
So, this is my thinking about harvesting so far. I have started implementing some of this to see how it flies!
Comments welcome!