Part 3: Types of Collections? #394
@m-mohr See also my reply about Collection Input / Output in the other issue.
It can load from an OGC API collection as defined by OGC API - Common - Part 2: Geospatial data, which is still in draft, but the concept of a collection is already quite clear and supported by several OGC API data access standards including:
I am less familiar with OGC API - Connected Systems, but I think it also fits into that category (as an extension of Features?). A couple clarifications:
In particular, an implementation conceptually does not serve a "collection of tiles" or a collection of DGGS zones; rather it serves a collection of data (features, gridded coverage cell values, point clouds, 3D meshes) which has been tiled, organized as a DGGS, or organized as a bounding volume hierarchy. OGC API - Routes is excluded because it does not depend on Common - Part 2. OGC API - Styles is excluded because it serves portrayal information, not spatiotemporal data (no dependency on Common - Part 2). OGC API - Records in its "Features"-style incarnation is on the fence: in a sense it inherits from Features, but normally it is about metadata rather than actual spatiotemporal data.
Currently, the assumption is based on the process description inputs, i.e., the media types of the input (e.g., GeoTIFF or GeoJSON) and any additional schema information. In terms of which OGC APIs the server actually supports accessing as a client, there is not yet a clear mechanism for specifying that. It would certainly make sense to clarify that, potentially as parameters to the collection-input requirement class.
The requirement is for the API to support at least one of the OGC API data access mechanisms.
Like I mentioned for Records above, STAC is about metadata. The idea is likely to load the relevant scenes described by the catalog as a single coverage, in effect wanting to load the "data" referenced by STAC rather than just the image footprints as vector features. This is one of the things that the Coverage Scenes requirement class tries to clarify -- supporting the same data model and types of queries as STAC, but having this STAC-like API attached to the coverage collection itself.
If the goal is to load spatiotemporal data (e.g., coverage scenes), it should be clear that this is what is intended. Perhaps my suggestion above to clarify that a collection of STAC scene metadata should be loaded as coverage data input would address this? I'm not sure whether there would be a use case to extend this collection input to non-spatiotemporal data. The Collection Input/Output mechanisms are really about this idea of spatiotemporal datacubes (a collection is a coverage / is a data cube).
This is the Input Field Modifiers requirement class, where filter and properties expressions can be applied to an input to filter data and to select or derive fields.
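For illustration, here is a minimal sketch of what a collection input with such field modifiers could look like in an execution request, reusing the sentinel2-l2a collection and band names from the example further down in this thread; the filter expression is a hypothetical CQL2 predicate, and properties derives an NDVI field:

```json
{
  "process" : "https://maps.gnosis.earth/ogcapi/processes/PassThrough",
  "inputs" : {
    "data" : [
      {
        "collection" : "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a",
        "filter" : "B08 > 0",
        "properties" : [ "(B08 - B04) / (B08 + B04)" ]
      }
    ]
  }
}
```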
I share @m-mohr's concerns, especially regarding points like "Is there any way a Processes API can describe which types of collections it can load from? If the requirement is to support all by default, that's a pretty steep requirement.". I think the specification still lacks a lot of details regarding "what to do" with whichever collection type gets loaded. I understand that the idea is to offer flexibility, such that a process expecting some input/output format can negotiate with the referenced collections to pick whichever of the available service types suits best, but my impression is that there are still a lot of assumptions that schemas and content types are enough for data structures to align with the processing intentions. Calling equivalent data collections would yield different results depending on which services the server being called implements. Because the procedure is very abstract, I cannot foresee any way to make this reliably interoperable between implementations. If there was a single reference implementation to interact with all collection types (and correct me if there is one that I don't know of), that would at least help align interoperability, since implementations could understand the expectations for each case, for example with STAC.
Thanks for the feedback @fmigneault .
I fully agree it would be good to improve on that, with some suggestions on how we could go about it in my comment above.
That is not the case, so no steep requirement :)
So far we have implemented the closest thing to a reference implementation at https://maps.gnosis.earth/ogcapi . The RenderMap process supports both feature collections and coverages. Our server supports accessing remote collections as a client through several OGC APIs: Tiles (Vector Tiles, Coverage Tiles, Map Tiles), Maps, Features, Coverages, EDR (Cube queries only) and Processes (using Collection Output). The PassThrough process also expects either a Coverage or a Feature Collection, and provides an opportunity to apply filter or properties modifiers to filter, select, or derive new fields/properties using CQL2 expressions. For example, you can use it to compute an NDVI. It can also be used to easily cascade from any of the supported OGC APIs, making the data available through all of the APIs / Tile Matrix Sets / DGGS / formats supported by the GNOSIS Map Server. The current example is broken due to the cascaded server not being available, so here is an NDVI example to POST at https://maps.gnosis.earth/ogcapi/processes/PassThrough/execution?response=collection:

```json
{
"process" : "https://maps.gnosis.earth/ogcapi/processes/PassThrough",
"inputs" : {
"data" : [
{
"collection" : "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a",
"properties" : [ "(B08 - B04) / (B08 + B04)" ]
}
]
}
}
```

Feature Attributes Combiner supports feature collections. Elevation Contour Tracer expects a DEM coverage. To use remote collections, the external APIs need to be explicitly authorized, so let me know if you want to try doing some TIEs with a particular server. We also have some point cloud and 3D mesh client / processing capabilities, so we could also work towards that. It would be great to experiment on some of this at the Code Sprint in Portugal.

In Testbed 19, WHU did successfully implement Collection Output at http://oge.whu.edu.cn/geocube/gdc_api_t19/ , and Compusult also implemented it in their GDC API (we still have to re-test it to confirm whether they addressed the remaining issues). WHU also mostly has a working "Collection Input" for their local collections, with a small tweak needed, as mentioned at #325 (comment). See https://gitlab.ogc.org/ogc/T19-GDC/-/wikis/Ecere#technology-integration-experiments . See also the demo day presentation for GDC if you have not yet watched it. Multiple clients were able to access the processing as Collection Output. There were also successful Collection Input / Output experiments in Testbed 17 ( https://docs.ogc.org/per/21-027.html#toc64 ).

Some key things to clarify: Collection Input is for collections locally available on the same API. Whether a collection is suitable for a particular process (a topic looked at in Testbed 16 - DAPA) is a problem that is not limited to Collection Input / Remote Collections; it also exists if you pass the data as GeoJSON or GeoTIFF embedded in the request or by reference.

I really believe Collection Output is a key capability that should be considered for the OGC GeoDataCube API (potentially specifically in terms of OGC API - Coverages client / output support), because it provides access to an Area/Time/Resolution of interest from a data cube in exactly the same way regardless of whether you are dealing with preprocessed data or data generated on the fly (through whichever process or workflow definition language). And Remote Collections / Collection Input allow that same capability to be easily integrated in a workflow.
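To make the Collection Output side more concrete, here is a minimal sketch of the kind of collection description a client might get back from the execution request above with ?response=collection (the URLs, id and metadata are hypothetical, not an actual GNOSIS response; only the self and coverage links are shown):

```json
{
  "id" : "ndvi-from-sentinel2-l2a",
  "title" : "NDVI computed on demand by PassThrough",
  "extent" : {
    "spatial" : { "bbox" : [ [ -180, -90, 180, 90 ] ] },
    "temporal" : { "interval" : [ [ "2015-06-23T00:00:00Z", null ] ] }
  },
  "links" : [
    { "rel" : "self",
      "href" : "https://example.com/ogcapi/collections/ndvi-from-sentinel2-l2a" },
    { "rel" : "http://www.opengis.net/def/rel/ogc/1.0/coverage",
      "href" : "https://example.com/ogcapi/collections/ndvi-from-sentinel2-l2a/coverage" }
  ]
}
```

The client would then drive the processing through ordinary data requests on that virtual collection, e.g. subsetting or scaling parameters on the coverage link, so only the requested area / time / resolution actually gets computed.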
This all needs (a lot) more experimentation / implementations. Hopefully at upcoming Code Sprints and Testbed 20! :)
My concern is mostly around the expectations for collections, which I find hard to track. Taking the example above:

```json
{
"collection" : "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a",
"properties" : [ "(B08 - B04) / (B08 + B04)" ]
}
```

Looking at https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a, I can see that Map, Tile, Coverage, etc. are available for that collection. From the parent process https://maps.gnosis.earth/ogcapi/processes/PassThrough, we can see the inputs it declares in its process description.

My impression is that this abstraction, although it can simplify some workflow-building aspects, also makes it much harder to follow. For reproducible science, data lineage and interpretability of what happens behind the scenes, it looks like a black box that "somehow just works". There are also more chances that something breaks in a long chain of processes, which becomes harder to debug because it is more complicated to predict which data is transferred at each step.
I need to fix that in the process description.
Collection Input really works best together with Collection Output. With Collection Output, it is the data requests (e.g., Coverage / Map Tiles or subsets with scaling) that will determine which subset / resolution is being processed. For Sync / Async execution, the process would either:
Sorry again for the bad process description confusion -- PassThrough takes a Coverage or a Feature Collection.
The source sentinel2-l2a collection contains all the fields as described in its schema (a.k.a. range type). The output of PassThrough will contain a single field that is the computed NDVI value specified by the properties expression in the execution request.
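As an illustration only (a hypothetical sketch, not an actual GNOSIS response), if the server exposes a schema for the resulting collection following OGC API - Features Part 5, that single derived field could be described roughly as follows, with the field name being whatever the implementation assigns:

```json
{
  "$schema" : "https://json-schema.org/draft/2020-12/schema",
  "$id" : "https://example.com/ogcapi/collections/ndvi-from-sentinel2-l2a/schema",
  "type" : "object",
  "properties" : {
    "ndvi" : {
      "title" : "(B08 - B04) / (B08 + B04)",
      "type" : "number"
    }
  }
}
```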
How the server acting as a client accesses the coverage (i.e., through which of the supported APIs and formats) is an implementation detail. Data processing intended for a coverage should never use Maps or Map Tiles unless the process is specifically intended for that (e.g., a Map Tiling process), or unless Maps are the only thing available (in which case it can still treat the red, green, blue channels as separate numeric values -- still not super useful).
I think it can really simplify things a lot, and it also makes workflows much easier to reuse with different data sources, areas/times/resolutions of interest, and different implementations and deployments. The same workflow should just work everywhere. So the Collection Input / Output approach is really focused on Reusability, though it might slightly impact Reproducibility (but again we can keep that in check with experimentation/validation).
Ugh, that's a lot of text. Thanks though, although it's hard to fully follow the discussion with limited time available. Two points though:
Thanks for the link, I really struggled to navigate the documents in this repo.
@m-mohr I have no idea what "OGC API - Processes digest STAC API" means. Can you explain further? OGC API - Processes exposes "processes". Processes have inputs and outputs. Inputs and outputs have schemas which are defined using JSON Schema. How exactly does STAC API play in this world? The STAC API, like the Records API, is the Features API with a predefined schema for the "feature" ... a "record" for Records and a STAC Item for STAC. I can see that STAC, like Records, can be used (in a deployment that uses a catalog and a Processes server) to correlate expected process inputs/outputs with available data products (i.e. matching input formats, bands, etc.) and vice versa, but that is a catalog function and is orthogonal but tightly related to Processes ... especially if you are trying to deploy something like a thematic exploitation platform. Is that what you mean?
I think there are two cases for this:
Listing local collections only from processes, or listing processes that work with a specific collection, doesn't help with integrating processes / collections across different servers.
Again it's not about competing with STAC but complementing it. The idea is to align it as much as possible with STAC (at least as an option), where the differences are:
As we discussed previously, I would also like to look into a relational model / binary format (e.g., SQLite -- see 6.3.6.1 STAC relational database) allowing all scenes to be retrieved more efficiently, and allowing the metadata for available scenes to be synchronized across multiple servers by retrieving a delta from a previous synchronization as new scenes become available or are updated.
A STAC API as an input for data, similar to (or exactly like) how Collection Input works in Part 3. @pvretano
Why just as an option?
That's just an extension and doesn't need a Scenes API? You could easily add a coverage endpoint to a STAC API.
Same, why does this need a Scenes API? You could easily add a coverage endpoint to the item endpoint of a STAC API.
I don't understand this.
That's possible with the STAC Transaction extension for Items and Collections (both aligned with the OGC API Transaction extensions whenever possible).
So something like stac-geoparquet (binary format)? Or more like pgstac (database model for PostgreSQL)?
@m-mohr Having a STAC collection option (which implies a two-step access: metadata -> data) makes sense since this is already an established pattern, so we should add clarification text to that effect in Processes - Part 3 Collection Input, stating that STAC is one possible access mechanism. However, there would be an expectation that all STAC assets referenced in the collection populate a collection with a consistent schema (fields). I don't think STAC necessarily implies this for any STAC catalog? There may also be confusion about whether the schema of the collection describes the data (the coverage made up of all scenes / assets) or the metadata (the STAC records). We can potentially deconflict this with the schema profile mechanism of Features - Part 5.

Our own experience trying to use a STAC API instance this way, for the sentinel-2 global overviews use case, was that it did not work: I had trouble making use of the filter capabilities (CQL2 was not yet supported, and I could not make sense of the STAC-specific option in that implementation), and the server would reject responses returning more than a few granules.
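If that clarification text is added, a Collection Input referencing a STAC collection could presumably look just like any other collection reference, with the server then walking the catalog and item assets to assemble the coverage. A minimal sketch (the STAC API URL is a hypothetical placeholder):

```json
{
  "process" : "https://maps.gnosis.earth/ogcapi/processes/PassThrough",
  "inputs" : {
    "data" : [
      { "collection" : "https://example.com/stac/collections/sentinel-2-l2a" }
    ]
  }
}
```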
Because we may want to support alternative representations of the scene-level metadata, not only STAC.

Another use case: the AWS sentinel-2 data that we are proxying has assets organized by granules, but we regroup multiple granules as a single scene. So we may want to expose a STAC API that lists the actual assets available on AWS for that collection, in addition to the regrouped scenes.

There is also a distinction between a "Scene" and a STAC asset: a single Scene implies all assets for all fields / bands of the collection schema, whereas a single STAC asset may be for a single band or multiple bands.
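To illustrate that last distinction, here is a simplified, hypothetical STAC item with one asset per band; the "Scene" abstraction instead groups all of those assets into a single entity carrying the collection's full schema (all bands):

```json
{
  "type" : "Feature",
  "stac_version" : "1.0.0",
  "id" : "example-scene-20240101",
  "bbox" : [ 5.0, 45.0, 6.0, 46.0 ],
  "geometry" : { "type" : "Polygon", "coordinates" : [ [ [ 5, 45 ], [ 6, 45 ], [ 6, 46 ], [ 5, 46 ], [ 5, 45 ] ] ] },
  "properties" : { "datetime" : "2024-01-01T10:30:00Z" },
  "assets" : {
    "B04" : { "href" : "https://example.com/scenes/20240101/B04.tif", "type" : "image/tiff; application=geotiff" },
    "B08" : { "href" : "https://example.com/scenes/20240101/B08.tif", "type" : "image/tiff; application=geotiff" }
  },
  "links" : []
}
```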
It's a "Scenes requirement class" for the "OGC API - Coverages", not a separate "Scenes API".
Same answer -- that's mostly what the Coverage Scenes requirement class does.
The scene-level metadata is provided by the individual scenes within the collection.
We could potentially reference that in the requirement class. However, there are two different and equally valid use cases here being considered:
Yes, what we implemented is probably similar to pgstac.
@m-mohr whether or not a process accepts a STAC item as an input is, as @jerstlouis mentions, an option. That is really a property of the deployed process and depends on the definition of the process, over which the provider of the service may or may not have control (i.e. Part 2). Of course, we can certainly define an optional conformance class for that. In addition, the OGC Best Practice for Earth Observation Application Package has a pretty in-depth discussion of the interaction between STAC and OGC API - Processes. Perhaps we can steal some material from that document.
@jerstlouis @pvretano @m-mohr
@fmigneault Regarding the "generic container types": This is certainly true, and the reason I opened this issue is to solve that. I'd love to see a solution for this. In openEO we have some kind of a general description of which file formats are supported by a back-end for input and output operations (GET /file_formats), but that's relatively high-level and would probably not work well for OGC API - Processes (and CWL-based processes). Maybe we need a way to describe container formats and what they can contain?

We have the same issue in openEO, where our principle is pretty much that everything that goes in is STAC and everything that goes out is STAC. While with a good format abstraction library such as GDAL you can cater for a lot of things, we pretty much just have to throw errors during the workflow if something doesn't meet the expectations. On the other hand, openEO doesn't have as many steps in between where you actually need to pass around STAC. It really just comes into play in openEO if you switch between platforms, but not between processes that run on the same platform. That is a significant difference to OGC API - Processes, where this is fundamental even on the same platform. (Not judging whether something is better or worse, just trying to highlight the differences.)
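For reference, a trimmed-down sketch of the openEO GET /file_formats response shape (the specific formats and empty parameter lists here are only illustrative):

```json
{
  "input" : {
    "GTiff" : { "title" : "GeoTIFF", "gis_data_types" : [ "raster" ], "parameters" : {} }
  },
  "output" : {
    "GTiff" : { "title" : "GeoTIFF", "gis_data_types" : [ "raster" ], "parameters" : {} },
    "netCDF" : { "title" : "netCDF", "gis_data_types" : [ "raster" ], "parameters" : {} }
  },
  "links" : []
}
```

As noted, this only says which container formats exist; it does not describe what a given container is expected to contain, which is the gap being discussed here.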
SWG meeting from 2024-03-04: Related to #395. A PR will be created that will likely close this issue as well.
Part 3 offers a generic way to load from OGC API Collections.
A couple of uncertainties:
Disclaimer: I'm struggling with all these tiny files in the repo, so I may just not have found the answers yet.