
Ambiguity around array inputs? #373

Closed · m-mohr opened this issue Oct 25, 2023 · 41 comments · Fixed by #378

Comments

@m-mohr commented Oct 25, 2023

If I want to provide an array of max. two strings, how am I supposed to do that?

  1. Purely JSON Schema:
inputs:
  example:
    schema:
      type: array
      maxItems: 2
      items:
        type: string
  2. maxOccurs:
inputs:
  example:
    maxOccurs: 2
    schema:
      type: string

Are they equivalent?

  3. And what does the following mean?
inputs:
  example:
    maxOccurs: 2
    schema:
      type: array
      maxItems: 2
      items:
        type: string

(Provided in YAML for simplicity)

@gfenoy (Contributor) commented Oct 25, 2023

Are they equivalent?

I think they are, yes.

And what does the following mean?

inputs:
  example:
    maxOccurs: 2
    schema:
      type: array
      maxItems: 2
      items:
        type: string

IMO, it means that your process expects at least one array with a maximum of 2 string items, and also supports, for this input, 2 arrays of at most 2 strings each.

@m-mohr (Author) commented Oct 25, 2023

If the first two examples are equivalent, which would make it easy to "translate" between the two variants, then I'd assume that my third example is equivalent to:

inputs:
  example:
    schema:
      type: array
      maxItems: 2
      items:
        type: array
        maxItems: 2
        items:
          type: string

But that would not match your description of it.

@pvretano (Contributor) commented Oct 25, 2023

No, they are not equivalent ... using @m-mohr's example ...

The first schema can result in an input like this:

example: [ "value1","value2" ]

The second schema can result in an input like this:

example: [["value1","value2"],["value3"]]

The "schema" object defines the schema of a single instance of an input value. If the single instance happens to be an array then so be it. The minOccurs and maxOccurs indicate the cardinality on the input which, in an execute request, are encoded as arrays expect for the special case where maxOccurs=1.

Both the inputs and outputs need more explanation; it's on my todo list to update the specification, which I am hoping to do before the code sprint.

@gfenoy (Contributor) commented Oct 25, 2023

If I recall properly, when there is no minOccurs set for an input, it means that it is required (implicitly "minOccurs": 1).

So, since you did not set "minOccurs": 0 to say the input is optional, the equivalent version of your first example (purely JSON Schema) would be:

inputs:
  example:
    schema:
      type: array
      minItems: 1
      maxItems: 2
      items:
        type: string

And, in my understanding, the following is the equivalent of your first example expressed in the style of your second example (maxOccurs):

inputs:
  example:
    minOccurs: 0
    maxOccurs: 2
    schema:
      type: string

Sorry for saying in the first place that they were equivalent. It seems they are not completely.

@m-mohr (Author) commented Oct 25, 2023

@pvretano How can the second schema (I just added numbers above, to be very explicit here) lead to an array of arrays? The schema is type: string only.

@pvretano (Contributor) commented Oct 25, 2023

@m-mohr as I said, the schema member defines the schema of a single instance of an input. In this case, a string. However, the number of times that input can appear in an execute request is controlled by minOccurs and maxOccurs, and the convention is that multiple values for an input are encoded in a JSON execute request using an array.

@m-mohr (Author) commented Oct 25, 2023

@pvretano Sorry, but I don't get it:

How can the following schema:

inputs:
  example:
    maxOccurs: 2
    schema:
      type: string

allow the following input?

example: [["value1","value2"],["value3"]]

Where does the inner array come from? I really don't understand it.

@pvretano (Contributor) commented:

@m-mohr sorry I may have gotten the schema numbers mixed up ... let me try again.

This schema ...

inputs:
  example:
    schema:
      type: array
      maxItems: 2
      items:
        type: string

leads to inputs like this:

example: ["value1","value2"]

This schema ...

inputs:
  example:
    maxOccurs: 2
    schema:
      type: string

leads to inputs like this:

example: ["value1","value2"]

So these two are equivalent.

But this schema:

inputs:
  example:
    maxOccurs: 2
    schema:
      type: array
      maxItems: 2
      items:
        type: string

leads to inputs like this:

example: [["value1","value2"],["value3"]]

Does this help?

@m-mohr (Author) commented Oct 25, 2023

Yes, this is what I expected, thanks.

I'm not sure why OAP deviated from JSON Schema and added min/maxOccurs instead of just using min/maxItems, but if I can translate from min/maxOccurs to minItems/maxItems with an array type wrapper in JSON Schema, I guess it works for me.

So if I spot schema 2, I'll just translate to schema 1 internally.
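A minimal sketch of that translation (illustrative only, using the hypothetical input name example; note that, as pointed out further down, the two forms are not strictly identical because the maxOccurs form also accepts a bare single value):

# as described (schema 2)
example:
  maxOccurs: 2
  schema:
    type: string

# internal translation (schema 1)
example:
  schema:
    type: array
    minItems: 1
    maxItems: 2
    items:
      type: string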

@gfenoy (Contributor) commented Oct 25, 2023

For this schema:

inputs:
  example:
    schema:
      type: array
      minItems: 1
      maxItems: 2
      items:
        type: string

There is still something missing: in case the example input takes only one value, it is not passed as an array.

So we should add a oneOf to cover this option too, using the same type as the one used for the items within the array.

To be complete, we would also need to add the JSON object that can be used to pass a reference for an input.

@pvretano (Contributor) commented Oct 25, 2023

@gfenoy I could be wrong but I don't think that is correct. The single instance of the input is defined as an array, so it will always be an array. If you specify 1 value, you still need to use an array...

example: ["value2"]

This is why I need to update the specification to clarify all this ... we have the schemas in the specification but very little discussion about what they imply in relation to encoding an execute request.

I hope the others, @jerstlouis @fmigneault, etc., chime in so we can get consensus about this before I start writing. This issue is also related to the email that I sent to @gfenoy and others about this question. Similar issues arise with the outputs too. I'm working on a PR to try and clarify all of this in the specification that should be ready soon, but I would appreciate the input of others too so that I capture the consensus position.

@m-mohr (Author) commented Oct 25, 2023

Playing devil's advocate here, but why not just ditch min/maxOccurs and purely rely on JSON Schema?

@pvretano (Contributor) commented Oct 25, 2023

@m-mohr not opposed to that but let's see what the others say. For some reason though, there is something in the back of my mind that says we did this for a reason but I can't recall why. I will have to dig into my notes again.

@gfenoy (Contributor) commented Oct 25, 2023

@pvretano I would be in favour of this move personally.

Are these comments #168 (comment), opengeospatial/ogcapi-routes#17 (comment) related?

@jerstlouis (Member) commented Oct 25, 2023

So these two are equivalent.

That is not entirely true, because if using maxOccurs: 2, "value1" is still a valid value in addition to [ "value1" ], whereas if using maxItems: 2, the value must always be an array and a single value would always need to be passed as [ "value1" ]. That is the main difference, as well as the fact that the two can be combined effectively allowing two levels of arrays as @pvretano pointed out: [["value1","value2"],["value3"]]. That is how things are specified / should be interpreted for version 1.0.

why not just ditch min/maxOccurs and purely rely on JSON Schema?

When we discussed this originally for 1.0, it was already a huge step to adopt JSON schema at all (see #122, previously, it used a completely different set of properties to describe inputs and outputs) and the impression at the time was that dropping the separate minOccurs/maxOccurs was a step too far that would complicate things for clients. In hindsight, it probably makes things easier.

We were recently discussing this in #363. My proposal for a 1.1 version was to deprecate minOccurs / maxOccurs and encourage use of a schema array type with minItems / maxItems for inputs with multiplicity instead (with the default minOccurs = 1 / maxOccurs = 1 when it is not specified).
The ETS for 1.1 would give a warning (but not an error) for processes still using minOccurs / maxOccurs.
A 1.1 version would allow servers to support both 1.0 and 1.1 at the same end-point and allow for a seamless / smooth transition.

@fmigneault (Contributor) commented:

I would love to have this explicitly detailed in the specification.

Because there was a lot of confusion in the past (as this thread shows) regarding input cardinality vs "single value array", CRIM's implementation evolved into trying to auto-detect minOccurs/maxOccurs and patch the corresponding representation in schema with minItems/maxItems and vice-versa. If an input specified minOccurs: 1/maxOccurs: 2, we would extend the schema with a oneOf such that it supports "value1", ["value1"] and ["value1", "value2"].
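A sketch of what such a patched schema could look like (illustrative, not the normative behaviour):

schema:
  oneOf:
    - type: string        # bare single value: "value1"
    - type: array         # array form: ["value1"] or ["value1", "value2"]
      minItems: 1
      maxItems: 2
      items:
        type: string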

Using exclusively schema could be a good move to remove this confusion, but to remain backward compatible, processes that currently rely on the default minOccurs: 1/maxOccurs: 1 would need to adjust their schemas to:

schema:
  oneOf:
    - <original-schema>
    - type: array
      items: <original-schema>
      minItems: 1
      maxItems: 1

And processes using minOccurs: 2 would use the following to replicate what cardinality provided for a "single value array":

schema:
  type: array
  items: <original-schema>
  minItems: 2

We need to consider very carefully how to handle cases of multiple nested arrays, so there is no ambiguity in process descriptions whether the I/O represent many single values or a single array value.

@pvretano (Contributor) commented Oct 25, 2023

@fmigneault it's on my todo list ... both the inputs and outputs are under-described!
Thinking about it since this issue popped up, I am not sure getting rid of minOccurs/maxOccurs at this time would be the right move as it would introduce quite a bit of disruption. Let me get the PR with minOccurs/maxOccurs ready and then we can discuss and adjust.

@jerstlouis (Member) commented:

I am not sure getting rid of minOccurs/maxOccurs at this time would be the right move as it would introduce quite a bit of disruption.

This is why I am suggesting to simply deprecate it, as in discourage implementors from deploying new processes that rely on minOccurs / maxOccurs (so they default to 1), and instead use type: array with maxItems:.

@pvretano (Contributor) commented:

Actually, I don't think we can get rid of minOccurs/maxOccurs without moving wholesale to using JSON Schema to define the inputs. Consider this input definition:

"inputs" : {
  "myInput": {
    "minOccurs": 0,
    "maxOccurs": 1,
    "schema": {
      "type": "string"
    }
  },
  .
  .
  .   
}

If we get rid of minOccurs and maxOccurs then how do we indicate that myInput is optional? The only thing I can think of is to go all in on JSON Schema and say that the value of "inputs" is an object defined using JSON-Schema that specifies all the inputs the process takes. That is ...

"inputs": {
  "type": "object",
  "required": ["myOtherInput"],
  "properties": {
    "myInput": {
      "type": "string"
    },
    "myOtherInput": {
      "type": "number"
    },
    .
    .
    .
}

Such a change is a breaking change and so we would have to go to V2.0. ... I think. Comments?

@m-mohr (Author) commented Oct 31, 2023

@pvretano Having a default value could indicate that a parameter is optional.

"inputs" : {
  "myInput": {
    "schema": {
      "type": ["string", "null"],
      "default": null
    }
  }, ...
}

or

"inputs" : {
  "myInput": {
    "schema": {
      "type": "string",
      "default": ""
    }
  }, ...
}

@jerstlouis (Member) commented:

@pvretano @m-mohr I think we could simply deprecate maxOccurs, and any minOccurs value except 0 and the default 1.

@fmigneault (Contributor) commented:

If minOccurs needs to be preserved, then I would prefer to keep maxOccurs as well. Otherwise, we would have weird combinations of minOccurs outside the schema and maxItems inside the schema. The default: null proposal makes sense in my opinion if both are removed.

In order to reduce the ambiguity between array as a cardinality specifier and the JSON array container passed as a single value, we could disallow this kind of input for maxOccurs: 1:

inputs:
  input_single_value_array: [1,2,3]

Instead, a single-value JSON array would have to be explicitly nested under value, just as must be done for complex JSON objects, to remove the ambiguity with cardinality.

inputs:
  input_single_value_array: 
    value: [1,2,3]

And with this, the only case where an array could be directly provided under the input/output ID would be to represent cardinality. The following would always be equivalent and would assume maxOccurs>1:

inputs:
  input_min_occurs2_short_form: [1, 2, 3]
  input_min_occurs2_long_form:
    - value: 1
    - value: 2
    - value: 3

In the above example, each int can be replaced individually by anything, including an href or a complex JSON value (including a nested array), without changing the interpretation of the first-level array as the cardinality, contrary to the current definition, which could use the short form for a "single value" that just so happens to be a JSON array.

@jerstlouis (Member) commented:

@fmigneault with my suggestion, minOccurs would only be kept for the purpose of indicating optionality with minOccurs: 0.

Any other use would be deprecated along with maxOccurs. If you want an array with at least two elements, you would use minItems: 2.
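i.e., something along these lines (a sketch with the hypothetical input name example):

inputs:
  example:
    schema:
      type: array
      minItems: 2
      items:
        type: string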

@jerstlouis (Member) commented Nov 8, 2023

@pvretano

It just dawned on me that there is one (very important) aspect of that non-JSON Schema input multiplicity that might benefit from the minOccurs / maxOccurs approach.

The JSON Schema only applies to what goes into a direct value or qualified input value ("value": ...).

It does not apply to inputs that are "href" (input by online reference), "collection" (Part 3: Collection Input) or "process" (Part 3: Nested Process).

i.e., the JSON schema was intended to represent a "single input" which could be replaced by an online reference, a collection, or a nested process.

To clarify, if the JSON Schema says it's an array of multiple items, then the type of the file at the href location, or what each collection or each process generates, would have to be multiple things.

So if we rely on an array for that purpose, it mixes things up quite a bit.
Not sure why I didn't realize this earlier, sorry.

@m-mohr (Author) commented Nov 8, 2023

What's the relation between min/maxOccurs and the "special" types anyway? What happens if I can only accept a single collection but multiple hrefs (e.g. multiple COGs)?

@jerstlouis (Member) commented Nov 8, 2023

@m-mohr I don't think that would be possible.

The hrefs are references to the one thing that the schema defines.
The collection really is special in that it can output data according to the schema.
Same goes for nested process invocations.

So in that particular case, at least for 1.0, I would make maxOccurs unbounded and declare the schema to be binary image/tiff; application=geotiff. Clients could pass a single collection or a single COG, or multiple COGs, or multiple collections (each able to output a GeoTIFF, e.g., from /coverage with Accept: image/tiff; application=geotiff, and an optional subset), or one process invocation, or multiple process invocations.
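A sketch of what such an input description could look like in 1.0 (the input name and the contentEncoding/contentMediaType pattern are illustrative assumptions, following the way Part 1 describes binary inputs):

inputs:
  imagery:
    maxOccurs: unbounded
    schema:
      type: string
      contentEncoding: binary
      contentMediaType: image/tiff; application=geotiff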

@m-mohr (Author) commented Nov 9, 2023

That sounds like a confusing concept to me. How do I know where I can use these special types (href, collection, process) anyway? How do I know when I can pass one or multiple? It looks like the parameters don't describe it. Is it pure trial & error?

@jerstlouis (Member) commented Nov 9, 2023

@m-mohr

How do I know where I can use these special types (href, collection, process) anyway?

You can always use href as an alternative for any input. (See Requirement 18).

For process, you can also use this type anywhere if /conformance declares support for http://www.opengis.net/spec/ogcapi-processes-3/0.0/conf/nested-processes.

For collection, you can use this type for any input that is collection-compatible if /conformance declares support for http://www.opengis.net/spec/ogcapi-processes-3/0.0/conf/collection-input. Currently, an input is considered collection-compatible if its schema describes a media type or JSON object of any geospatial format that any OGC API data access mechanism could output (e.g., Coverages, Features, Tiles, DGGS, EDR, Maps, 3D GeoVolumes...), e.g., GeoTIFF, GeoJSON, CoverageJSON, netCDF, etc. We did discuss at some point having something else that indicates support as a collection, such as the format key. Inputs with geojson-feature-collection as a format would be collection-enabled. We may also want a format value for coverage.

How do I know when I can pass one or multiple?

This is the topic of this issue, isn't it? Currently, in 1.0, it is indicated with maxOccurs.

Is it pure trial & error?

No! :)

@m-mohr (Author) commented Nov 9, 2023

I don't buy that, sorry. Your answers are somewhat conflicting for me. If you set maxOccurs to unbounded for the example, then it's not clear whether I can pass one URL, multiple URLs, one collection, multiple collections, one value, or multiple values. So it's trial & error.

@jerstlouis (Member) commented Nov 9, 2023

@m-mohr If the process description has an input with maxOccurs: unbounded, then it is clear that you can pass for that input any of:

(as defined by Part 1: Core)

  • one value (e.g., base64-encoded coverage or GeoJSON object)
  • multiple values (e.g., base64-encoded coverage or GeoJSON object)
  • one "href" URL
  • multiple "href" URLs

(as defined by Part 3: Workflows & Chaining)

  • one "collection" input or multiple "collection" inputs
  • one nested "process" invocation or multiple nested "process" invocations

There is no trial & error. It's all clear.
What is not clear?
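For example, all of the following would be valid for such an input (a sketch with a hypothetical input name and URLs; the collection form is from Part 3):

# a single href
inputs:
  data:
    href: https://example.com/a.tif
    type: image/tiff; application=geotiff

# multiple hrefs
inputs:
  data:
    - href: https://example.com/a.tif
      type: image/tiff; application=geotiff
    - href: https://example.com/b.tif
      type: image/tiff; application=geotiff

# a collection (Part 3)
inputs:
  data:
    collection: https://example.com/ogcapi/collections/imagery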

@m-mohr (Author) commented Nov 9, 2023

My process only accepts one collection and multiple hrefs. I can't encode that; you proposed to use unbounded maxOccurs. So it's trial & error for my users/clients?! To me that looks like a flaw in the spec.

@jerstlouis (Member) commented Nov 9, 2023

@m-mohr As I explained above, that is not a supported use case.

The collections and processes (and href) are a drop-in replacement for getting one input value.

So if you can receive something from multiple GeoTIFFs or from one collection, your server should be able to also easily retrieve one TIFF per collection that is passed?

(NOTE: You also need to support base64-encoded values for each TIFF, as crazy as that sounds :P I would complain more about that than about having to support multiple collections.)

@fmigneault (Contributor) commented:

I somewhat agree with @m-mohr regarding collection and the ambiguity of extended formats provided by Part 3.

The usual intent for anyone using a collection is to obtain a collection of references.
Therefore, users expect that { "collection": "<URL-or-Query>" } will be mapped
to some resolved items [ {"href": "<URL-item1>" }, {"href": "<URL-item2>"}, ... ].

If the collection happens to return only one item, it is still going to be an array of a single href.
In other words, it returns the list of /collections/{collectionID}/items that are defined under the queried collection.
The same process could be called directly with the list of href instead of collection, and should produce the same result.
It is therefore expected that this process must know how to handle an array of resources.

My understanding of minOccurs/maxOccurs (correct me if I'm wrong), is that it represents only the cardinality when speaking in terms of value or href inputs/outputs (aka Part 1: Core).
The case of collection requires some kind of implicit conversion of 1->many, and process requires an implicit mapping of childProcess.{output}.min/maxOccurs -> parentProcess.{input}.min/maxOccurs to align.

@jerstlouis (Member) commented Nov 9, 2023

@fmigneault

The usual intent for anyone using a collection is to obtain a collection of references.

That is not at all what Part 3 - Collection Input is about. It's about bridging the OGC API data access Standards (Coverages, Features, Tiles, EDR, DGGS, Maps...) with OGC API - Processes, so that an OGC API data source can be the input to a process. It normally represents an infinite set of possible URL requests.

My understanding of minOccurs/maxOccurs (correct me if I'm wrong), is that it represents only the cardinality when speaking in terms of value or href inputs/outputs (aka Part 1: Core).

Part 3 extends Part 1 with additional types of inputs which are drop-in replacements for the value or href in Part 1 execution requests. Because collections are drop-in replacements for Part 1 value/href, the cardinality also means you can have one or multiple collections.

If the collection happens to return only one item, it is still going to be an array of a single href.
In other words, it returns the list of /collections/{collectionID}/items

A collection input of Part 3 is written as "collection": "https://server/ogcapi/collections/myCollectionID".

It does not include items, and the collection URI must return a Collection Description as defined in OGC API - Common - Part 2, with one or more links to access mechanisms (/items, /coverage, /tiles, /dggs...).

The server then mints its own URIs to request data as needed, which will take into consideration the area/resolution/time of interest (which facilitates also supporting Collection Output, since area/resolution/time of interest flows in from process-triggering requests also as OGC API data access mechanisms), and the overlap in capabilities in terms of formats and APIs supported by both ends of the hop. What it gets back from those minted URIs is defined by the relevant OGC API data access standards.

A typical example of cardinality applied to Collection Input is our RenderMap process:

https://maps.gnosis.earth/ogcapi/processes/RenderMap?f=json

You can provide one or more layers as either embedded values or href of GeoTIFF, GeoJSON, or nested process, or collections.
Each layer can be dropped-in replaced by a nested process or collection.
e.g., see example request at:

https://maps.gnosis.earth/ogcapi/processes/RenderMap/execution?response=collection

@fmigneault (Contributor) commented:

I see, so collections I/Os are simply a hack around having an actual process that is an OGC-API client that knows how to interpret a Collection Description to obtain a relevant list of href. Looking at your example, which includes https://maps.gnosis.earth/ogcapi/collections/NaturalEarth:physical:bathymetry, not only does that define a collection, but this collection itself has a nested list of collections, each potentially returning many different items of various formats (i.e., https://maps.gnosis.earth/ogcapi/collections/NaturalEarth:physical:bathymetry:ne_10m_bathymetry_K_200?f=json).

Given that, I even further agree with @m-mohr.
There is way too much logic involved in converting collection inputs, parsing it correctly and producing alternative outputs.
To support collection, one has to basically support all existing OGC APIs... within OGC API - Processes, and the input basically becomes a massive "accept any cardinality and any media-type", and just hope that the process won't break at execution time.
A whole standalone process should be dedicated for that purpose, and chain the list of href it produces within a (real-atomic step) workflow.
The subsequent step should not have to worry about "how to interpret all OGC-API standards". It should just worry about its own core processing operation.
None of this can be encoded easily, and it is bound to cause inconsistencies between implementations.

@jerstlouis (Member) commented Nov 9, 2023

I see, so collections I/Os are simply a hack around having an actual process that is an OGC-API client that knows how to interpret a Collection Description to obtain a relevant list of href

Maybe now we better understand each other but agree to disagree ;)

It's about OGC API collection as first class objects that can be inputs to processes.
The "relevant list of href" is something dynamic.

not only does that define a collection, but this collection itself has a nested list of collections, each potentially returning many different items of various formats

Correct, but the collection with id NaturalEarth:physical:bathymetry itself supports access mechanisms, including multi-layer vector Tiles and Maps (and Map Tiles).

To support collection, one has to basically support all existing OGC APIs...

Not all, only at least one. The more APIs (and formats, and CRS, and TileMatrixSets, and DGGRSs...) it supports, the more input collections it would be interoperable with.
When executing/validating the workflow (e.g., when submitting a request to https://maps.gnosis.earth/ogcapi/processes/RenderMap/execution?response=collection for collection output) you would get an immediate failure if a hop is unresolvable due to no overlap in client/server OGC API / format capabilities.

are bound to cause inconsistencies between implementations.

That should not be the case if implementations conform to the relevant OGC API standards.

I understand you have a different view on the best way to do this and are not a fan of "Collection Input" / "Collection Output" :)

But I strongly believe in the great potential value of it.
It enables chained processing that can be orchestrated purely in terms of OGC API data Standards servers & clients.
With (data) Tiles and DGGS (Zone Data "what is here?" and Zone Query "where is it?") in particular, this is very powerful!

@fmigneault (Contributor) commented:

When executing/validating the workflow (e.g., when submitting a request to https://maps.gnosis.earth/ogcapi/processes/RenderMap/execution?response=collection for collection output) you would get an immediate failure if a hop is unresolvable due to no overlap in client/server OGC API / format capabilities.

If the first process in a chain generates an output collection URL, I don't understand how you can validate (before executing the process chain) that whichever reference it would generate will work with the second process receiving it. You don't have the value of the output collection URL at that point, nor any of the expected href it would refer to, so how can they validate that their supported formats match?

That should not be the case if implementations conform to the relevant OGC API standards.

That is extremely optimistic of you. 😄
I regularly find inconsistencies between implementations, even if they conform to the standards, since the standards are purposely vague on many aspects to allow flexibility. This thread is but an example of how implementations can diverge even if generally following the same rules.

I like the idea behind collections, if they were more controlled. They have potential. At the moment, however, they feel like a .* regex for validating a string. There's a syntax and a rule in place, but it is so broad that it doesn't really do any validation.

@jerstlouis (Member) commented Nov 10, 2023

If the first process in a chain generates an output collection URL, I don't understand how you can validate (before executing the process chain) that whichever reference it would generate will work with the second process receiving it.

The idea is that the end-user client sends the whole workflow chain to the process at the top of the workflow, and if executing with ?response=collection (collection output) that process will return either:

  • a 4xx error if the workflow does not work out, or
  • a Collection Description if everything is ready to roll, including one or more links to supported OGC API access mechanisms (Tiles, Coverages, DGGS, Features...)

Now the client sees whether it supports the supported API / formats listed in that collection description, and it itself will either be good with it or not.

Now that client can actually be any one of the servers for the nested processes in that workflow.

Any of the process servers receiving an input collection can also validate the advertised access mechanisms against what it itself supports.

So each client along the chain has a simple single request, gets back a collection description, and is either happy with the OGC APIs / formats / CRS or not.

That results in the end-user client either getting a 4xx or 303 (redirection to /collections/{collectionId}) collection description, and the client is either good with it or not. Therefore the whole chain is easily validated before any kind of processing starts, and then you have a well established workflow for which every node can preempt what's coming and easily cache tiled requests etc., for very fast and smooth distributed processing.

We need to test this more in actual implementations. I hope there's an opportunity in Testbed 20 to explore this further, and it would be great to work with you guys at CRIM to experiment further on this together as well.

I regularly find inconsistencies between implementations, even if they conform to the standards, since the standards are purposely vague on many aspects to allow flexibility.

If we have well defined requirements and abstract test suites, leading to good Executable Test Suites, surely there will be a good level of interoperability between implementations of the same standard :) I agree the original Processes - Part 1: Core still had quite a few ambiguities, but the point of this thread is to try to bring clarity and improve on the intended interpretation.

@pvretano To summarize on this aspect of href/collections/processes, if we are to deprecate maxOccurs in 1.1/2.0, what we would need to do is clarify that when substituting an input by an href (or a process or collection in Part 3):

  • an href (or process or collection) would be a substitute for the content the schema links to in the general case,
  • but a substitute for an element of the array if the top level of the JSON schema is an array, or a oneOf of an array or (object or binary type)

In the current 1.0, the href is a substitute only for the schema itself, without those slightly more complicated special rules, and the separate maxOccurs is used to handle that multiplicity.
Note that if we normally want the clients to be able to either pass a single object (or hrefs/collections/processes) or an array, explicitly describing that in the schema will be quite cumbersome, having to repeat the schema (which itself might be a oneOf of several formats etc.) twice. So perhaps minOccurs / maxOccurs was not such a bad idea after all, and we just need to clarify the ambiguity.

@fmigneault (Contributor) commented:

chain is easily validated before any kind of processing starts

Not sure if I really agree with easily validated 😅

What I have in mind is the following Workflow (whichever representation used between openEO, CWL, or nested processes) :

  1. Process collection-fetcher is used to obtain an output URL formed as a collection, based on multiple inputs used as queries. Imagine, for example, some filtering properties or an NLP-based query-search of collections in a STAC API or an OGC API - Features. That process can take a while to run, and we don't know in advance what it yields.
  2. Process collection-processor accepts either a URL reference to a collection, or directly an array of href/value that corresponds to relevant items from that collection, to perform some kind of processing over multiple items (GeoJSONs, GeoTiffs, or any other data really).

When the workflow that encodes collection-fetcher -> collection-processor is requested for execution, collection-processor does not have any way to preemptively validate that the workflow steps will work, since we don't know yet at that point what collection-fetcher will generate (assuming what was mentioned previously that it simply knows how to interrogate OGC APIs). Once collection-fetcher would have executed, then yes we could validate, but not before. Since, according to these definitions, collection-fetcher could emit basically any collection reference to any OGC (or compatible) API standard, we can only trust that collection-processor would somehow resolve it correctly. This becomes an even bigger problem when the implementation supports both Part 2 and Part 3 (what I hope to achieve...). The Workflow must be pre-validated, because it will not be executed at all when deployed. It can only rely on Process Descriptions to validate that steps "align".

The only situation where I can foresee a collection resolution and validation between those steps, is if collection was only a different (shortcut) representation to an array of href/value. If a Process Description indicated that it supports an array of application/geo+json, then submitting either a single input collection URL that returns GeoJSONs or an array of href to all those GeoJSONs would be equivalent. Just to be explicit, an array of collection would become a 2D array of href, and so on. Otherwise, there is no way to tell the process if multiple collections should be flattened/aggregated or processed individually/parallelized. A robust workflow that could be validated before execution would be to replace collection-fetcher by some kind of geojson-fetcher that explicitly states it will return an array of GeoJSON. In a way, the exact same workflow could be implemented with CWL and openEO, regardless if the implementation supports collection or not. The collection becomes a convenience mechanism to chain APIs or provide a single URL by the client, but they must still agree on their I/O formats and min/max occurs.
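A sketch of that "shortcut" interpretation (hypothetical URLs; this reflects the proposal above, not the current Part 3 definition):

# collection form
inputs:
  items:
    collection: https://example.com/ogcapi/collections/scenes

# equivalent expanded form, if collection were only a shortcut for the resolved items
inputs:
  items:
    - href: https://example.com/ogcapi/collections/scenes/items/scene-1
      type: application/geo+json
    - href: https://example.com/ogcapi/collections/scenes/items/scene-2
      type: application/geo+json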

Note that if we normally want the clients to be able to either pass a single object (or hrefs/collections/processes) or an array, explicitly describing that in the schema will be quite cumbersome, having to repeat the schema (which itself might be a oneOf of several formats etc.) twice. So perhaps minOccurs / maxOccurs was not such a bad idea after all, and we just need to clarify the ambiguity.

I agree.

@jerstlouis (Member) commented Nov 11, 2023

@fmigneault

Not sure if I really agree with easily validated 😅

The thing to realize about the validation is that for each hop, the client (or server acting as a client) executes a simple operation that is well defined by an OGC API. Either it:

  • (for a nested process supporting Collection Output) POSTs a simple subset of the workflow to the next immediate server in the chain (and this is very easy, it simply cuts from { to } the JSON object for that process) to /execution?response=collection, and validates that the sub-workflow itself is validated by the server resulting in a returned collection description, then validates that the offered API/formats/tile matrix sets are supported by the client,
  • (for a collection input) simply retrieves a collection description and validates that the offered API/formats/tile matrix sets are supported,
  • (for a nested process not supporting Collection Output) retrieves the process description and assesses that the inputs/outputs are compatible with the input that the client can send it and the outputs it can retrieve from it. If that nested process itself does not support Collection Input, but the calling client (itself a Processes server) does (and a collection input is used there), it needs to itself do the fetching to provide the relevant bit of the collection to the nested process (or potentially be permitted to refuse to validate). Our MOAWAdapter process handles that sort of scenario to mix Part 1-only and Part 3 implementations, and we will likely try to merge this with our PassThrough process.

BTW please have a look at https://gitlab.ogc.org/ogc/T19-GDC/-/wikis/OpenEO/OGC-API-Processes-Part-3-comparison (Testbed-19/OGC GitLab registration required)

@pvretano (Contributor) commented:

13-NOV-2023: SWG consensus is to NOT deprecate minOccurs and maxOccurs in this version because there are some input cases that might not be handled clearly if we go to a pure schema approach. We need more implementation experience with the schema approach, so for now we will proceed with minOccurs and maxOccurs in place. @pvretano will update PR #378 accordingly. @jerstlouis also mentioned that we somehow need to emphasize that implementations must be prepared to handle input inline or by reference. This is what Requirement 18 says! Also, @pvretano will look into adding direct links to each requirement!
