
Proposal OGC API - Processes - Part 4: Job Management #437

Merged (31 commits) on Oct 18, 2024

Conversation

@gfenoy (Contributor) commented Sep 23, 2024

Following the discussions during today's GDC / OGC API - Processes SWG meeting, I created this PR.

It contains a proposal for an additional part to the OGC API - Processes family: "OGC API - Processes - Part 4: Job Management" extension.

This extension was initially discussed here:

The document identifier 24-051 was registered.

@gfenoy gfenoy added the Part 4 (Job Management) OGC API - Processes - Part 4: Job Management label Sep 23, 2024
Comment on lines 1 to 3
type: object
additionalProperties:
  $ref: "input.yaml"
Member

I suggest using an alternate representation:

inputs:
  type: object
  additionalProperties:
    $ref: "input.yaml"
outputs:
  type: object
  additionalProperties:
    $ref: "output.yaml"

Reasons:

  1. Although "outputs" are shown, those represent the requested outputs (i.e., transmission mode, format, or any other alternate content negotiation parameters) submitted during the job creation request in order to eventually produce the desired results. Often, the requested outputs depend on whichever inputs were submitted. Therefore, viewing them separately on different endpoints is not really convenient or useful.

  2. The /jobs/{jobId}/outputs endpoint can easily be confused with /jobs/{jobId}/results. The "request outputs" in this case are "parameter descriptions of eventual outputs", which are provided upstream of the execution workflow. In a way, those are parametrization "inputs" of the processing pipeline.

  3. Because OGC API - Processes core defines specific response combinations and requirements for /jobs/{jobId}/results, /jobs/{jobId}/outputs is a convenient and natural endpoint name that an API can use to provide alternate response handling and representations that would otherwise conflict with the OGC API - Processes definitions. CRIM's implementation does exactly that. I would appreciate keeping that option available.

  4. As a matter of fact, older OGC API - Processes implementations (from the early ADES/EMS days) actually used /jobs/{jobId}/outputs instead of /jobs/{jobId}/results. Adding /jobs/{jobId}/outputs with the proposed semantics would break those implementations.

  5. Having inputs and outputs nested under those fields (rather than at the root) leaves room for further content that could be relevant alongside the inputs/outputs, for example additional links, metadata, or definitions describing those parameters (see the sketch below).
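
For illustration, a minimal sketch of what point 5 could allow, assuming a hypothetical links member and an illustrative link.yaml schema file, neither of which is part of this PR:

inputs:
  type: object
  additionalProperties:
    $ref: "input.yaml"
outputs:
  type: object
  additionalProperties:
    $ref: "output.yaml"
# hypothetical extra member that the nesting leaves room for
links:
  type: array
  items:
    $ref: "link.yaml"   # illustrative file name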

Member

see other comment about inputs

Member

Should we use a $ref to openEO's definition, to avoid maintaining duplicate definitions?
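
A minimal sketch of what that could look like, assuming a hypothetical URL and schema anchor for the published openEO definition (both would need to be confirmed):

# hypothetical external reference; the actual openEO schema location and anchor need to be verified
$ref: "https://example.org/openeo/openapi.yaml#/components/schemas/process_graph"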

Comment on lines +3 to +5
enum:
- process
- openeo
Member

Should this be made a simple string with examples?
Do we want to create the same situation as statusCode, which needs an override because of the new created status?

If the long-term intent is to have job management available for any OGC API, maybe it would be better to define a requirements class stating that, for openEO, type: openeo MUST be used, and process for OGC API - Processes. A "long" Coverage processing extension could then easily define its own requirements class with type: coverage, without causing invalid JSON Schema definitions (see the sketch below).
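
A rough sketch of the relaxed schema suggested above (the surrounding property structure is assumed, and the coverage value is hypothetical):

type:
  type: string
  examples:
    - process    # e.g. required by an OGC API - Processes requirements class
    - openeo     # e.g. required by an openEO requirements class
    - coverage   # hypothetical value from a future Coverage processing requirements class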

Comment on lines 7 to 10
id:
  type: string
processID:
  type: string
Member

The requirements need to be revised. They use the alternate name jobID when referring to the GET /jobs responses.

Similarly, process was mentioned during the meeting.
I'm not sure processID remains relevant, however, because process would be the URI, not just the processID from GET /processes/{processID} (see the sketch below).
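
For clarity, a rough sketch of the renaming discussed above (the property names and the uri format only follow this comment's suggestion and are not settled):

jobID:
  type: string
process:
  type: string
  format: uri   # e.g. the full URI of the process, rather than just the {processID}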

Contributor

So this clause says that there is no encoding for a specific job definition. Good!

It then indicates that this Standard includes two conformance classes ... one for "OGC API - Processes - Workflow Execute Request" and one for "OpenEO Process Graph/UDP". Also good.

However, what about an execute request from Part 1 that executes a single process? Shouldn't I be able to post an execute request to create a job that executes a single deployed process? ... without all the workflow dressing?

Member

Yes, you should be able to execute a single process as well, since it would look similar to an "OGC API - Processes - Workflow Execute Request", just without any nested processes. I think the same schema can be used directly for validation, but an explicit mention of the single-process case could be added to clarify (see the sketch below).
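
As an illustration, a minimal single-process payload that should validate against the same schema (rendered as YAML for readability; the process URI, input and output names are made up):

# POST /jobs -- hypothetical single-process job definition, with no nested processes
process: "https://example.com/ogcapi/processes/EchoProcess"   # illustrative process URI
inputs:
  message: "Hello job management"
outputs:
  result:
    transmissionMode: "value"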

Contributor

@gfenoy @fmigneault I think Part 4 needs to be reorganized so that the "core" is agnostic about what payload creates the job. I think this is already the case, but maybe it's not as clear as it could be. There should then be three conformance classes defined:

  1. A conformance class that binds Part 4 to Part 1 (i.e. creating/managing a job by executing a process),
  2. A conformance class that binds Part 4 to Part 3 (i.e. creating/managing a job by executing a Part 3 OGC workflow/chain),
  3. A conformance class that binds Part 4 to OpenEO (i.e. creating/managing a job by executing an OpenEO UDP)

Other standards can then add their own conformance classes that bind Part 4 to their particular requirements (e.g. coverages).

The fact that a Part 3 workflow, in its simplest form, decomposes to a Part 1 execute request is neither here nor there. Someone who only implements Part 1 and Part 4 should not need to rely on Part 3 in any way.

Now that I've more-or-less finished working on Records I'll switch gears to Processes and make suggestions to Part 4.

@jerstlouis (Member), Oct 14, 2024

Shouldn't Part 3 contain the details of both the Part 1 execution-request-style workflow definitions (whether extended with nested processes or not) as one conformance class, and the OpenEO stuff as a separate requirements class (there is already an OpenEO conformance class in there, as well as a CWL one)?

Then Part 4 could be this "Common" job management thing that is not specific to Processes at all, as we had discussed.

The "Collection" input / output stuff could be moved to another Part 5 if that is causing too much confusion to have it also in Part 3.

Contributor

@jerstlouis I think someone should be able to implement Part 1 and the advanced job management as per Part 4 without having to reference Part 3 (which deals with workflows) at all. The common or core part of Part 4 would eventually move to Common, and Part 4 would simply be the bits binding the Common job management stuff to the Processes-specific payloads, as I described above.

@jerstlouis (Member), Oct 15, 2024

@fmigneault

I think that the fact the process/workflow employs a "Collection Output" should be encoded in the submitted job.

I very strongly disagree with that.
The whole idea of Collection Output is that it is an execution mechanism (as a reminder, despite the name, this is the concept of a "Virtual Collection" with on-demand AoI/ToI/RoI processing -- not the ability of a particular process or workflow to "output a collection", i.e., NOT uploading the results somewhere after a batch process is complete), implemented by the processing engine, irrespective of any particular process or workflow.

It is not something that the individual process or workflow output needs to worry about at all.

would cause conflicts with the multiple APIs that already use this for creating collections without any involvement from OAP.

This is resolved easily by defining a media type or content-schema / profile for the particular type of workflow definition, such as OGC API - Processes execution requests (Content-Type:, Content-Schema:, Content-Profile: header or ?profile= parameter for the POST /collections request).

I'm not sure I understand what you have in mind. Is this related to the ?response=collection query parameter?

Yes, this is currently defined in the "Collection Output" requirements class of Part 3.

If so, why not just POST /jobs/{jobId}/results?response=collection, or even better, reuse response body parameter that was already available in OAP v1 (https://docs.ogc.org/is/18-062r2/18-062r2.html#7-11-2-4-%C2%A0-response-type) when submitting the job?

I thought we agreed in one of the recent meetings that this Collection Output (Virtual Collections) would NOT use Part 4 jobs at all, leaving /jobs for sync and async execution. I am focusing on this POST to /collections approach, potentially eventually implementing a POST to /jobs for the "sync" output, but I have no plan to implement /jobs/{jobId} or anything after that path anytime soon.

Member

The whole idea of Collection Output is that it is an execution mechanism [...]

Then, I STRONGLY recommend renaming it and moving it out of Part 3, because that is both confusing and misleading, especially when the document also presents "Collection Input" as an actual process input.

If they are "On-Demand Virtual Collections", then just name them like this...

[about ?response=collection parameter]
Yes, this is currently defined in the "Collection Output" requirements class of Part 3.

Then, I'm even more confused about why you consider this an "execution mechanism". It looks to me like it is only a specific way to return the output with a guarantee it will be a collection endpoint where you will have a certain set of collection querying/filtering capabilities. I don't see how that is any different from any process that already returns a URI to a STAC/Features/etc. collection as output. Whether that collection URI is a static or virtual collection should not matter at all, and whether any "on-demand" processing must occur to accomplish the query/filtering shouldn't matter either.

I thought we agreed in one of the recent meetings that this Collection Output (Virtual Collections) would NOT use Part 4 jobs at all

Agreed, because "Virtual Collections" have nothing to do with Job definitions.

An on-demand processing could trigger one or many job executions to monitor a virtual collection querying/filtering operation, but no "Job" would be defined to contain the "on-demand" trigger condition itself. Jobs are not pub/sub channel definitions. They are the instantiation of a certain trigger being realized.

"Virtual Collection" with on-demand

I'm not quite sure what that actually changes in the context of OAP (whether Part 3, 4 or whatever).

If some processing is triggered by an input directive when querying the collection, that trigger should perform any relevant workflow processing as if it was submitted on POST /jobs, and publish the result to update the "Virtual Collection" wherever that is located. Which "processing workflow" that should be triggered in such cases should be a property defined under the Collection definition itself. However, the OAP /jobs would not contain any entry about this collection until an access query is actually performed. The POST /jobs gets called "on-demand" when the GET /collections/... operations that need the workflow processing happen.

@jerstlouis (Member), Oct 15, 2024

especially when the document also presents "Collection Input" as an actual process input.

It is exactly the same thing for Collection Input: it is not a particular input to a process, but an additional way in which content can be provided to a process, irrespective of the particular process (an alternative to the inline "value" and "href" mechanisms to provide input to a process). The actual processes invoked by the processing engine (e.g. the Part 2-deployed Docker containers) would never see the collection URI -- they would only see the blob of data coming in, and go through exactly the same code whether that came in as an "href", as a "value", or as a "collection" in the input.

I STRONGLY recommend renaming it and moving it out of Part 3

We could potentially move the requirements classes Collection Input / Output stuff, and the associated input/output field modifiers, to a Part 5 if this helps.
They were in Part 3 because they are part of the syntax of the execution-request workflow definition language (extended Part 1 execution request).
But they could be considered an extension defined in a separate Part 5 if that helps. They were always separate requirements classes from the "Nested Process" / "Remote Processes" requirements classes.

Then, I'm even more confused about why you consider this an "execution mechanism".

Because the actual way to "execute" the process is to:

  • A) Instantiate the virtual collection by POSTing the execution request to /collections
  • B) Perform an OGC API data request using one of the available access mechanisms on that virtual collection (e.g., OGC API - Coverages, Tiles, DGGS, Features, EDR...)

The client never needs to POST to /jobs or to /processes/{processId}/execution, and suddenly all data visualization clients magically become processing-enabled.
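
To make that concrete, a rough sketch of the flow, rendered as YAML with the HTTP steps as comments (the URIs, collection id and payload are made up; the POST /collections binding is the one discussed here):

# 1) Instantiate the virtual collection by POSTing an execution request:
#      POST /collections            (Content-Type: application/json)
process: "https://example.com/ogcapi/processes/NDVI"   # illustrative workflow root
inputs:
  data:
    collection: "https://example.com/ogcapi/collections/sentinel2-l2a"   # Collection Input
# 2) The response identifies a new virtual collection; any OGC API data client can
#    then pull from it with a regular data request, e.g.:
#      GET /collections/ndvi-on-demand/coverage?bbox=5.9,45.8,10.5,47.8
#    and the processing engine executes the workflow only for the requested subset.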

I don't see how that is any different from any process that already returns a URI to a STAC/Features/etc. collection as output.

It is a completely different paradigm. It's not something that is implemented in a per-process way; it's the processing engine that supports executing processes for specific AoI/ToI/RoI, and will itself execute the processes on the data subset as needed. Rather than the input data having to already exist, then the workflow being executed on it, then the processing engine uploading the results somewhere, then the client being notified and retrieving the data, it's a "pull" mechanism where the client sets up the pipeline, then pulls on the output, and everything flows magically. It's similar to the piping mechanism in UNIX. This paradigm avoids all of the problems with batch processing, allows for instant on-demand processing, and minimizes latency. It also easily works in a distributed manner because once this is implemented:

  • any OGC API collection anywhere can be used as an input to a process,
  • any OGC API process or workflow anywhere can be used as an input to another process (because it can produce a virtual collection),
  • any OGC API client (e.g., GDAL) is able to easily access data from a process or workflow by simply doing a single POST operation to create the virtual collection

Member

It is exactly the same thing for Collection Input, it is not a particular input to a process, but an additional way in how content can be provided to a process irrespective of the particular process (an alternative to the in line "value" and "href" mechanisms to provide input to a process)

I read this, and I only see contradictory messages. It's supposed to not be a particular input, but at the same time is an alternative to value/href, which are particular inputs. What?

irrespective of the particular process

This is the case for every input. I'm not sure what the message is here either.

processes invoked [...] would never see the collection URI -- they would only see the blob of data coming in

This is the exact same procedure for an href that would be downloaded if the process needs the data referred by the URI to do its job. The only thing collection allows on top of a usual URI is to auto-resolve some access mechanism and apply additional queries/filters. But, in the end, it is exactly the same. Data gets downloaded after negotiation, and what the process actually sees is a blob of data. So again, no difference. Therefore, no, "Collection Input" is not the same as "Collection Output" if you consider "Collection Output" to be an execution mechanism. It doesn't make any sense to mix them up.

Because the actual way to "execute" the process is to [...]

I agree with all that. Hence my point. What does it have to do with OAP? There is no /collections in OAP. If OAP calls a collection as input/output, why should it care at all whether that collection needs to do any processing behind the scenes? From the OAP point of view, when resolving a workflow chain, it should not care at all what that server does in its backend to serve the requested data from the collection reference.

It is a completely different paradigm [...]
Rather than the input data having to already exist [...]

I fail to see how it is any different. A process returning a STAC collection URI can do all of this as well. Why is it assumed that a URI returned by such a process would be an existing collection? Why do you assume that URI could not also trigger on-demand processing when its data is requested from it? It can already be acting like a "Virtual Collection" conceptually, without any conformance classes trying to describe it as something else that is "special". Everything mentioned is already achievable.

The only thing relevant IMO is the collection as input, to indicate it is not a "simple" download request of the URI. What each node/hop in the chain does afterward, once the relevant data is identified from the collection, is irrelevant for the "local" processing engine. It is the "problem" of that remote collection to serve the data as requested. If that triggers further processing under the hood for that remote server, it is also up to it to execute what it needs to do, however it wants. The local processing engine just needs to worry about what it receives in return, regardless of how it was generated, and pass it along to the next step in the chain.

Anyway. This is getting off track for this PR, so I will stop here.
I've already mentioned all of this many times in #426, and it still makes no sense to me.

@jerstlouis (Member), Oct 16, 2024

This is getting off track for this PR, so I will stop here.

This "PR" is a whole Part. Normally, this would be a separate Project with multiple issues for discussing multiple aspects.

Part 4 is tightly coupled with Part 3, because it proposes /jobs as the new endpoint for workflow execution, which Part 3 was already doing using /processes/{processId}/execution, and which we would probably drop in favor of this Part 4 and the POST /collections for Virtual Collection execution. So we also need to discuss these other things in Part 3 (Collection Input/Output) to try to disentangle all that and see whether they should end up in a separate Part 5.

It's supposed to not be a particular input, but at the same time is an alternative to value/href, which are particular inputs. What?

By "particular input", I meant an input defined as a String URI type in a particular process, which the process will understand to be an OGC API collection. I meant that "collection" is a first-class input type, like "href" and "value", which the implementation of processes do not have to handle themselves since they're taken care of by the processing engine.

This is the exact same procedure for an href

There is a lot of similarity, yes. Good that we are on the same page on that.
However, an href cannot imply an arbitrary format, AoI/ToI/UoI, or access mechanism, because with an href all of that needs to be hardcoded. Therefore a "collection" input only implies a particular collection, and represents "the entire collection" as opposed to a particular subset (see the sketch below).
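
To illustrate the distinction, a sketch of two execution-request inputs (the input names and URIs are hypothetical):

inputs:
  # "href": a fixed resource; format, subset and access mechanism are baked into the URI
  dem:
    href: "https://example.com/data/dem-10m-utm32.tif"
  # "collection": identifies the entire collection; the processing engine negotiates
  # format, AoI/ToI/RoI and access mechanism (Coverages, Tiles, DGGS, ...) as needed
  imagery:
    collection: "https://example.com/ogcapi/collections/sentinel2-l2a"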

What does it have to do with OAP? There is no /collections in OAP.

We talked about moving this "Collection Output" requirements class currently in Part 3 which defines the POST to /processes/{processId}/execution to an "OGC API - Processes - Part 5: Virtual collection output and remote input collections" and changing this to a POST /collections instead, where the payload is an OGC API - Processes execution request (which can contain a workflow as defined in Part 3 "Nested Processes").

it should not care at all what that server does in its backend to serve the requested data from the collection reference.

The processing engine receiving the POST (execution request) to /collections does more than that:

  • It first validates the workflow definition and sets up the virtual collection
  • It can connect data access requests (for the virtual collection output) to trigger processing (whether remote or local)
  • It has a concept of implied AoI/ToI/RoI parameters, which it may map automatically to e.g. a "bbox" parameter when Part 1-style processes require it explicitly as a parameter
  • It can be a Processes - Part 1 client to chain remote processes (the current "Remote Core Processes" requirements class in Part 3)
  • It can retrieve the relevant subset using OGC APIs from input collections

A process returning a STAC collection URI can do all of this as well. Why is it assumed that a URI returned by a such a process would be an existing collection? Why do you assume that URI could not also trigger on-demand processing when its data is requested from it?

The assets of a STAC collection where the actual data is located are different URIs.
While it is technically possible to have separate items/assets for separate RoI/ToI/AoI/bands/formats, e.g. implementing a resource tree like OGC API - Tiles (/tiles/{tmsId}/{level}/{row}/{col}) and OGC API - DGGS (/dggs/{dggrsId}/zones/{zoneId}/data), implementing this as a virtual STAC collection in practice would require listing each of these as separate items/assets, and identifying the relevant items/assets would be painful and inefficient without the STAC API. If the STAC API is available (and works as expected -- multiple STAC API implementations still have issues with their implementation of Features & Filtering/CQL2 even though they claim conformance), then yes, that could work. So then the STAC API + data access would make it a valid OGC API access mechanism for that virtual collection.

I fail to see how it is any different.

The big difference with "Virtual Collection Output" is that a data access client can simply do one POST (execution request) to /collections and get a virtual collection URI back (which did not exist beforehand -- the client just created it when setting up the workflow) and then proceed as usual, as if the collection were a regular OGC API collection (whether STAC, or Coverages, or Tiles...). That makes it super easy to integrate this capability in OGC API data access clients like GDAL / QGIS.

It is the "problem" of that remote collection to serve the data as requested.

In full agreement there.

The local processing engine just needs to worry about what it receives in return, regardless how it was generated, and pass it along to the next step in the chain.

It also needs to know what subset to request from the input collection / remote processes to fulfill its own requests (since a virtual collection is not restricted/hardcoded to a particular AoI/ToI/RoI/...) -- that's how the processing chain flows down from the client pulling on the output to the source data.

I've already mentioned all of this many times in #426, and it still makes not sense to me.

We've discussed all this at length for 4 years in what will eventually be over 100 GitHub/GitLab issues :) Sometimes I feel like we understand each other and are in strong agreement. I'm not sure which part of this makes no sense to you, but hopefully we can continue discussing to address the remaining or new misunderstandings, and get back on the same page :)

fmigneault added a commit to crim-ca/weaver that referenced this pull request Oct 15, 2024
@gfenoy (Contributor, Author) commented Oct 17, 2024

@gfenoy I've been working on implementing job management, and while looking at the PR to apply a comment, I noticed there is no openapi/paths/pJobs file or similar defining the POST /jobs, or any of the other endpoints added under /jobs/{jobId}/....

Thanks a lot for pointing this out. I have drafted some of these missing files and they are now available from there: https://github.com/GeoLabs/ogcapi-processes/tree/proposal-part4-initial/openapi/paths/processes-job-management.

I would like to stress that I only included the newly added methods and did not include the pre-existing ones, which should still be added.

Maybe we can update the update.sh script later on, to concatenate the files with the same name in the original processes-core directory.

@m-mohr commented Oct 18, 2024

@gfenoy What do we do with the unresolved comments in this PR?

@gfenoy (Contributor, Author) commented Oct 18, 2024

@m-mohr for unfinished discussions, please use the issue system and tag your issues with Part 4 (Job Management).

We think this will make it easier to organize the work and the discussions.

@fmigneault (Member)

@gfenoy @pvretano @ghobona
I don't mind going through the PR and creating the issues, but I do not have enough permissions to apply the labels. Can someone grant me this access?

@gfenoy (Contributor, Author) commented Oct 18, 2024

I cannot grant privileges to contributors of this repository.

@m-mohr commented Oct 19, 2024

Hmm, okay. I had hoped that if we review a PR, at least some of the review comments would be considered and the PR updated accordingly. Not even typos with actual committable suggestions were merged. The procedure seems suboptimal. Now I need to spam issues, copying text that may lack context. But anyway, I opened the issues and added a prefix to their titles since I can't assign labels.

@ghobona (Contributor) commented Oct 22, 2024

The labels appear to have been applied.
