Design best practice/approach to publish results into external catalog #450
Comments
We have also discussed this in the EDC project, and our proposal/solution was that we simply pass the canonical link of the STAC metadata over to the EDC service, which then ingests the STAC metadata from there. The initial request to EDC is done via the client (here: the Web Editor). The main difference to your use case seems to be that:
If you are only interested in 2, the collection you are producing could be used to ingest the data into an external source. One question that arises is whether the back-end can easily access and store the data at various external sources, or whether this should reside at the client. We could integrate such functionality at the client level by reading the STAC metadata with e.g. PySTAC and then letting users move the data over to an arbitrary host of their choice (see the sketch below). Or maybe these are two different use cases (external host under control of the back-end vs. under control of the user)? Anyway, all solutions that are not directly part of the initial request would require temporary storage (point 1 above). Thoughts?
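A minimal sketch of that client-level option, assuming a publicly reachable canonical STAC link (the URL below is a placeholder): PySTAC reads the collection and exposes every asset href, which the client could then hand to whatever external host the user picks.

```python
# Client-level sketch: read STAC metadata from the canonical link and
# list asset hrefs so the user (or the Web Editor) can move the data.
import pystac

# Hypothetical canonical link to the batch job results.
collection = pystac.Collection.from_file(
    "https://example.com/results/collection.json"
)

for item in collection.get_items():
    for key, asset in item.assets.items():
        # Each href is a result file the client could transfer
        # to an arbitrary external host of the user's choice.
        print(item.id, key, asset.href)
```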
We indeed want to avoid storing results on the back-end, because files can be very large, and we don't want to leave cleanup up to the user either. I would also avoid requiring a /storage endpoint: S3 is supported by most storage systems nowadays, so I'm leaning towards an export_files_s3 kind of option.
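Purely as an illustration of what such an option could look like from the openEO Python client — export_files_s3 does not exist today, it is the proposal being discussed here, and the collection id and bucket are placeholders:

```python
# Sketch only: "export_files_s3" is a hypothetical save_result option.
import openeo

connection = openeo.connect("openeo.example.com").authenticate_oidc()
cube = connection.load_collection("SENTINEL2_L2A")  # placeholder collection

result = cube.save_result(
    format="GTiff",
    options={
        # Hypothetical: back-end writes results straight into the
        # user's bucket instead of its own temporary storage.
        "export_files_s3": "s3://my-bucket/croptype/",
    },
)
job = result.create_job(title="croptype export to S3")
job.start_job()
```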
The difficulty is the authentication, I assume? If you want to move files between a back-end and an S3 provider, the back-end needs a uniform way to authenticate with the S3 service, or it needs to go through the user's local system, which is not ideal. Thoughts? If back-ends don't want to leave clean-up to the users, would it make sense to clean up by default but let users explicitly extend the storage time, so that the results are stored for another month? Fiddling with S3, or finding a back-end for it, is nothing an average user is necessarily aware of or wants to know about; it is somewhat against the simplicity of openEO. I myself wouldn't know right now what and where to host my data if I wanted to (and AWS is not an option for universities due to the credit card requirement). Also, the Web Editor might soon get an "app-like" behavior where you can load results and details about them from a STAC URL, but for this, people who moved their data over to S3 still need to figure out details like requester pays and CORS issues, which is yet another difficulty that you usually want to take away from users.
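To make the "requester pays" point concrete, this is the kind of detail users who moved results to S3 still have to handle themselves: every download needs an explicit flag plus their own valid AWS credentials (bucket and key below are placeholders):

```python
# Downloading from a requester-pays bucket: the flag below is required
# on every request, and valid AWS credentials must be configured.
import boto3

s3 = boto3.client("s3")  # picks up the user's own AWS credentials
s3.download_file(
    Bucket="my-results-bucket",               # placeholder bucket
    Key="croptype/item_0001.tif",             # placeholder key
    Filename="item_0001.tif",
    ExtraArgs={"RequestPayer": "requester"},  # requester-pays buckets
)
```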
@LukeWeidenwalker @christophreimer: is this linked to the 'users collection' concept?
To answer the questions from @m-mohr: exporting to S3 would mostly target advanced users and projects that already fiddle with S3, so you are correct in saying it is less simple compared to other options we offer. With the new 'user workspaces' option that is being proposed, we would probably end up with something similar, but with some options to make it simpler. For instance, if the user workspace supports OIDC authentication, the credentials issue may be solved (see the sketch below).
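As a hedged sketch of how OIDC could solve the credentials issue: the back-end could exchange the user's OIDC token for temporary S3 credentials via STS AssumeRoleWithWebIdentity, so it never stores long-lived user secrets. The role ARN and token are placeholders, and whether user workspaces would actually work this way is an open assumption.

```python
# Exchange an OIDC token for temporary S3 credentials (sketch).
import boto3

user_oidc_token = "..."  # token the user already obtained via OIDC login

sts = boto3.client("sts")
resp = sts.assume_role_with_web_identity(
    RoleArn="arn:aws:iam::123456789012:role/openeo-workspace",  # placeholder
    RoleSessionName="openeo-export",
    WebIdentityToken=user_oidc_token,
)
creds = resp["Credentials"]

# Short-lived client: the back-end holds these credentials only for
# the duration of the export.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```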
For the EU27 croptype map, openEO generates 10000+ products. These are all separate batch jobs with their own metadata, which in fact make up a single STAC 'collection'.
Instead of keeping this in openEO, I would prefer to publish these results immediately into an external STAC or OpenSearch catalog.
Requirements:
So I'd like to design an approach for this. One way is to add various options to save_result, or do we perhaps need an 'export_result' process?
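For the publishing direction, a minimal sketch assuming the external catalog implements the STAC API Transaction extension (POST /collections/{collectionId}/items); the catalog URL, collection id, and bearer-token auth are assumptions, not a confirmed design:

```python
# Sketch: push one STAC item per finished batch job into an external
# catalog via the STAC API Transaction extension.
import requests

CATALOG = "https://catalog.example.com"  # placeholder catalog URL
COLLECTION_ID = "eu27-croptype"          # placeholder collection id

def publish_item(item_dict: dict, token: str) -> None:
    """POST a single STAC item produced by a batch job."""
    resp = requests.post(
        f"{CATALOG}/collections/{COLLECTION_ID}/items",
        json=item_dict,
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
```

Each of the 10000+ batch jobs could call something like this once per produced item, so nothing would need to remain on the openEO back-end.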