
Identify two Workflows that are good examples of the types of Workflow that EarthCODE should support. #17

edobrowolska opened this issue Jun 19, 2024 · 4 comments
Please consider the following:

  1. What format are the workflows in (how are they defined, and do they conform to a standard, etc.)?
  2. Are they machine executable?
  3. Could they be machine executable?
  4. What format are the products they produce in?
  5. How are the Workflows maintained and used?
  6. Does the current EarthCODE SOW support the typical needs of these workflows (have we missed functionality needed by these Workflows)?
@edobrowolska edobrowolska self-assigned this Jun 19, 2024
edobrowolska commented Jun 19, 2024

The following examples have been identified:

Annual mass budget of Antarctic ice shelves from 1997 to 2021

All code required to reproduce the results presented in Davison et al. Annual mass budget of Antarctic ice shelves from 1997 to 2021. Also provided are the ice shelf masks and 500x500 m basal melt rates.
Access: Data and code for: "Annual mass budget of Antarctic ice shelves from 1997 to 2021" (zenodo.org)

1. Workflow format:
a. The workflow is a .txt file (workflow.txt inside the .zip archive) describing each step of the analysis in human-readable language.
b. It is not written in a standardized way. The author lists each step as a set of procedures, giving the names of the files to execute, references, paths, etc.

2. Are they machine executable?
The workflow itself is not machine executable, but each component of it is.

3. Could they be machine executable?
• Each component (stage) described in the workflow is machine executable.
• Each component (step) of the workflow is stored in a separate .m file.
• Some input data described in the workflow is missing.
The code, e.g. make_ice_shelf_grounding_line_flux_gates.m, requires specific shapefiles which are not provided in the repository:
o ice_shelf_masks/complete/minimum_ice_shelf_mask_Antarctica_BJD_v03.shp
o GroundingLine_Antarctica_v02.shp
o Basins_IMBIE_Antarctica_v02.shp
• The missing input data for the remaining MATLAB scripts must still be checked.
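A check for the missing inputs can be automated before attempting a run. A minimal Python sketch using the three paths listed above; the helper itself is hypothetical, not part of the repository:

```python
from pathlib import Path

# Shapefiles that make_ice_shelf_grounding_line_flux_gates.m expects but the
# Zenodo archive does not ship (paths as listed above).
REQUIRED_INPUTS = [
    "ice_shelf_masks/complete/minimum_ice_shelf_mask_Antarctica_BJD_v03.shp",
    "GroundingLine_Antarctica_v02.shp",
    "Basins_IMBIE_Antarctica_v02.shp",
]

def missing_inputs(repo_root):
    """Return the subset of REQUIRED_INPUTS not present under repo_root."""
    root = Path(repo_root)
    return [p for p in REQUIRED_INPUTS if not (root / p).is_file()]
```

Running this against the unpacked published archive would report all three paths as missing; as gaps are filled, the list shrinks.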

4. Workflow output format:
Products produced by this workflow are in different formats:
• Final product: basal melt in .tif format
• Melt rate comparisons: .png
• Timeseries: .csv
• Plots: .png
• Individual masks: .mat
• Merged masks: .shp

5. How are the Workflows maintained and used?
The workflow is maintained together with the code files and output data in a .zip archive stored in the Zenodo persistent repository.

Considerations:
• At the moment only the final product is stored in the OSC cloud (S3 bucket).
• A solution for transferring the remaining data (masks, plots, time series) to the ESA cloud repository should be provided as well, since at the moment the entire dataset is stored in a .zip file.
• A solution must be provided to convert the human-readable workflow in the .txt file into a machine-executable workflow, by connecting all the steps, which are provided in executable form (.m files).
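As a first step toward that conversion, a thin driver script could execute the .m steps in the order given by workflow.txt. A sketch, assuming a MATLAB release with the `-batch` option (R2019a+) is on the path; only the first step name comes from the repository listing above, the rest would be transcribed from workflow.txt:

```python
import subprocess

# Ordered steps transcribed from workflow.txt; only the first name appears in
# the repository listing above, further entries are placeholders.
STEPS = [
    "make_ice_shelf_grounding_line_flux_gates",
    # "next_step_from_workflow_txt", ...
]

def run_workflow(steps=STEPS, dry_run=False):
    """Run each MATLAB step via `matlab -batch`; stop on the first failure."""
    commands = []
    for step in steps:
        cmd = ["matlab", "-batch", step]
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if a step exits non-zero,
            # so a broken step halts the chain instead of corrupting outputs.
            subprocess.run(cmd, check=True)
    return commands
```

This keeps the workflow definition as a plain ordered list, which could later be replaced by a standardized description (e.g. CWL) without touching the individual .m files.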

Supraglacial lakes and channels in West Antarctica and Antarctic Peninsula during January 2017

The mapped supraglacial lake and channel polygons are available on Zenodo (https://doi.org/10.5281/zenodo.5642755, Corr et al., 2021) as digital GIS (Geographic Information System) shapefiles (.shp), Keyhole Markup Language (.kmz) zipped files and GIS GeoJSON files.
The code used to produce the lake and channel dataset for each sensor (S2 and L8) is written in Python and can be accessed on Zenodo (https://doi.org/10.5281/zenodo.4906097, Corr, 2021).
The datasets consist of the final lake and channel polygon maps for both sensors combined (i.e. the final maximum extent map of supraglacial hydrology) plus polygons for each sensor: L8 (17 571 individual polygons) and S2 (23 389 individual polygons). In addition, predictor data for each sensor (i.e. the data tiles containing all bands for S2 and L8) are provided for each of the polygons. Landsat-8 and Sentinel-2 imagery are freely available at https://earthexplorer.usgs.gov/ and https://scihub.copernicus.eu/, respectively.
Access: https://zenodo.org/records/5642755

1. Workflow format:
a. The workflow is linked from the dataset entry via a separate Zenodo DOI.
b. The workflow can be downloaded as a .zip file (diarmuidcorr/Lake-Channel-Identifier-v1.0.zip) containing a Readme.md file, a licence .txt file and two executable Python scripts.
c. The workflow can also be accessed via GitHub (link provided as well): https://github.com/diarmuidcorr/Lake-Channel-Identifier/tree/v1.0

2. Are they machine executable?
The workflow is machine executable, either after download or via direct access from GitHub (it is written in Python with additional comments).

3. Could they be machine executable?
Yes. After downloading, the workflow can be executed by a machine; when accessed from GitHub it is immediately accessible and executable.

4. Workflow output format:
a. Several types of products are produced: the supraglacial lake and channel polygons as digital GIS shapefiles (.shp) and GeoJSON files, as well as in Google Earth format (.kmz).
b. The maximum extent of the supraglacial lakes and channels dataset for each sensor (S2 and L8) in GeoTIFF format. Output files are stored in .tar.gz archives, each containing vector + GeoTIFF data.
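Because each bundle mixes vector and GeoTIFF data, a consumer may want to inventory an archive before extracting it. A small stdlib-only sketch (the grouping rules are illustrative, not taken from the repository):

```python
import tarfile

def split_members(archive_path):
    """Group the members of a .tar.gz into vector, raster and other files."""
    vector_ext = (".shp", ".shx", ".dbf", ".geojson", ".kmz")
    groups = {"vector": [], "raster": [], "other": []}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            name = member.name.lower()
            if name.endswith(vector_ext):
                groups["vector"].append(member.name)
            elif name.endswith((".tif", ".tiff")):
                groups["raster"].append(member.name)
            else:
                groups["other"].append(member.name)
    return groups
```

Such an inventory step is also what an ingestion pipeline would need in order to decide which members become cloud-hosted assets.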

5. How are the Workflows maintained and used?
The workflow is maintained together with the code and output data in the Zenodo repository. It is also maintained on GitHub, where it is accessible to users.

Considerations:
• Should the dataset be transferred to the ESA cloud, given that the online repository hosts .zip files, which are not accessible for on-cloud operations?
• Should the workflow (.py) files be transferred to the ESA cloud as well (they are stored in .zip format at the moment)?
• Should a single STAC Item be created for each single GeoTIFF, or can the dataset be stored as a general link to the repository?
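On the STAC question, one Item per GeoTIFF is cheap to generate. A minimal sketch building a bare STAC 1.0.0 Item as plain JSON; the id, href, bbox and datetime are illustrative placeholders, and a real catalogue would likely use a library such as pystac with real footprints:

```python
import json

def make_stac_item(item_id, href, bbox, datetime_str):
    """Build a minimal STAC 1.0.0 Item dict for a single GeoTIFF asset."""
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": {  # rectangle footprint derived from the bbox
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        "bbox": list(bbox),
        "properties": {"datetime": datetime_str},
        "links": [],
        "assets": {
            "data": {
                "href": href,
                "type": "image/tiff; application=geotiff",
                "roles": ["data"],
            }
        },
    }

# Hypothetical example for one maximum-extent tile (filename is made up):
item = make_stac_item(
    "supraglacial-max-extent-S2-tile-001",
    "https://zenodo.org/records/5642755/files/tile_001.tif",
    (-75.0, -72.0, -74.0, -71.0),
    "2017-01-15T00:00:00Z",
)
```

The alternative (a single Item whose asset href points at the whole repository) loses per-tile search, which is the main argument for Item-per-GeoTIFF.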

@edobrowolska

I have also placed this summary in the document here, together with two other examples demonstrating other possible workflows. I have shared here the ones that I find to be demanding and complex examples to start with, as they include different data types and different workflows.

EarthCODE-workflow-examples-issue#17.docx

@GarinSmith

Thanks @edobrowolska .
@rconway and I met with Anglos today to discuss these general concepts further and I have updated my notes for discussion tomorrow.
I have attached them here for reference.

EarthCODE APEX EOEPCA Workflows Approach.pptx

@edobrowolska

Thanks @GarinSmith. I am also adding an update here on point 6 of the workflow analysis:

Workflow integration:

  • Other languages supported by the workflow should be considered, e.g. Julia, MATLAB, bash; various operating systems should be considered as well (e.g. creating environments for Windows, Linux, etc.).
  • Some projects still use toolboxes such as SNAP, PolSARpro and QGIS - how do we share workflows from such tools?
  • Workflows should support different data types: from vector data to GeoTIFF raster data to visualizations (jpg, png).

Workflow flexibility:

  • Within an experiment the user should be able to combine different workflows with different algorithms, programming languages, etc. (one for analysis and another for visualization and writing results).
  • It should be possible to use the results (products) of one workflow (one experiment) in a different workflow (with different algorithms, etc.).
  • Parametrization of the workflow (with custom parameters and changed variables) should be supported.

User feedback:

  • Feedback from users (experts and non-experts) should be provided as well.
  • The user should decide whether an experiment is final and can be published with a new DOI assigned.

Workflow storage:

  • Each experiment should be saved under a unique id. A storage repository for the code and scripts must also be provided, with guidance on the format in which data should be stored.
  • Guidelines on when transferring a workflow (from the data provider's storage) to the ESA cloud archive is necessary should be provided, together with guidelines on the file naming convention.
  • A workflow should be discoverable together with the product it produces, similar workflows, and its previous versions.
  • It should be possible to download a workflow to a local repository and execute it there.

Workflow creation and guidelines:

  • To ensure that a workflow can read its input datasets correctly, the same level of detail should be provided for all input datasets and the same granularity supported.
  • The user should be able to decide whether a workflow is ready to be released or must remain hidden (for intermediate steps).
  • The progress of contributors uploading and ingesting workflows should be tracked and monitored, with assistance provided where needed.
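The flexibility points above (combining workflows, feeding one workflow's products into another, parametrization) can be made concrete with a tiny composition layer. Everything here is hypothetical, sketched only to illustrate the requirement, not a proposed EarthCODE design:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Workflow:
    """A named, parametrized step; run() feeds params plus upstream results."""
    name: str
    func: Callable[..., Any]
    params: dict = field(default_factory=dict)

    def run(self, upstream=None):
        # Custom parameters are passed as keyword arguments, so the same
        # workflow can be re-run with changed variables (parametrization).
        return self.func(upstream, **self.params)

def chain(workflows, data=None):
    """Feed each workflow's product into the next (cross-workflow reuse)."""
    for wf in workflows:
        data = wf.run(data)
    return data

# Hypothetical two-stage experiment: analysis in one workflow, reporting in
# another, each with its own parameters (and, in principle, its own language).
analysis = Workflow("melt-analysis", lambda _, threshold: [1, 4, 9, 2], {"threshold": 3})
report = Workflow("summary", lambda values, fmt: fmt.format(max(values)), {"fmt": "max={}"})
result = chain([analysis, report])  # "max=9"
```

The key property is that `report` only sees `analysis`'s product, so either stage can be swapped for a workflow with different algorithms without touching the other.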
