Feature Engineering with OpenEO - Use Case 1 #190

Open
earthpulse opened this issue Jun 10, 2024 · 20 comments

@earthpulse
Owner

feature engineering for parcels in eurocrops (temporal aggregation on some indices, for example)

  • openEO should get TDS from eotdl
  • user should define processing pipeline (with openEO directly or abstracted in eotdl?)
  • execute process in openEO backend
  • ingest outputs in eotdl as a feature store
@Patrick1G
Collaborator

Patrick1G commented Oct 9, 2024

@jdries @juansensio @jamesemwheeler
Here is a more detailed specification of this use case:

As a user, I want to make use of the EuroCrops dataset in EOTDL, create a filtered subset (EOTDL functionality) and use openEO from within EOTDL to generate predictive features from S1 and S2 time series, then train a model in EOTDL, and run inference with that model in CDSE.

  1. find and explore the EuroCropsDataset and stage it in the EOTDL workspace
  2. filter the EuroCropsDataset using EOTDL functionality to create a subset of parcels,
    e.g., 8 crop classes, each with 1000 examples, for one country
  3. run feature engineering with openEO, creating temporal metrics from S1 and S2 time series (temporally optimised for the crop classes of interest). Store the feature engineering process graph with the training datasets in EOTDL
  4. Use EOTDL functionality to train a model (for this the features need to be retrieved..). Store the model along with the openEO process graph in EOTDL.
  5. Use the model to run inference (from within EOTDL?) in an openEO platform such as CDSE or openEO platform. Make use of the feature engineering process graph stored along with the EOTDL model.

@juansensio
Collaborator

Define the list of features that we want to compute for this task.

We can reuse the S1 and S2 pipelines from WorldCereal (features already validated).

@HansVRP
Collaborator

HansVRP commented Oct 30, 2024

Below I share an example of how we typically access custom STAC collections:

openeo-community-examples/python/LoadStac/load-stac-item-example.ipynb

@HansVRP
Collaborator

HansVRP commented Oct 30, 2024

The example provided in:
https://github.com/earthpulse/eotdl/blob/main/tutorials/notebooks/forest-map.ipynb

Feels like a more natural approach and a workflow we could provide as well.

@juansensio could you clarify whether you want openEO to access the EuroCropsDataset, or whether we want to extract S1 and S2 data that match the spatio-temporal bounds of the EuroCropsDataset?

I believe openEO would be better suited to:

  1. select a region of interest

  2. define a desired preprocessing methodology (save it as a process graph)

  3. download the preprocessed data

  4. Train the desired model on the data

  5. combine the standardized preprocessing with the model to run inference

@HansVRP
Collaborator

HansVRP commented Nov 5, 2024

@juansensio @Patrick1G any feedback on how best to steer this use-case?

@juansensio
Collaborator

Patrick knows more about the use case, but as far as I understand the EuroCrops dataset contains crop classes for parcel polygons, so the goal would be to pair it with additional variables derived from S1/S2 (for example yearly mean NDVI).

openEO should be used to get these variables through a feature engineering pipeline, so we can use them to train a model and then re-use the pipeline at inference time.

Here we can delegate the entire process to openEO, or rely on EOTDL to retrieve the geometries from the STAC catalog and pass them to openEO... I guess the second option is better since we do not need openEO to access the dataset in EOTDL directly (just pass the resulting STAC catalog with geometries).

@juansensio juansensio modified the milestones: v1.10, v1.11 Nov 7, 2024
@Patrick1G
Collaborator

Patrick1G commented Nov 8, 2024

@HansVRP @juansensio the use case is described in detail above; let's follow those steps please

  • indeed, EuroCrops contains many parcel polygons for which we want to create the predictive features from S1 and S2 time series. So with openEO we want to generate the features for each polygon geometry (e.g. via the aggregate spatial process)
  • features should be computed at dense temporal intervals (e.g. weekly or 10-day, via aggregate temporal). There are two openEO notebooks that do similar feature engineering:
    1. crop type mapping
    2. S1-Stats

Next steps then:

  • training should happen in EOTDL
  • test inference run in CDSE
  • feature engineering process graph to be saved along with trained model in EOTDL - to be reused for inference

Not quite sure how step 2 above should be done: EuroCrops contains millions of parcel polygons, and to train a model we only need a subset, e.g. constrained to a country, selected crop types and a random selection of n polygons within that selection. I don't think openEO provides good functionality for this, so it could be done in EOTDL with Python libraries. As a first step, this could also be done offline. To be discussed at the next meeting.
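
The subset selection described here (one country, selected crop types, n random parcels per class) could be prototyped offline in plain Python. The following is a minimal sketch, assuming each parcel is a dict with the hypothetical keys `country` and `crop_type`; the real EuroCrops attribute names may differ, and the data below is synthetic.

```python
import random
from collections import defaultdict


def sample_parcels(parcels, country, crop_types, n_per_class, seed=0):
    """Pick up to n_per_class random parcels per crop type for one country."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    by_class = defaultdict(list)
    for parcel in parcels:
        # "country" and "crop_type" are assumed attribute names
        if parcel["country"] == country and parcel["crop_type"] in crop_types:
            by_class[parcel["crop_type"]].append(parcel)
    subset = []
    for crop in sorted(by_class):
        candidates = by_class[crop]
        k = min(n_per_class, len(candidates))  # a class may have fewer than n parcels
        subset.extend(rng.sample(candidates, k))
    return subset


# Tiny synthetic example (not real EuroCrops data)
parcels = [
    {
        "id": i,
        "country": "AT" if i % 2 else "SI",
        "crop_type": ["wheat", "maize", "rye"][i % 3],
    }
    for i in range(60)
]
subset = sample_parcels(
    parcels, country="AT", crop_types={"wheat", "maize"}, n_per_class=5
)
```

Grouping before sampling keeps the classes balanced, which a single random draw over the filtered parcels would not guarantee.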

@HansVRP
Collaborator

HansVRP commented Nov 8, 2024

Okay, I already have a first version up at https://github.com/earthpulse/eotdl/tree/hv_openeoexample

Todo

  • properly combine the geometries to reduce the total cost
  • Optimize the openeo settings for data extraction.

@juansensio Does EOTDL have a dedicated CDSE S3 storage which we can use to save the results into?

@HansVRP
Collaborator

HansVRP commented Nov 8, 2024

@Patrick1G @jdries

For S2 I used Best Available Pixel composites, which create S2 monthly composites with a minimum amount of clouds. Afterwards I calculated some typical features (percentiles):
https://github.com/earthpulse/eotdl/blob/hv_openeoexample/tutorials/notebooks/openeo/generate_s3_UDP.py

For S1 I used a similar approach
https://github.com/earthpulse/eotdl/blob/hv_openeoexample/tutorials/notebooks/openeo/generate_s1_UDP.py

Please let me know your thoughts

@Patrick1G
Collaborator

@HansVRP the resources above are not accessible.

But it's important to keep the EO science aspects in mind here: we need to generate features/metrics at a high temporal frequency, as this is the critical information for crop type prediction, so 5-, 7- or 10-day interval metrics, not monthly BAP composites. Therefore I would suggest using a feature engineering approach similar to the one in the S1 metrics notebook above: {min, mean, max, stddev, Q25, Q50, Q75, Q90}, generated at e.g. 10-day intervals for the year of the EuroCrops dataset.
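
To make the suggested metric set concrete, here is a local sketch that bins a single parcel time series into 10-day windows and computes {min, mean, max, stddev, Q25, Q50, Q75, Q90} per window. It uses only the Python standard library and synthetic values. It is not the openEO process graph or the notebook code, and the linear-interpolation percentile, the population standard deviation and the start date are assumptions made for illustration.

```python
import statistics
from datetime import date, timedelta


def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def interval_metrics(series, start, days=10):
    """Group (date, value) pairs into fixed windows and summarise each one."""
    bins = {}
    for day, value in series:
        index = (day - start).days // days  # which window this observation falls in
        bins.setdefault(index, []).append(value)
    metrics = {}
    for index, values in sorted(bins.items()):
        window_start = start + timedelta(days=index * days)
        metrics[window_start] = {
            "min": min(values),
            "mean": statistics.fmean(values),
            "max": max(values),
            "stddev": statistics.pstdev(values),  # population std, an assumption
            "q25": percentile(values, 25),
            "q50": percentile(values, 50),
            "q75": percentile(values, 75),
            "q90": percentile(values, 90),
        }
    return metrics


# Synthetic series: one observation every two days for 30 days
start = date(2021, 1, 1)  # arbitrary start date for the example
series = [(start + timedelta(days=d), float(d)) for d in range(0, 30, 2)]
metrics = interval_metrics(series, start)
```

With eight metrics per window and roughly 36 windows per year, each input band yields a few hundred features, which is worth keeping in mind when choosing how many bands to include.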

@HansVRP
Collaborator

HansVRP commented Nov 19, 2024

@Patrick1G @jdries please review the current version.

Here I used weekly composites of which I calculate the P10, P25, P50, P75, P90 percentiles.

The statistics can easily be expanded if required. However, for now I kept them more limited, as I run the statistics across 10 S2 bands and 2 S1 bands, thereby already resulting in a netCDF with 60 bands.
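
The band count follows from 12 input bands times 5 percentiles. A short sketch of that bookkeeping is given below; the band names are placeholders, since the thread does not list which 10 S2 and 2 S1 bands were used.

```python
from itertools import product

# Placeholder band names; the actual selection is not stated in the thread
s2_bands = [f"S2_band_{i}" for i in range(1, 11)]  # 10 Sentinel-2 bands
s1_bands = ["S1_band_1", "S1_band_2"]              # 2 Sentinel-1 bands
percentiles = ["P10", "P25", "P50", "P75", "P90"]

# One output band per (input band, percentile) combination
output_bands = [
    f"{band}_{p}" for band, p in product(s2_bands + s1_bands, percentiles)
]
print(len(output_bands))  # 60
```

Adding further statistics multiplies the count in the same way, e.g. 8 statistics over the same 12 bands would give 96 output bands.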

@juansensio
Collaborator

juansensio commented Jan 24, 2025

Notebook updated at https://github.com/earthpulse/eotdl/blob/main/tutorials/usecases/openEO/use_case_1.ipynb

Blocked by an error with openeo. @HansVRP can you provide some feedback?

Image

Image

Maybe it is an issue with openeo versions?

@HansVRP
Collaborator

HansVRP commented Jan 24, 2025

which version are you currently using?

@juansensio
Collaborator

The error shown above occurs with 0.31.0.

I upgraded to 0.37.0 and now see that the first parameter to run_jobs is optional, but I am still getting the following error:

Image

@HansVRP
Collaborator

HansVRP commented Jan 24, 2025

will take a look next week

@juansensio
Collaborator

Note: GeoDB only stores the STAC metadata. For the kind of filtering proposed, we need the actual data (crop type), which is not in the STAC metadata. Hence, we will not be able to do this filtering directly with GeoDB nor with the STAC metadata (even locally), so Q1/Q2 will not be useful at all. Discuss in next progress meeting @Patrick1G

@HansVRP HansVRP closed this as completed Jan 27, 2025
@HansVRP
Collaborator

HansVRP commented Jan 27, 2025

@juansensio You receive this error because you have a pre-existing job_tracker database (the jobs.csv file).

During development you need to remove the current jobs.csv file, or create a job tracker with a different name, before rerunning the cell.
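
One way to avoid picking up a stale tracker during development is to clear it, or derive a fresh file name, before rerunning the cell. The following is a minimal sketch using only the standard library; the helper name and the `jobs.csv` default are illustrative, and how the resulting path is handed to the job manager depends on the openeo version in use.

```python
from datetime import datetime
from pathlib import Path


def fresh_job_tracker(path="jobs.csv", overwrite=False):
    """Return a tracker path that does not point at a pre-existing file."""
    tracker = Path(path)
    if not tracker.exists():
        return tracker
    if overwrite:
        tracker.unlink()  # discard the old tracker and start clean
        return tracker
    # Otherwise keep the old file and use a timestamped sibling instead
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return tracker.with_name(f"{tracker.stem}_{stamp}{tracker.suffix}")
```

Keeping the old file under its original name (the default) preserves the record of already submitted jobs, while `overwrite=True` matches the "remove the current jobs.csv" option.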

@HansVRP HansVRP assigned juansensio and unassigned HansVRP Jan 27, 2025
@HansVRP
Collaborator

HansVRP commented Jan 27, 2025

I'm btw logging an issue to create a clearer user facing warning on this

@juansensio juansensio reopened this Jan 28, 2025
@juansensio
Collaborator

I deleted the current jobs.csv and jobs.parquet but still get an error @HansVRP

Image

@HansVRP
Collaborator

HansVRP commented Jan 28, 2025

This seems to be an issue with the operation being done/used in the start_job.

I believe the geometries passed are not in the proper GeoJSON format. I'll take a look at the changes you've made and see whether I can get it to run on your current input.
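
A quick structural check of the geometries before they reach the job start function can help narrow this down. The sketch below wraps plain geometry dicts into a GeoJSON FeatureCollection and rejects anything that lacks a recognised `type` or a `coordinates` member. The helper name is hypothetical, and this is not a full validation against the GeoJSON specification.

```python
import json

# Geometry types that carry a "coordinates" member in GeoJSON
GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString",
    "MultiLineString", "Polygon", "MultiPolygon",
}


def to_feature_collection(geometries):
    """Wrap geometry dicts in a FeatureCollection, failing early on bad input."""
    features = []
    for i, geom in enumerate(geometries):
        if not isinstance(geom, dict) or geom.get("type") not in GEOMETRY_TYPES:
            raise ValueError(f"geometry {i} has no valid GeoJSON 'type'")
        if "coordinates" not in geom:
            raise ValueError(f"geometry {i} has no 'coordinates'")
        features.append({"type": "Feature", "geometry": geom, "properties": {}})
    return {"type": "FeatureCollection", "features": features}


# Synthetic square parcel; the first and last positions close the ring
polygon = {
    "type": "Polygon",
    "coordinates": [[[5.0, 51.0], [5.1, 51.0], [5.1, 51.1], [5.0, 51.1], [5.0, 51.0]]],
}
collection = to_feature_collection([polygon])
geojson_text = json.dumps(collection)  # serialisable string form
```

Objects such as shapely geometries or WKT strings would be rejected by this check, which makes it easy to spot where a conversion step is missing.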
