suggest dataset title depending on content #23

jdries · 2024-11-27T12:13:29Z

As the collection id will now be a random string, we need a human readable title which aligns with our dataset naming convention:

<year>_<region>_<short_dataset_description>_<type>_<information_content>

year -> to be derived from valid_time
region -> to be derived from dataset extent --> convert to 3 letter country or region code
short dataset description -> can initially be filename of uploaded dataset. The user is allowed to change this part and ONLY this part
type -> POINT or POLY
information_content -> 3 digit code specifying presence of land cover, crop type and irrigation labels
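As a rough illustration of the convention above, the five parts could be joined as in the sketch below. The function name `build_title`, the underscore separator, and the lowercasing are assumptions (the latter two are inferred from the example ID `2022_glo_vito_point_100` further down in this thread); this is not the project's actual implementation.

```python
import re


def build_title(year: int, region: str, description: str,
                geometry_type: str, information_content: str) -> str:
    """Join the five naming parts into one title (hypothetical helper).

    Separator and lowercasing are assumptions, see the note above.
    """
    if geometry_type.upper() not in {"POINT", "POLY"}:
        raise ValueError("type must be POINT or POLY")
    if not re.fullmatch(r"\d{3}", information_content):
        raise ValueError("information_content must be a 3 digit code")
    parts = [str(year), region, description, geometry_type, information_content]
    return "_".join(part.lower() for part in parts)


print(build_title(2022, "GLO", "vito", "POINT", "100"))
# prints: 2022_glo_vito_point_100
```

Only `description` would be user-editable here; the other four parts are derived from the dataset itself.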

Also tagging @cbutsko here...

jdegerickx · 2024-12-02T16:41:15Z

I will try to come up with an initial Python script to do this...

jdegerickx · 2024-12-02T16:43:36Z

The plan is to add a new attribute to the RDM called "ref_id", which should host this information and the user can still edit the dataset title freely...

santoshkaranam · 2024-12-03T10:01:40Z

@jdegerickx I am not aware of a library that can return the region for a given location. Can you suggest one?
The rest of the items I can generate from the dataset.

jdegerickx · 2024-12-03T14:38:48Z

@santoshkaranam, @cbutsko, I created a script to determine ref_id for a new dataset automatically --> https://github.com/WorldCereal/worldcereal-referencedata/blob/main/harmonization/assign_ref_id.py
As input you currently need a path to a file (can be replaced with geodataframe) and a custom part of the ref_id (e.g. LPIS).
Please have a look and check if it makes sense.

cbutsko · 2024-12-03T14:54:59Z

@jdegerickx at first glance it looks fine to me! Do you mind if I take your code and refactor it into a FastAPI call, just like the other calls we have here?
@santoshkaranam Santosh, I guess for you it will also be the easiest way to use it if you can call it like that?
I will need either a path so that I can read the processed dataset (just as in the reprojection function), or maybe just a collection_id, so that we can query everything we need for the title/ref_id creation from the RDM directly.
Let me know what you guys think.

jdegerickx · 2024-12-03T15:51:01Z

Thanks @cbutsko! Yes, feel free to convert to FastAPI so Santosh can actually use it ;)

santoshkaranam · 2024-12-03T15:53:06Z

@jdegerickx @cbutsko, if there is a shapefile with countries, it would be easy for me to convert it to a parquet file, put it in cloud storage, and use duckdb to query it with a point from the user dataset (or the centroid of the 1st polygon) to get the region letters. That would also be Kubernetes container friendly. Referring to local files in a container is a bit tricky and increases the size of the deployed container.

The rest I already have in my code: year, point/polygon, the filename part (maybe the first 5–10 letters), and the crop and irrigation codes. I also need to check in the database whether the name is a duplicate and, if so, append numbers only to the custom part.

Can you share the shapefile with regions? I can put it in cloud storage and provide the link. Then we can write Python code to query it and return the region letters, and convert that to FastAPI.

jdegerickx · 2024-12-03T16:00:39Z

@santoshkaranam, the shapefile is located on terrascope. You can find the path in the code I shared, you should be able to access it.
Just be careful with querying for the country code using a single point only, as the dataset might cover multiple countries or even the whole globe... In the code I provided, I first compute a union on all features before determining the country code...
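To illustrate why a single point is not enough, here is a minimal sketch that looks at every feature before deciding on one code. Note that it swaps in a simpler technique than the linked script: instead of a geometry union, it collapses a list of per-feature country codes, which are assumed to come from some earlier spatial lookup. The function name `region_code` and the rule "more than one country gives `glo`" are assumptions (the `glo` code is borrowed from the example ID later in this thread).

```python
def region_code(feature_codes, multi_region_code="glo"):
    """Collapse per-feature country codes into one dataset-level code.

    Hypothetical helper: the fallback for multi-country datasets is an
    assumption, not the rule used in assign_ref_id.py.
    """
    unique = {code.lower() for code in feature_codes if code}
    if not unique:
        raise ValueError("no country code found for any feature")
    if len(unique) == 1:
        return unique.pop()
    return multi_region_code


print(region_code(["BEL", "BEL", "BEL"]))  # prints: bel
print(region_code(["BEL", "FRA", "KEN"]))  # prints: glo
```

Querying only the first feature of the second dataset would have returned `bel`, which is the failure mode described above.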

cbutsko · 2024-12-09T09:25:16Z

Added a FastAPI call to retrieve a dataset's country/region code: https://github.com/WorldCereal/worldcereal-referencedata/commit/f613347e63bed78054471f0fb38dbcc0667a374d

santoshkaranam · 2024-12-09T09:36:44Z

@cbutsko will integrate this to add the new title.

santoshkaranam · 2024-12-11T10:58:45Z

Currently, the title and collection IDs are generated automatically based on the format. For duplicate names, numbers are appended to the identifier.
For now, the title can also be edited later.
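The duplicate handling described above could look roughly like the sketch below. The function name `unique_collection_id`, the underscore-joined ID layout, and the numbering scheme (counter starting at 2, appended directly to the custom part) are all assumptions for illustration; the deployed code may number duplicates differently.

```python
def unique_collection_id(custom, year, region, geometry_type, info, existing_ids):
    """Return an ID not yet in existing_ids by numbering the custom part only.

    Hypothetical helper; existing_ids is assumed to hold lowercase IDs.
    """
    def make_id(part):
        return f"{year}_{region}_{part}_{geometry_type}_{info}".lower()

    candidate = custom
    counter = 1
    while make_id(candidate) in existing_ids:
        counter += 1
        candidate = f"{custom}{counter}"
    return make_id(candidate)


existing = {"2022_glo_vito_point_100"}
print(unique_collection_id("vito", 2022, "glo", "point", "100", existing))
# prints: 2022_glo_vito2_point_100
```

In practice the `existing_ids` check would be a database query rather than an in-memory set.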

Sample duplicates.

santoshkaranam · 2024-12-11T12:15:45Z

jdries · 2024-12-11T12:18:02Z

to be tested

jdegerickx · 2024-12-11T12:36:23Z

@santoshkaranam, please ping me here as soon as you have adjusted this to show the organization instead of a random string, then I will proceed with testing everything... Thanks!

santoshkaranam · 2024-12-11T12:49:21Z

@jdegerickx done

jdegerickx · 2024-12-16T16:10:09Z

@santoshkaranam, I have tested the procedure.
One remaining issue relates to the automated assignment of sample_id to all samples in a dataset.
Right now, you only use the unique identifier as a basis for the sample_id. This results in sample_ids like:

Which is not that informative...

The preferred way here is that you actually use the full collection ID as the basis for the sample id, so we would have something like:
2022_glo_vito_point_100-1
2022_glo_vito_point_100-2
...
(also note the hyphen, which makes it easy to read)

Could this be adjusted?

Also this way, the identifier does not need to be unique, which is a huge advantage for the user!
(If you agree, can you remove the "Unique" in the guideline?)
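A minimal sketch of the proposed sample_id scheme, assuming samples are simply numbered from 1 in dataset order; the function name `assign_sample_ids` is hypothetical.

```python
def assign_sample_ids(collection_id, n_samples, start=1):
    """Build sample IDs as '<collection_id>-<running number>'."""
    return [f"{collection_id}-{i}" for i in range(start, start + n_samples)]


for sample_id in assign_sample_ids("2022_glo_vito_point_100", 2):
    print(sample_id)
# prints:
# 2022_glo_vito_point_100-1
# 2022_glo_vito_point_100-2
```

Because the full collection ID is the prefix, sample IDs stay unique across datasets even when two datasets share the same custom identifier.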

jdries assigned santoshkaranam Nov 27, 2024

jdegerickx self-assigned this Dec 2, 2024
