suggest dataset title depending on content #23

Open
jdries opened this issue Nov 27, 2024 · 16 comments

@jdries

jdries commented Nov 27, 2024

As the collection id will now be a random string, we need a human readable title which aligns with our dataset naming convention:

<year>_<region>_<short_dataset_description>_<type>_<information_content>

year -> to be derived from valid_time
region -> to be derived from dataset extent --> convert to 3 letter country or region code
short dataset description -> can initially be filename of uploaded dataset. The user is allowed to change this part and ONLY this part
type -> POINT or POLY
information_content -> 3 digit code specifying presence of land cover, crop type and irrigation labels
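
A minimal sketch of how the five parts listed above could be joined into a title. The function names, the lowercase output (taken from the example `2022_glo_vito_point_100` further down this thread), the stripping of non-alphanumeric characters from the description, and the reading of the 3-digit code as one 0/1 flag per label type are all assumptions, not confirmed behaviour of the RDM.

```python
import re


def information_content_code(has_land_cover: bool, has_crop_type: bool,
                             has_irrigation: bool) -> str:
    """Build the 3-digit code.

    Assumption: one 0/1 digit per label type, in the order given in the
    issue (land cover, crop type, irrigation).
    """
    flags = (has_land_cover, has_crop_type, has_irrigation)
    return "".join("1" if flag else "0" for flag in flags)


def build_title(year: int, region: str, description: str,
                geometry_type: str, info_code: str) -> str:
    """Join the parts as <year>_<region>_<description>_<type>_<info>."""
    if geometry_type.upper() not in {"POINT", "POLY"}:
        raise ValueError("geometry_type must be POINT or POLY")
    if not re.fullmatch(r"[0-9]{3}", info_code):
        raise ValueError("info_code must be exactly three digits")
    # Assumption: keep only letters and digits in the user-editable part,
    # so it cannot introduce extra underscores into the title.
    clean = re.sub(r"[^A-Za-z0-9]+", "", description)
    if not clean:
        raise ValueError("description must contain letters or digits")
    parts = [str(year), region, clean, geometry_type, info_code]
    return "_".join(part.lower() for part in parts)


title = build_title(2022, "GLO", "VITO", "POINT",
                    information_content_code(True, False, False))
```

Only the `description` argument would be exposed to the user; the other four parts are derived from the dataset.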

Also tagging @cbutsko here...

@jdegerickx

I will try to come up with an initial Python script to do this...

@jdegerickx

The plan is to add a new attribute to the RDM called "ref_id", which will hold this information, so the user can still edit the dataset title freely...

@santoshkaranam
Collaborator

@jdegerickx I am not aware of a library that can return the region for a given location. Can you suggest one?
The rest of the items I can generate from the dataset.

@jdegerickx

@santoshkaranam, @cbutsko, I created a script to determine ref_id for a new dataset automatically --> https://github.com/WorldCereal/worldcereal-referencedata/blob/main/harmonization/assign_ref_id.py
As input you currently need a path to a file (can be replaced with geodataframe) and a custom part of the ref_id (e.g. LPIS).
Please have a look and check if it makes sense.

@cbutsko

cbutsko commented Dec 3, 2024

@jdegerickx at first glance it looks fine to me! Do you mind if I take your code and refactor it into a FastAPI call, just like the other calls we have here?
@santoshkaranam Santosh, I guess for you it will also be the easiest way to use it if you can call it like that?
I will need either a path so that I can read the processed dataset (just as in the reprojection function), or maybe just a collection_id, so that we can query everything we need for the title/ref_id creation from the RDM directly.
Let me know what you guys think.

@jdegerickx

Thanks @cbutsko! Yes, feel free to convert it to FastAPI so Santosh can actually use it ;)

@santoshkaranam
Collaborator

@jdegerickx @cbutsko, if there is a shapefile with countries, it would be easy for me to convert it to a Parquet file, put it in cloud storage, and use DuckDB to query it with a point from the user dataset (or the centroid of the first polygon) to get the region letters. That would also be Kubernetes-container friendly. Referring to local files in a container is a bit tricky and increases the size of the deployed container.

The rest I already have in my code: year, point/polygon, the file name part (maybe the first 5–10 letters), and the crop and irrigation codes. I also need to check in the database whether the title is a duplicate and, if so, append numbers to the custom part only.
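
The duplicate check described above could look roughly like the sketch below. The function name, the in-memory `existing_titles` collection (standing in for a database lookup), and the numbering scheme (first duplicate gets the suffix `2`) are illustrative assumptions only.

```python
from typing import Iterable


def dedupe_title(year: str, region: str, description: str,
                 geometry_type: str, info_code: str,
                 existing_titles: Iterable[str]) -> str:
    """Return a title not yet in existing_titles.

    Only the custom (description) part is changed: a number is appended
    to it until the full title is unique.
    """
    existing = set(existing_titles)

    def compose(custom_part: str) -> str:
        return f"{year}_{region}_{custom_part}_{geometry_type}_{info_code}"

    candidate = compose(description)
    suffix = 1
    while candidate in existing:
        # Assumption: numbering starts at 2 for the first duplicate.
        suffix += 1
        candidate = compose(f"{description}{suffix}")
    return candidate


taken = {"2022_glo_vito_point_100", "2022_glo_vito2_point_100"}
new_title = dedupe_title("2022", "glo", "vito", "point", "100", taken)
```

Because the number is attached to the custom part, the year, region, type and information-content parts stay parseable by position.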

Can you share the shapefile with regions? I can put it in cloud storage and provide the link. Then we can write Python code to query it and return the region letters, and convert that to FastAPI.

@jdegerickx

@santoshkaranam, the shapefile is located on terrascope. You can find the path in the code I shared, you should be able to access it.
Just be careful with querying for the country code using a single point only, as the dataset might cover multiple countries or even the whole globe... In the code I provided, I first compute a union on all features before determining the country code...
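
To illustrate the caveat above, here is a simplified sketch that looks at every feature instead of a single point. It is not the logic of `assign_ref_id.py`: it uses a per-point lookup rather than a geometric union, the bounding boxes in `TOY_BOXES` are rough made-up stand-ins for a real country shapefile, and falling back to `glo` whenever more than one country is hit is an assumption.

```python
from typing import Callable, Iterable, Optional, Tuple

Point = Tuple[float, float]  # (lon, lat)

# Hypothetical, very rough bounding boxes: (min_lon, min_lat, max_lon, max_lat).
TOY_BOXES = {
    "bel": (2.5, 49.5, 6.4, 51.5),
    "esp": (-9.3, 36.0, 3.3, 43.8),
}


def toy_lookup(point: Point) -> Optional[str]:
    """Return the code of the first toy box containing the point, if any."""
    lon, lat = point
    for code, (min_lon, min_lat, max_lon, max_lat) in TOY_BOXES.items():
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            return code
    return None


def dataset_region_code(points: Iterable[Point],
                        lookup: Callable[[Point], Optional[str]],
                        multi_country_code: str = "glo") -> str:
    """Derive one region code from all features, not just the first one."""
    codes = {lookup(point) for point in points}
    codes.discard(None)  # ignore features outside every known country
    if not codes:
        raise ValueError("no feature falls inside a known country")
    if len(codes) == 1:
        return next(iter(codes))
    # Assumption: any multi-country dataset is labelled with one fallback code.
    return multi_country_code


brussels, leuven, madrid = (4.4, 50.8), (4.7, 50.9), (-3.7, 40.4)
single = dataset_region_code([brussels, leuven], toy_lookup)
mixed = dataset_region_code([brussels, madrid], toy_lookup)
```

Querying with only `brussels` would label the mixed dataset as `bel`, which is the failure mode the comment warns about.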

@cbutsko

cbutsko commented Dec 9, 2024

Added a FastAPI call to retrieve a dataset's country/region code: https://github.com/WorldCereal/worldcereal-referencedata/commit/f613347e63bed78054471f0fb38dbcc0667a374d

@santoshkaranam
Collaborator

@cbutsko I will integrate this to generate the new title.

@santoshkaranam
Collaborator

The title and collection ID are now generated automatically according to the format. For duplicate names, numbers are appended to the identifier.
As of now, the title can also be edited later.

Sample duplicates.
image

@santoshkaranam
Collaborator

image

@jdries
Author

jdries commented Dec 11, 2024

to be tested

@jdegerickx

@santoshkaranam, please ping me here as soon as you have adjusted this to show the organization instead of a random string, then I will proceed with testing everything... Thanks!

@santoshkaranam
Collaborator

@jdegerickx done

image

@jdegerickx

@santoshkaranam, I have tested the procedure.
One remaining issue relates to the automated assignment of a sample_id to every sample in a dataset.
Right now, you only use the unique identifier as the basis for the sample_id. This results in sample_ids like:
Image
Which is not that informative...

The preferred way here is that you actually use the full collection ID as the basis for the sample id, so we would have something like:
2022_glo_vito_point_100-1
2022_glo_vito_point_100-2
...
(also note the hyphen, which makes it easy to read)
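
A minimal sketch of the numbering proposed above, assuming samples are simply counted from 1 in dataset order (the function name and the counting order are illustrative):

```python
from typing import List


def assign_sample_ids(collection_id: str, n_samples: int) -> List[str]:
    """Return '<collection_id>-<n>' for n = 1..n_samples."""
    return [f"{collection_id}-{i}" for i in range(1, n_samples + 1)]


sample_ids = assign_sample_ids("2022_glo_vito_point_100", 3)
```

Since the collection ID contains underscores but no hyphen, the sample number can be recovered by splitting on the last hyphen.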

Could this be adjusted?

Also this way, the identifier does not need to be unique, which is a huge advantage for the user!
(If you agree, can you remove the "Unique" in the guideline?)
