suggest dataset title depending on content #23
Comments
I will try to come up with an initial Python script to do this...
The plan is to add a new attribute to the RDM called "ref_id", which will hold this information, so the user can still edit the dataset title freely...
@jdegerickx I don't know of a library that returns the region for a given location. Can you suggest one?
@santoshkaranam, @cbutsko, I created a script to determine the ref_id for a new dataset automatically: https://github.com/WorldCereal/worldcereal-referencedata/blob/main/harmonization/assign_ref_id.py
@jdegerickx at first glance it looks fine to me! Do you mind if I take your code and refactor it into a FastAPI call, just like the other calls we have here?
Thanks @cbutsko! Yes, feel free to convert it to FastAPI so Santosh can actually use it ;)
@jdegerickx @cbutsko, if there is a shapefile with countries, it would be easy for me to convert it to a Parquet file, put it in cloud storage, and use DuckDB to query it with a point from the user dataset (or the centroid of the first polygon) to get the region letters. That would also be Kubernetes-container friendly: referring to local files in a container is a bit tricky and increases the size of the deployed container. I already have the rest in my code: year, Point/Polygon, the file name part (maybe the first 5–10 letters), and the crop and irrigation codes. I also need to check the database for duplicates and append numbers only to the custom part. Can you share the shapefile with the regions? I can put it in cloud storage and provide the link. Then we can write Python code to query it and return the region letters, and convert that to FastAPI.
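A minimal sketch of the duplicate handling mentioned above, where a number is appended only to the custom (description) part. Everything here is an assumption for illustration: the function name `unique_title` is hypothetical, the set `existing_titles` stands in for a database lookup, and the underscore separator and component order are guesses rather than the project's confirmed convention.

```python
def unique_title(parts: tuple[str, str, str, str, str],
                 existing_titles: set[str]) -> str:
    """Return a title not present in existing_titles.

    parts is (year, region, description, geometry type, information content).
    Only the description (the user-editable part) gets a numeric suffix.
    """
    year, region, description, geom_type, info = parts
    suffix = 0
    while True:
        # First attempt uses the description unchanged, then description1, ...
        custom = description if suffix == 0 else f"{description}{suffix}"
        # Assumption: components are joined with underscores.
        title = "_".join((year, region, custom, geom_type, info))
        if title not in existing_titles:
            return title
        suffix += 1


# In-memory stand-in for titles already stored in the database.
taken = {"2021_AUT_lpis_POLY_110", "2021_AUT_lpis1_POLY_110"}
print(unique_title(("2021", "AUT", "lpis", "POLY", "110"), taken))
```

With a real database, the membership test would become a query, and a uniqueness constraint on the title column would still be needed to guard against concurrent uploads.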
@santoshkaranam, the shapefile is located on Terrascope. You can find the path in the code I shared; you should be able to access it.
Added a FastAPI call to retrieve a dataset's country/region code: https://github.com/WorldCereal/worldcereal-referencedata/commit/f613347e63bed78054471f0fb38dbcc0667a374d
@cbutsko will integrate this to add the new title.
To be tested
@santoshkaranam, please ping me here as soon as you have adjusted this to show the organization instead of a random string, then I will proceed with testing everything... Thanks!
@jdegerickx done
@santoshkaranam, I have tested the procedure. The preferred way here is that you actually use the full collection ID as the basis for the sample id, so we would have something like: Could this be adjusted? Also, this way the identifier does not need to be unique, which is a huge advantage for the user!
As the collection id will now be a random string, we need a human-readable title that aligns with our dataset naming convention:
<year>_<region>_<short_dataset_description>_<type>_<information_content>
year -> to be derived from valid_time
region -> to be derived from dataset extent --> convert to a 3-letter country or region code
short dataset description -> can initially be the filename of the uploaded dataset. The user is allowed to change this part and ONLY this part
type -> POINT or POLY
information_content -> 3-digit code specifying the presence of land cover, crop type and irrigation labels
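The components listed above could be assembled along these lines. This is only a sketch under stated assumptions: the five parts are joined in the listed order with underscores, the information content uses 1/0 per label type in the order land cover, crop type, irrigation, and the default description is the file name stem with non-alphanumeric runs replaced by hyphens. All function names are hypothetical and not taken from the project code.

```python
import re
from datetime import date
from pathlib import Path

SEPARATOR = "_"  # assumption: components are joined with underscores


def information_content(has_land_cover: bool, has_crop_type: bool,
                        has_irrigation: bool) -> str:
    """Three-digit presence code; digit order and 1/0 encoding are assumptions."""
    flags = (has_land_cover, has_crop_type, has_irrigation)
    return "".join("1" if flag else "0" for flag in flags)


def geometry_type(geom_type: str) -> str:
    """Map a geometry name such as 'Point' or 'MultiPolygon' to POINT or POLY."""
    kind = geom_type.lower()
    if kind.endswith("point"):
        return "POINT"
    if kind.endswith("polygon"):
        return "POLY"
    raise ValueError(f"unsupported geometry type: {geom_type}")


def default_description(filename: str) -> str:
    """Initial user-editable part: file name stem, kept free of the separator."""
    stem = Path(filename).stem
    return re.sub(r"[^A-Za-z0-9]+", "-", stem).strip("-")


def suggest_title(valid_time: date, region_code: str, filename: str,
                  geom_type: str, has_land_cover: bool, has_crop_type: bool,
                  has_irrigation: bool) -> str:
    parts = (
        str(valid_time.year),           # year, derived from valid_time
        region_code.upper(),            # 3-letter country or region code
        default_description(filename),  # the only part the user may edit
        geometry_type(geom_type),       # POINT or POLY
        information_content(has_land_cover, has_crop_type, has_irrigation),
    )
    return SEPARATOR.join(parts)


print(suggest_title(date(2021, 6, 15), "aut", "my lpis_upload.gpkg",
                    "MultiPolygon", True, True, False))
# prints 2021_AUT_my-lpis-upload_POLY_110
```

Keeping the description free of the separator means the title can later be split back into its five parts, which helps if only the description is meant to be editable.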
Also tagging @cbutsko here...