WDI post-cleanup #3939

Open
1 of 5 tasks
Marigold opened this issue Feb 6, 2025 · 2 comments


@Marigold
Collaborator

Marigold commented Feb 6, 2025

Here are some notes from the WB discussion about WDI and from our chat prior to it:

  • Clarify where we get metadata from. It looks like we're fetching it dynamically in the garden step, but in reality, we get it from a .zip file.
    • (Pablo) My suggestion would be that, in the meadow step, we store two tables: one for the main data, and one for the metadata. This way, in the garden step, there's no need to call load_snapshot again. The garden step can simply depend on the meadow step, which has two tables. We do something similar with FAOSTAT (although there the metadata is a separate dataset, which I think is a worse option).
  • Rethink how we cite WB and their underlying sources. Right now, we take the full source from WDI and shorten it with the help of GPT (see wdi.sources.json and update_metadata.ipynb), but it's inconsistent and doesn't follow our best practices (which have changed since switching from sources to origins). We could also extract more information into the origins fields.
    • (Pablo) Creating an appropriate short citation is an important point, but it sounds like it has already been handled? Additionally, we could put their long citations into our citation_full, with as much detail as they provide.
  • The Statistical Capacity Indicator was replaced by "Statistical Performance" (migrate indicators from the old version).
    • (Pablo) I'll ping Bastian on this specific issue.
  • They've started publishing Release Notes.
    • (Pablo) Not sure what the action item is here. Maybe simply mention it in the docstring of the snapshot step?
  • Describe the updating process in the README or in the snapshot docstring. If we find any quality issues, we should report them to [email protected] and [email protected].
    • (Pablo) Good idea. We don't have any readme for WDI in docs/data, so I suppose the appropriate place to describe the update procedure and contact persons could be the docstring of the snapshot .py file.
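
A minimal sketch of the two-table idea from the first item, using only the standard library rather than the actual ETL helpers. The member names `data.csv` and `metadata.csv`, the column names, and the function `read_tables_from_zip` are all hypothetical and only illustrate the shape: the archive is read once, and both tables come out of the same step.

```python
import csv
import io
import zipfile


def read_tables_from_zip(zip_bytes, data_name="data.csv", metadata_name="metadata.csv"):
    """Read the data and metadata CSVs from a single archive.

    Both member names are hypothetical placeholders, not the real WDI file names.
    Returns a dict with two tables, each a list of row dicts.
    """
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for table_name, member in (("data", data_name), ("metadata", metadata_name)):
            with archive.open(member) as raw:
                # Decode the binary stream so csv can parse it as text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[table_name] = list(csv.DictReader(text))
    return tables


def build_demo_zip():
    """Build a tiny in-memory archive with made-up rows, for illustration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("data.csv", "country,year,indicator_code,value\nA,2020,X.TEST,1.5\n")
        archive.writestr("metadata.csv", "indicator_code,source\nX.TEST,Made-up source text\n")
    return buffer.getvalue()


tables = read_tables_from_zip(build_demo_zip())
```

With both tables produced in one place, a downstream step would only need to depend on that output instead of opening the archive a second time.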
@pabloarosado
Contributor

Rated as priority 2 because it's fresh on our minds and we have momentum. It would be more effort in the future.

pabloarosado assigned pabloarosado and unassigned Marigold Feb 6, 2025
@pabloarosado
Contributor

pabloarosado commented Feb 6, 2025

Hi @Marigold, thanks for summarizing the main conclusions in this issue. I'm waiting for others' feedback on this thread before writing back to WDI.
I've added a few comments on the description above. Feel free to take over those tasks, given that you understand the details of this pipeline better. Otherwise I can delve into it in the coming weeks. And let me know if I can help, thanks!

PS: I'll merge my PR later today, to avoid blocking ETL in production.

2 participants