WDI post-cleanup #3939

Marigold · 2025-02-06T09:25:12Z

Here are some notes from the WB discussion about WDI and from our chat prior to it:

- Clarify where we get metadata from. It looks like we're fetching it dynamically in the garden step, but in reality, we get it from a .zip file.
  - (Pablo) My suggestion would be that, in the meadow step, we store two tables: one for the main data and one for the metadata. This way, the garden step doesn't need to call load_snapshot again; it can simply depend on the meadow step, which has two tables. We do something similar with FAOSTAT (although there the metadata is a separate dataset, which I think is a worse option).
- Rethink how we cite WB and their underlying sources. Right now, we take the full source from WDI and shorten it with the help of GPT (see wdi.sources.json and update_metadata.ipynb), but the result is inconsistent and doesn't follow our best practices (which have changed since we switched from sources to origins). We could also extract more information into the origins fields.
  - (Pablo) Creating an appropriate short citation is an important point, but it sounds like it has already been handled? Additionally, we could put their long citations into our citation_full, with as much detail as they provide.
- The Statistical Capacity Indicator was replaced by "Statistical Performance" (migrate indicators from the old version).
  - (Pablo) I'll ping Bastian on this specific issue.
- They've started publishing Release Notes.
  - (Pablo) Not sure what the action item is here. Maybe simply mention it in the docstring of the snapshot step?
- Describe the updating process in the README or in the snapshot docstring. If we find any quality issues, we should report them to [email protected] and [email protected].
  - (Pablo) Good idea. We don't have any README for WDI in docs/data, so I suppose the appropriate place to describe the update procedure and the contact persons would be the docstring of the snapshot .py file.
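
To make the two-table suggestion above more concrete, here is a minimal sketch using only the Python standard library. It is not the actual ETL code: the function names, the member names `data.csv` and `metadata.csv`, and the column names are all hypothetical placeholders. The idea is just that the archive is read once and both tables come out of the same step, so a later step never has to reopen the .zip.

```python
import csv
import io
import zipfile


def read_tables_from_zip(zip_bytes, data_name="data.csv", metadata_name="metadata.csv"):
    """Parse the data and metadata CSVs out of one zip archive.

    The member names are hypothetical defaults, not the real WDI file names.
    Returns a dict with two entries, "data" and "metadata", each a list of rows.
    """
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for table_name, member in (("data", data_name), ("metadata", metadata_name)):
            with archive.open(member) as raw:
                # Wrap the binary member so csv can read it as text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[table_name] = list(csv.DictReader(text))
    return tables


def build_demo_zip():
    """Build a tiny in-memory archive with made-up rows, for illustration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "data.csv",
            "indicator_code,country,year,value\nX.1,ABC,2020,1.5\n",
        )
        archive.writestr(
            "metadata.csv",
            "indicator_code,indicator_name,source\nX.1,Example indicator,Example source\n",
        )
    return buffer.getvalue()


# "Meadow"-like stage: read the archive once and keep both tables together.
tables = read_tables_from_zip(build_demo_zip())

# "Garden"-like stage: look up metadata from the stored table, not from the zip.
metadata_by_code = {row["indicator_code"]: row for row in tables["metadata"]}
```

In the real pipeline the two tables would presumably be stored in the meadow dataset rather than returned in a dict; the sketch only shows the shape of the dependency.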


pabloarosado · 2025-02-06T10:17:09Z

Rated as priority 2 because it's fresh on our minds and we have momentum. It would be more effort in the future.

pabloarosado · 2025-02-06T16:02:41Z

Hi @Marigold thanks for summarizing the main conclusions in this issue. I'm waiting for others' feedback on this thread to write back to WDI.
I've added a few comments on the description above. Feel free to take over those tasks, given that you understand the details of this pipeline better. Otherwise I can delve into it in the coming weeks. And let me know if I can help, thanks!

PS: I'll merge my PR later today, to avoid blocking ETL in production.

Marigold self-assigned this Feb 6, 2025

github-actions bot added the needs triage label Feb 6, 2025

pabloarosado added priority 2 - important and removed needs triage labels Feb 6, 2025

pabloarosado assigned pabloarosado and unassigned Marigold Feb 6, 2025

pabloarosado assigned Marigold Feb 6, 2025
