add wiki table to complement number of vehicles per country in transport_data csv #1244
Conversation
[pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
Great contribution @ljansen-iee :D Many thanks, and it is great to have this data gap filled :)
I've added some minor comments; plus I'd advise adding a note on this contribution to doc/release_notes.rst mentioning this PR, for example:
"* Add wikipedia source to transport_data `PR #1244 <https://github.com/pypsa-meets-earth/pypsa-earth/pull/1244>`__"
I'd also kindly ask @Eddy-JV to look at this PR, as he contributed significantly to this section and his opinion would be extremely valuable.
# Drop region names where country column contains list of countries
df = df[df.country.apply(lambda x: isinstance(x, str))]

df = df.drop_duplicates(subset=["country"])
Here we drop duplicates; should we sum instead? Is there a reason?
It could also be left as a TODO, though a log info message would be advisable if duplicates do occur.
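As a minimal sketch of that log info (the df and logger names follow the snippet above and are assumptions, not the PR's code):

```python
# Identify the duplicate country rows before dropping them, so the drop is
# logged rather than silent; `logger` is assumed to be a configured module logger.
duplicated = df.duplicated(subset=["country"], keep="first")
if duplicated.any():
    logger.info(
        "Dropping %d duplicate country entries: %s",
        duplicated.sum(),
        df.loc[duplicated, "country"].unique().tolist(),
    )
df = df[~duplicated]  # equivalent to drop_duplicates(subset=["country"], keep="first")
```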
The drop_duplicates call with keep='first' is only used to keep the WHO GHO entries and to drop the duplicated per-country entries coming from the Wikipedia source. I have now moved the operation into the vehicle function.
Given this, would you still sum the values? Why would we do that?
Indeed, the values per country differ more often than expected. For example, WHO GHO reports 50.6 million registered vehicles for Vietnam, while Wikipedia reports 4.18 million. But I can't say which source (secondary or primary) is more valid or appropriate.
What would you recommend?
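One option could be to at least surface such discrepancies in the log before one source wins; a rough sketch, where df_who, df_wiki and the merge details are hypothetical:

```python
# Warn where the two sources disagree by more than a factor of two on a country.
merged = df_who.merge(df_wiki, on="country", suffixes=("_who", "_wiki"))
ratio = merged["number cars_who"] / merged["number cars_wiki"]
suspicious = merged[(ratio > 2) | (ratio < 0.5)]
if not suspicious.empty:
    logger.warning(
        "Vehicle counts differ by more than 2x between WHO and Wikipedia for: %s",
        suspicious["country"].tolist(),
    )
```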
I now understand the process; it is a bit misleading but works: this function does not clean each source but the whole dataframe.
It implicitly assumes that the WHO entries always come first, so those rows are the ones kept.
For stability and code robustness, it would be better to explicitly select the Wikipedia entries that are missing from the first dataframe; currently, though, it works.
The keep='first' is not explicit either, so even a change in the function's default behaviour could alter the results completely silently in the workflow.
Have you discussed this method with the others?
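A minimal sketch of that explicit selection, assuming hypothetical frames df_who (WHO GHO) and df_wiki (Wikipedia) that both carry a country column:

```python
import pandas as pd


def combine_vehicle_sources(df_who: pd.DataFrame, df_wiki: pd.DataFrame) -> pd.DataFrame:
    # Keep Wikipedia rows only for countries the WHO list does not cover, so the
    # precedence of WHO data is stated explicitly rather than being an artifact
    # of concat order plus drop_duplicates(keep="first").
    missing = ~df_wiki["country"].isin(df_who["country"])
    return pd.concat([df_who, df_wiki[missing]], ignore_index=True)
```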
except Exception as e:
    logger.warning("Failed to read the file: %s", e)
    return pd.DataFrame()
Moving the exception handling to the end may mask errors that occur along the procedure due to a change in the data format, rather than due to failing access to the source.
That may be less desirable; is there a reason for the proposal?
[pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
Hey @davide-f, thanks for welcoming the contribution! :) I'm happy to add it to the release notes and have followed your suggestion.
Hello @ljansen-iee :D
Great contribution and thanks for the comments.
I've added a few comments to align with you; some points may be left for a follow-up if they turn out to be too intensive.
Have you discussed this implementation with others? Apologies, but I couldn't attend many sec meetings recently; by next year the situation is expected to improve :)
)

return df[["Country", "number cars"]]

try:
A nice-to-have would be moving this try/except inside the _download function, exactly around the pd.read_html/pd.read_csv calls.
That would capture the actual failure mode well; what do you think?
To ensure that all countries of interest are available, we could add a final check and, if countries are missing, throw an error [or return an empty dataframe].
The use of try/except can easily silence issues that may be relevant to tackle, such as data changes in the workflow, especially since the filtering relies on keep='first'.
Moreover, if just one of the two sources is offline, the whole method is no longer used.
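A hedged sketch of that shape; the names _download_vehicles_data and check_vehicle_data_coverage are placeholders, not the PR's actual API:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def _download_vehicles_data(url: str) -> pd.DataFrame:
    # Guard only the download/parse step, so that format changes further down
    # the procedure still fail loudly instead of being silenced.
    try:
        return pd.read_html(url)[0]
    except Exception as e:
        logger.warning("Failed to read %s: %s", url, e)
        return pd.DataFrame()


def check_vehicle_data_coverage(df: pd.DataFrame, countries: list) -> None:
    # Final check: raise if any requested country is missing from the result
    # (returning an empty dataframe would be the softer alternative).
    missing = set(countries) - set(df["country"])
    if missing:
        raise ValueError(f"Vehicle data missing for countries: {sorted(missing)}")
```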
Changes proposed in this Pull Request
Currently, the number of registered vehicles per country (part of resources/transport_data.csv) is downloaded from the WHO Global Health Observatory data repository. A few countries are missing from the WHO list (e.g. South Africa, Algeria). This PR adds a Wikipedia table to complement the missing entries.
Checklist
- Newly introduced dependencies are added to envs/environment.yaml and doc/requirements.txt.
- Changes in configuration options are added in all of config.default.yaml and config.tutorial.yaml.
- Add a test config or line additions to test/ (note tests are changing the config.tutorial.yaml).
- Changes in configuration options are documented in doc/configtables/*.csv and line references are adjusted in doc/configuration.rst and doc/tutorial.rst.
- A note for the release notes doc/release_notes.rst is amended in the format of previous release notes, including reference to the requested PR.