
add wiki table to complement number of vehicles per country in transport_data csv #1244

Open · wants to merge 7 commits into base: main

Conversation


@ljansen-iee ljansen-iee commented Dec 19, 2024

Currently, the number of registered vehicles per country (part of resources/transport_data.csv) is downloaded from the WHO Global Health Observatory data repository. A few countries are missing in the WHO list (e.g. South Africa, Algeria).
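For illustration, a rough sketch of such a complementary fetch (the exact Wikipedia page and table index are assumptions, not necessarily what this PR uses):

import pandas as pd

# Assumed URL; the PR may target a different Wikipedia list.
url = (
    "https://en.wikipedia.org/wiki/"
    "List_of_countries_and_territories_by_motor_vehicles_per_capita"
)

# pd.read_html returns every table on the page; the relevant one must be
# picked out and its columns normalised afterwards.
tables = pd.read_html(url)
df_wiki = tables[0]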

Changes proposed in this Pull Request

Checklist

  • I consent to the release of this PR's code under the AGPLv3 license and non-code contributions under CC0-1.0 and CC-BY-4.0.
  • I tested my contribution locally and it seems to work fine:
    • With the changes, the code behaves as expected: the number of vehicles is added for a few more countries. However, it seems that the World Bank's CO2_emission API changed in November and the requested data, CO2 emissions from transport in % of total fuel combustion, is no longer available. Since this is not part of the proposed contribution of this pull request, I'll open a separate issue about it.
  • Code and workflow changes are sufficiently documented.
  • Newly introduced dependencies are added to envs/environment.yaml and doc/requirements.txt.
  • Changes in configuration options are added in all of config.default.yaml and config.tutorial.yaml.
  • Add a test config or line additions to test/ (note tests are changing the config.tutorial.yaml)
  • Changes in configuration options are also documented in doc/configtables/*.csv and line references are adjusted in doc/configuration.rst and doc/tutorial.rst.
  • A note for the release notes doc/release_notes.rst is amended in the format of previous release notes, including reference to the requested PR.

@davide-f davide-f (Member) left a comment

Great contribution @ljansen-iee :D Many thanks and it is great to have data filling :)

I've added some minor comments; I'd also advise adding a note on this contribution to doc/release_notes.rst mentioning this PR, for example:
"* Add wikipedia source to transport_data `PR #1244 <https://github.com/pypsa-meets-earth/pypsa-earth/pull/1244>`__"

I'd also kindly ask @Eddy-JV to look at this PR as he contributed significantly to this section and his opinion would be extremely valuable.

scripts/prepare_transport_data_input.py (resolved comment thread)
# Drop region names where country column contains list of countries
df = df[df.country.apply(lambda x: isinstance(x, str))]

df = df.drop_duplicates(subset=["country"])
@davide-f (Member)

Here we drop; should we sum instead? Is there a reason?
It could also be a TODO, though a log info message may be advisable if that happens.

@ljansen-iee (Author) commented Dec 20, 2024

The drop_duplicates function with keep='first' is only used to keep the WHO GHO entries and drop the duplicate entries per country coming from the Wikipedia source. I have since moved the operation to the vehicle function.

Would you still sum the values based on this information? Why would we do that?
Indeed, the values per country differ more often than expected. For example, WHO GHO reports 50.6 million registered vehicles in Vietnam, while Wikipedia reports 4.18 million. But I can't say which source (secondary or primary) is more valid or appropriate.
What would you recommend?

@davide-f (Member)

I now understand the process; it is a bit misleading but works: this function does not clean each source but the whole dataframe.
It implicitly assumes that, since the WHO entries always come first, those inputs are the ones kept.
For stability/code robustness it would be better to explicitly select those Wikipedia entries that are missing from the first dataframe, yet currently it works.
The keep='first' is not explicit, so even a change in the function's default implementation could change the results completely silently in the workflow.

Have you discussed this method with the others?
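A minimal sketch of the explicit selection described above (df_who, df_wiki and the South Africa value are illustrative stand-ins, not the PR's actual frames):

import pandas as pd

# Hypothetical stand-ins for the two sources; the Vietnam figures are the
# ones quoted above, the South Africa value is a placeholder.
df_who = pd.DataFrame({"country": ["Vietnam"], "number cars": [50_600_000]})
df_wiki = pd.DataFrame(
    {"country": ["Vietnam", "South Africa"], "number cars": [4_180_000, 12_000_000]}
)

# Keep all WHO rows and only those Wikipedia rows whose country is absent
# from the WHO frame, making the precedence explicit instead of relying
# on drop_duplicates(keep="first").
wiki_only = df_wiki[~df_wiki["country"].isin(df_who["country"])]
df = pd.concat([df_who, wiki_only], ignore_index=True)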

Comment on lines 134 to 136
except Exception as e:
    logger.warning("Failed to read the file: %s", e)
    return pd.DataFrame()
@davide-f (Member)

Moving the exception handling to the end may mask errors that happen along the procedure due to a change in the data format, rather than in access to the source.
That may be less desirable; is there a reason for the proposal?

@ljansen-iee (Author)

Hey, good point. I thought it would be more desirable, but hadn't given it much thought. I've limited the exception handling to the download part again, as originally introduced by @Eddy-JV and @GbotemiB.
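Roughly what confining the guard to the download step looks like (function name and logging setup are assumptions, not the PR's exact code):

import logging

import pandas as pd

logger = logging.getLogger(__name__)


def download_wiki_vehicles(url):
    # Guard only the network/parse call; any later error in the cleaning
    # logic still surfaces instead of being silently swallowed.
    try:
        tables = pd.read_html(url)
    except Exception as e:
        logger.warning(f"Failed to read the file from {url}: {e}")
        return pd.DataFrame()
    return tables[0]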

@ljansen-iee (Author) commented

Hey @davide-f, thanks for welcoming the contribution! :) I'm happy to follow the advice to add it to the release notes! I have followed your suggestion.

@davide-f davide-f (Member) left a comment

Hello @ljansen-iee :D
Great contribution and thanks for the comments.

I've added a few comments to align with you; some points may be left for follow-up if deemed too intensive.
Have you discussed this implementation with others? Apologies, but I couldn't attend many sec meetings recently; by next year the situation is expected to improve :)


)

return df[["Country", "number cars"]]

try:
@davide-f (Member)

A nice-to-have would be moving this try/catch inside the _download function, exactly to wrap the pd.read_html/pd.read_csv calls.
That would cleanly capture the main source of failure; what do you think?
To ensure that all countries of interest are available, we could add a final check: if countries are missing, an error can be thrown [or an empty dataframe returned].

The use of try/catch can easily silence issues that may be relevant to tackle, such as data changes in the workflow; that is especially true for the filtering that relies on keep='first'.
Moreover, if just one of the two sources is offline, the whole method is no longer used.
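The suggested final check could look roughly like this (the function name is hypothetical; the "Country" column matches the snippet above):

import logging

logger = logging.getLogger(__name__)


def check_vehicle_coverage(df, countries):
    # Warn (or raise) when requested countries are absent from the merged
    # vehicle table, so silent data losses become visible in the logs.
    missing = set(countries) - set(df["Country"])
    if missing:
        logger.warning(f"No vehicle data for: {sorted(missing)}")
    return missing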
