add wiki table to complement number of vehicles per country in transport_data csv #1244
Conversation
[pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
Great contribution @ljansen-iee :D Many thanks, and it is great to have this data gap filled :)
I've added some minor comments; plus I'd advise adding a note on this contribution to doc/release_notes.rst mentioning this PR, for example:
"* Add wikipedia source to transport_data `PR #1244 <https://github.com/pypsa-meets-earth/pypsa-earth/pull/1244>`__"
I'd also kindly ask @Eddy-JV to look at this PR, as he contributed significantly to this section and his opinion would be extremely valuable.
# Drop region names where country column contains list of countries
df = df[df.country.apply(lambda x: isinstance(x, str))]

df = df.drop_duplicates(subset=["country"])
Here we drop duplicates; should we sum instead? Is there a reason?
It could also be left as a TODO, though a log info message would be advisable if duplicates do occur.
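As a minimal sketch of that log info (the df and logger names follow the snippet above and are assumptions, not the PR's code):

```python
# Identify the duplicate country rows before dropping them, so the drop is
# logged rather than silent; `logger` is assumed to be a configured module logger.
duplicated = df.duplicated(subset=["country"], keep="first")
if duplicated.any():
    logger.info(
        "Dropping %d duplicate country entries: %s",
        duplicated.sum(),
        df.loc[duplicated, "country"].unique().tolist(),
    )
df = df[~duplicated]  # equivalent to drop_duplicates(subset=["country"], keep="first")
```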
The drop_duplicates call with keep='first' is only used to keep the WHO GHO entries and to drop the duplicated per-country entries coming from the Wikipedia source. I have now moved the operation into the vehicle function.
Given this, would you still sum the values? Why would we do that?
Indeed, the values per country differ more often than expected. For example, WHO GHO reports 50.6 million registered vehicles for Vietnam, while Wikipedia reports 4.18 million. But I can't say which source (secondary or primary) is more valid or appropriate.
What would you recommend?
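One option could be to at least surface such discrepancies in the log before one source wins; a rough sketch, where df_who, df_wiki and the merge details are hypothetical:

```python
# Warn where the two sources disagree by more than a factor of two on a country.
merged = df_who.merge(df_wiki, on="country", suffixes=("_who", "_wiki"))
ratio = merged["number cars_who"] / merged["number cars_wiki"]
suspicious = merged[(ratio > 2) | (ratio < 0.5)]
if not suspicious.empty:
    logger.warning(
        "Vehicle counts differ by more than 2x between WHO and Wikipedia for: %s",
        suspicious["country"].tolist(),
    )
```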
I now understand the process; it is a bit misleading but works: this function does not clean each source but the whole dataframe.
It implicitly assumes that the WHO entries always come first, so those rows are the ones kept.
For stability and code robustness, it would be better to explicitly select the Wikipedia entries that are missing from the first dataframe; currently, though, it works.
The keep='first' is not explicit either, so even a change in the function's default behaviour could alter the results completely silently in the workflow.
Have you discussed this method with the others?
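A minimal sketch of that explicit selection, assuming hypothetical frames df_who (WHO GHO) and df_wiki (Wikipedia) that both carry a country column:

```python
import pandas as pd


def combine_vehicle_sources(df_who: pd.DataFrame, df_wiki: pd.DataFrame) -> pd.DataFrame:
    # Keep Wikipedia rows only for countries the WHO list does not cover, so the
    # precedence of WHO data is stated explicitly rather than being an artifact
    # of concat order plus drop_duplicates(keep="first").
    missing = ~df_wiki["country"].isin(df_who["country"])
    return pd.concat([df_who, df_wiki[missing]], ignore_index=True)
```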
except Exception as e:
    logger.warning("Failed to read the file: %s", e)
    return pd.DataFrame()
Moving the exception handling to the end may mask errors that occur along the procedure due to a change in the data format, rather than due to failing access to the source.
That may be less desirable; is there a reason for the proposal?
[pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
Hey @davide-f, thanks for welcoming the contribution! :) I'm happy to add it to the release notes and have followed your suggestion.
Hello @ljansen-iee :D
Great contribution and thanks for the comments.
I've added a few comments to align with you; some points may be left for a follow-up if they turn out to be too intensive.
Have you discussed this implementation with others? Apologies, but I couldn't attend many sec meetings recently; by next year the situation is expected to improve :)
)

return df[["Country", "number cars"]]

try:
A nice-to-have would be moving this try/except inside the _download function, exactly around the pd.read_html/pd.read_csv calls.
That would capture the actual failure mode well; what do you think?
To ensure that all countries of interest are available, we could add a final check and, if countries are missing, throw an error [or return an empty dataframe].
The use of try/except can easily silence issues that may be relevant to tackle, such as data changes in the workflow, especially since the filtering relies on keep='first'.
Moreover, if just one of the two sources is offline, the whole method is no longer used.
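A hedged sketch of that shape; the names _download_vehicles_data and check_vehicle_data_coverage are placeholders, not the PR's actual API:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def _download_vehicles_data(url: str) -> pd.DataFrame:
    # Guard only the download/parse step, so that format changes further down
    # the procedure still fail loudly instead of being silenced.
    try:
        return pd.read_html(url)[0]
    except Exception as e:
        logger.warning("Failed to read %s: %s", url, e)
        return pd.DataFrame()


def check_vehicle_data_coverage(df: pd.DataFrame, countries: list) -> None:
    # Final check: raise if any requested country is missing from the result
    # (returning an empty dataframe would be the softer alternative).
    missing = set(countries) - set(df["country"])
    if missing:
        raise ValueError(f"Vehicle data missing for countries: {sorted(missing)}")
```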
Changes proposed in this Pull Request
Currently, the number of registered vehicles per country (part of resources/transport_data.csv) is downloaded from the WHO Global Health Observatory data repository. A few countries are missing from the WHO list (e.g. South Africa, Algeria). This PR adds a Wikipedia table to complement the missing entries.
Checklist
- Newly introduced dependencies are added to envs/environment.yaml and doc/requirements.txt.
- Changes in configuration options are added in all of config.default.yaml and config.tutorial.yaml.
- Add a test config or line additions to test/ (note tests are changing the config.tutorial.yaml).
- Changes in configuration options are documented in doc/configtables/*.csv and line references are adjusted in doc/configuration.rst and doc/tutorial.rst.
- A note for the release notes doc/release_notes.rst is amended in the format of previous release notes, including reference to the requested PR.