Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add wiki table to complement number of vehicles per country in transport_data csv #1244
base: main
Are you sure you want to change the base?
add wiki table to complement number of vehicles per country in transport_data csv #1244
Changes from 2 commits
e084440
74d6732
b8746c0
4ef4556
c665309
438e4db
ff419fa
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we drop, should we sum instead? is there a reason?
It can be a TODO also, though a log info may be advisable if that happens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The drop_duplicates function with keep='first' is only used to keep the WHO GHO list entries and delete the duplicate entries per country in the Wikipedia source. -> I have moved the operation to the vehicle function.
Would you still sum the values based on this information? Why would we do that?
Indeed, the values per country are different more often than expected. For example, WHO-GHO reports 50.6 million registered vehicles in Vietnam, while Wikipedia reports 4.18 million. But I can't say which sources (secondary or primary) are more valid or appropriate.
What would you recommend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now understand the process; it is a bit misleading but works: this function does not clean each source but the whole dataframe.
It implicitly assumes that as WHO are always first, then those inputs are kept first.
It would be better to select those inputs from Wikipedia that are missing in the first dataframe for stability/code robustness, yet currently it works.
The keep='first' is not explicit, so even the change in the default implementation of the function can lead to change of the results that would be completely silent in the workflow.
Have you discussed this method with the others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nice-to-have would be moving this try/catch inside the _download function exactly to filter the pd.read_html/pd.read_csv functions.
That allows to well capture the main issue of the problem, what do you think?
To ensure that all countries of interest are available, we can add a final check and if countries are missing, an error can be thrown [or an empty dataframe]
The use of try/catch can easily silent issues that may be relevant to tackle, such as data changes in the workflow.
That especially in the case of the filtering that relies on keep='first'.
Moreover, if just one of the 2 sources is offline, the whole method is no longer used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving the exception at the end may mask some errors that happen along the procedure due to change in the data format, rather than access to the source.
That may be less desirable, is there a reason for the proposal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, good point, I thought it would be more desirable, but hadn't given it much thought. I've limited the exception handling to the download part again - as originally introduced by @Eddy-JV and @GbotemiB.