add wiki table to complement number of vehicles per country in transport_data csv #1244

Open · wants to merge 7 commits into base: main
1 change: 1 addition & 0 deletions doc/release_notes.rst
@@ -15,6 +15,7 @@ This part of documentation collects descriptive release notes to capture the mai

* Include option in the config to allow for custom airport data `PR #1241 <https://github.com/pypsa-meets-earth/pypsa-earth/pull/1241>`__

* Add wikipedia source to transport_data `PR #1244 <https://github.com/pypsa-meets-earth/pypsa-earth/pull/1244>`__

**Minor Changes and bug-fixing**

159 changes: 92 additions & 67 deletions scripts/prepare_transport_data_input.py
@@ -12,56 +12,93 @@
import pandas as pd
from _helpers import BASE_DIR

# logger = logging.getLogger(__name__)
logger = logging.getLogger(__name__)


def download_number_of_vehicles():
def _add_iso2_code_per_country_and_clean_data(df):
"""
Converts 'Country' names to ISO2 codes in a new 'country' column.
Cleans the DataFrame by removing rows with invalid 'country' values.
"""
Downloads the Number of registered vehicles as .csv File.

The following csv file was downloaded from the webpage
https://apps.who.int/gho/data/node.main.A995
as a .csv file.
cc = coco.CountryConverter()
df["country"] = cc.pandas_convert(
series=pd.Series(df["Country"]), to="ISO2", not_found="not found"
)

df = df[df.country != "not found"]

# Drop region names where country column contains list of countries
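# (coco can return a list of ISO2 codes when one name maps to several
# countries, e.g. a regional aggregate; hence the isinstance check below)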
df = df[df.country.apply(lambda x: isinstance(x, str))]

df = df.drop_duplicates(subset=["country"])
Member:

Here we drop; should we sum instead? Is there a reason?
It could also be a TODO, though a log info message may be advisable if that happens.

@ljansen-iee (Author), Dec 20, 2024:

The drop_duplicates call with keep='first' is only used to keep the WHO GHO list entries and delete the duplicate entries per country from the Wikipedia source. -> I have moved the operation to the vehicle function.

Would you still sum the values based on this information? Why would we do that?
Indeed, the values per country differ more often than expected. For example, WHO GHO reports 50.6 million registered vehicles in Vietnam, while Wikipedia reports 4.18 million. But I can't say which source (secondary or primary) is more valid or appropriate.
What would you recommend?

Member:

I now understand the process; it is a bit misleading but works: this function does not clean each source but the whole dataframe.
It implicitly assumes that because the WHO rows always come first, those entries are the ones kept.
For stability and code robustness it would be better to explicitly select the Wikipedia entries that are missing from the first dataframe; currently it works nonetheless.
The keep='first' is not explicit, so even a change in the function's default behaviour could change the results completely silently in the workflow.

Have you discussed this method with the others?
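A minimal sketch of that explicit selection, assuming two already-cleaned frames gho_df and wiki_df (hypothetical names) that both carry 'country' and 'number cars' columns:

def combine_vehicle_sources(gho_df, wiki_df):
    # Keep every WHO GHO row; take from Wikipedia only those countries
    # that the WHO table lacks, instead of relying on concat order plus
    # drop_duplicates(keep="first").
    missing = ~wiki_df["country"].isin(gho_df["country"])
    return pd.concat([gho_df, wiki_df[missing]], ignore_index=True)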


return df


def download_number_of_vehicles():
"""
Downloads and returns the number of registered vehicles
as tabular data from WHO and Wikipedia.

The csv data from the WHO website is imported
from 'https://apps.who.int/gho/data/node.main.A995'.
A few countries are missing in the WHO list (e.g. South Africa, Algeria).
Therefore, the number of vehicles per country table from Wikipedia
is also imported for completion (priority 2):
'https://en.wikipedia.org/wiki/List_of_countries_and_territories_by_motor_vehicles_per_capita'.
"""
fn = "https://apps.who.int/gho/athena/data/GHO/RS_194?filter=COUNTRY:*&ead=&x-sideaxis=COUNTRY;YEAR;DATASOURCE&x-topaxis=GHO&profile=crosstable&format=csv"
storage_options = {"User-Agent": "Mozilla/5.0"}

# Read the 'Data' sheet directly from the csv file at the provided URL
def _download_vehicles_data_from_gho():
url = "https://apps.who.int/gho/athena/data/GHO/RS_194?filter=COUNTRY:*&ead=&x-sideaxis=COUNTRY;YEAR;DATASOURCE&x-topaxis=GHO&profile=crosstable&format=csv"
storage_options = {"User-Agent": "Mozilla/5.0"}
df = pd.read_csv(url, storage_options=storage_options, encoding="utf8")

df.rename(
columns={
"Countries, territories and areas": "Country",
"Number of registered vehicles": "number cars",
},
inplace=True,
)

df["number cars"] = df["number cars"].str.replace(" ", "").replace("", np.nan)

df = df.dropna(subset=["number cars"])

return df[["Country", "number cars"]]

def _download_vehicles_data_from_wiki():
url = "https://en.wikipedia.org/wiki/List_of_countries_and_territories_by_motor_vehicles_per_capita"
df = pd.read_html(url)[0]

df.rename(
columns={"Location": "Country", "Vehicles": "number cars"}, inplace=True
)

return df[["Country", "number cars"]]

try:
Member:

A nice-to-have would be moving this try/except inside the _download functions, exactly to wrap the pd.read_html/pd.read_csv calls.
That would capture the main source of the problem well; what do you think?
To ensure that all countries of interest are available, we could add a final check and, if countries are missing, throw an error [or return an empty dataframe].

A broad try/except can easily silence issues that may be relevant to tackle, such as data changes in the workflow.
That holds especially for the filtering that relies on keep='first'.
Moreover, if just one of the two sources is offline, the whole method is no longer used.
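A minimal sketch of that suggestion, guarding only the network call in each helper so that one offline source does not disable the other (a hedged variant, not the PR's actual code):

def _download_vehicles_data_from_wiki():
    url = "https://en.wikipedia.org/wiki/List_of_countries_and_territories_by_motor_vehicles_per_capita"
    try:
        # Only the download/parse step is guarded; later cleaning bugs stay visible.
        df = pd.read_html(url)[0]
    except Exception as e:
        logger.warning(f"Failed to read the Wikipedia vehicle table: {e}")
        return pd.DataFrame(columns=["Country", "number cars"])
    df.rename(columns={"Location": "Country", "Vehicles": "number cars"}, inplace=True)
    return df[["Country", "number cars"]]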

Nbr_vehicles_csv = pd.read_csv(
fn, storage_options=storage_options, encoding="utf8"
nbr_vehicles = pd.concat(
[_download_vehicles_data_from_gho(), _download_vehicles_data_from_wiki()],
ignore_index=True,
)
print("File read successfully.")
except Exception as e:
print("Failed to read the file:", e)
logger.warning(
f"Failed to read the file: {e}\n"
"Returning an empty df and falling back on the hard-coded data."
)
return pd.DataFrame()

Nbr_vehicles_csv = Nbr_vehicles_csv.rename(
columns={
"Countries, territories and areas": "Country",
"Number of registered vehicles": "number cars",
}
)
nbr_vehicles["number cars"] = nbr_vehicles["number cars"].astype(int)

# Add ISO2 country code for each country
cc = coco.CountryConverter()
Country = pd.Series(Nbr_vehicles_csv["Country"])
Nbr_vehicles_csv["country"] = cc.pandas_convert(
series=Country, to="ISO2", not_found="not found"
)
nbr_vehicles = _add_iso2_code_per_country_and_clean_data(nbr_vehicles)

# # Remove spaces, Replace empty values with NaN
Nbr_vehicles_csv["number cars"] = (
Nbr_vehicles_csv["number cars"].str.replace(" ", "").replace("", np.nan)
)

# Drop rows with NaN values in 'number cars'
Nbr_vehicles_csv = Nbr_vehicles_csv.dropna(subset=["number cars"])

# convert the 'number cars' to integer
Nbr_vehicles_csv["number cars"] = Nbr_vehicles_csv["number cars"].astype(int)
# Drops duplicates and keeps WHO-GHO data in case of duplicates
nbr_vehicles = nbr_vehicles.drop_duplicates(subset=["country"], keep="first")

return Nbr_vehicles_csv
return nbr_vehicles


def download_CO2_emissions():
@@ -71,6 +108,7 @@ def download_CO2_emissions():
The dataset is downloaded from the following link: https://data.worldbank.org/indicator/EN.CO2.TRAN.ZS?view=map
The data is only available up to the year 2014. # TODO: Maybe search for more recent years.
"""

url = (
"https://api.worldbank.org/v2/en/indicator/EN.CO2.TRAN.ZS?downloadformat=excel"
)
@@ -90,21 +128,21 @@ def download_CO2_emissions():
# Calculate efficiency based on CO2 emissions from transport (% of total fuel combustion)
CO2_emissions["average fuel efficiency"] = (100 - CO2_emissions["2014"]) / 100

# Add ISO2 country code for each country
CO2_emissions = CO2_emissions.dropna(subset=["average fuel efficiency"])

CO2_emissions = CO2_emissions.rename(columns={"Country Name": "Country"})
cc = coco.CountryConverter()
CO2_emissions.loc[:, "country"] = cc.pandas_convert(
series=CO2_emissions["Country"], to="ISO2", not_found="not found"
)

# Drop region names that have no ISO2:
CO2_emissions = CO2_emissions[CO2_emissions.country != "not found"]
CO2_emissions = _add_iso2_code_per_country_and_clean_data(CO2_emissions)

# Drop region names where country column contains list of countries
CO2_emissions = CO2_emissions[
CO2_emissions.country.apply(lambda x: isinstance(x, str))
]
return CO2_emissions
CO2_emissions["average fuel efficiency"] = CO2_emissions[
"average fuel efficiency"
].astype(float)

CO2_emissions.loc[:, "average fuel efficiency"] = CO2_emissions[
"average fuel efficiency"
].round(3)

return CO2_emissions[["country", "average fuel efficiency"]]


if __name__ == "__main__":
@@ -120,35 +158,22 @@ def download_CO2_emissions():
# store_path_data = Path.joinpath(Path().cwd(), "data")
# country_list = country_list_to_geofk(snakemake.config["countries"])'

# Downloaded and prepare vehicles_csv:
vehicles_csv = download_number_of_vehicles().copy()
# Download and prepare vehicles data:
vehicles_df = download_number_of_vehicles().copy()

# Downloaded and prepare CO2_emissions_csv:
CO2_emissions_csv = download_CO2_emissions().copy()
# Download and prepare CO2_emissions data:
CO2_emissions_df = download_CO2_emissions().copy()

if vehicles_csv.empty or CO2_emissions_csv.empty:
if vehicles_df.empty or CO2_emissions_df.empty:
# In case one of the urls is not working, we can use the hard-coded data
src = BASE_DIR + "/data/temp_hard_coded/transport_data.csv"
dest = snakemake.output.transport_data_input
shutil.copy(src, dest)
else:
# Join the DataFrames by the 'country' column
merged_df = pd.merge(vehicles_csv, CO2_emissions_csv, on="country")
merged_df = pd.merge(vehicles_df, CO2_emissions_df, on="country")
merged_df = merged_df[["country", "number cars", "average fuel efficiency"]]

# Drop rows with NaN values in 'average fuel efficiency'
merged_df = merged_df.dropna(subset=["average fuel efficiency"])

# Convert the 'average fuel efficiency' to float
merged_df["average fuel efficiency"] = merged_df[
"average fuel efficiency"
].astype(float)

# Round the 'average fuel efficiency' to three decimal places
merged_df.loc[:, "average fuel efficiency"] = merged_df[
"average fuel efficiency"
].round(3)

# Save the merged DataFrame to a CSV file
merged_df.to_csv(
snakemake.output.transport_data_input,