
Command-Line Interface for Retrieving Data #1366

Open · wants to merge 5 commits into base: main
Conversation

@virio-andreyana (Contributor) commented Mar 4, 2025

Closes #1361

Changes proposed in this Pull Request

Users now have an alternative option to retrieve_databundle_light: downloading the databundles through a command-line interface (CLI). To do that, run:

python scripts/_cli.py

[screenshot of the CLI output]

In this example, some data is missing in bundle_data_earth and bundle_hydrobasins.

  • both databundles can be added simply by writing all
  • individual databundles can be selected for download
  • databundles with direct, zenodo or gdrive sources can be downloaded manually using the given URL

Once an option is selected, the bundle list is passed to retrieve_databundle_light to retrieve those specific files. There is no need to interact with snakemake for this.

Using the Python package rich, the command-line output gets a good-looking visualization that is also suitable for computer clusters. The rich package has to be installed first for this to work.
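For illustration only, a minimal sketch of such a rich-based selection prompt; the names used here (e.g. missing_bundles, selection) are placeholders and not the actual implementation proposed in this PR:

# Illustrative sketch of a rich-based selection loop; names are placeholders,
# not the code proposed in this PR.
from rich.console import Console
from rich.markdown import Markdown
from rich.prompt import Prompt

console = Console()
missing_bundles = ["bundle_data_earth", "bundle_hydrobasins"]  # example data

console.print(Markdown("**Missing databundles:** " + ", ".join(missing_bundles)))

choice = Prompt.ask("Type a databundle name, 'all', or press ENTER to quit", default="")
if choice == "all":
    selection = missing_bundles
elif choice:
    selection = [choice]
else:
    selection = []

# The selected bundles would then be handed over to retrieve_databundle_light.
console.print(f"Selected: {selection}")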

Notes:

  • Since cutouts all have the same name, the current implementation cannot differentiate whether an existing cutout is the correct one or not. Maybe I need to validate it through metadata. Suggestions needed.
  • Should I expand this feature to other retrievable data such as download_osm_data and retrieve_cost_data?

Checklist

  • I consent to the release of this PR's code under the AGPLv3 license and non-code contributions under CC0-1.0 and CC-BY-4.0.
  • I tested my contribution locally and it seems to work fine.
  • Code and workflow changes are sufficiently documented.
  • Newly introduced dependencies are added to envs/environment.yaml and doc/requirements.txt.
  • Changes in configuration options are added in all of config.default.yaml and config.tutorial.yaml.
  • Add a test config or line additions to test/ (note tests are changing the config.tutorial.yaml)
  • Changes in configuration options are also documented in doc/configtables/*.csv and line references are adjusted in doc/configuration.rst and doc/tutorial.rst.
  • A note for the release notes doc/release_notes.rst is amended in the format of previous release notes, including reference to the requested PR.

@ekatef (Member) left a comment


Thanks a lot @virio-andreyana! As discussed, that is a really crucial part and addressing it would very much improve usability. Personally, I like the concept you suggest. Added some comments on the implementation.

Given that the PR addresses a major usability issue, it would be great to ensure that we don't miss anything important. @energyLS @hazemakhalek @davide-f, do you have any comments on the solution proposed in this PR?

Comment on lines +88 to +90
# Command Line Interface
- rich

Member

Just to be sure we understand everything properly, a couple of technical questions:

  1. by any chance, do you know whether rich keeps its major magic capabilities under all popular operating systems? [Also, it would be great if it could also work in the VSCode terminal]
  2. can it also be installed with conda, not only with pip? Pip seems to be the recommended installation approach, but it can lead to problems when creating the virtual environment.

Contributor (Author)

  1. I've tested it on Linux, VSCode, Windows Terminal and cmd, and even the coloring scheme is still there.
  2. I don't really know how much difference using conda or pip makes. You can change that based on your preference. What's important is that it has minimal dependencies, so it won't impact other packages.

Comment on lines +1 to +5
# -*- coding: utf-8 -*-
# SPDX-FileCopyrightText: PyPSA-Earth and PyPSA-Eur Authors
#
# SPDX-License-Identifier: AGPL-3.0-or-later
import textwrap
Member

Completely support a modular approach: we have a single helpers file to store technical functions, but it's definitely worth using a more differentiated approach. I'd suggest though to keep helpers as a prefix for consistency. What do you think?

scripts/_cli.py Outdated
Comment on lines 13 to 22
configfile = ["config.default.yaml", "configs/bundle_config.yaml", "config.yaml"]

config = {}
for c in configfile:
    if os.path.isfile(c):
        with open(c) as file:
            config_append = yaml.safe_load(file)

        config.update(config_append)

Member

In this fragment we are duplicating the configuration management done in the Snakefile: those lines create a global variable config by consecutively merging the dictionaries from the yaml files. Can we use this variable directly here as well?

Contributor (Author)

The workaround is using the _helper function mock_snakemake("retrieve_databundle_light"). That way you can get the config from Snakefile. But will there be any consequences from this?
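
For context, a minimal sketch of that workaround, assuming mock_snakemake lives in scripts/_helpers and returns a snakemake-like object carrying the merged config, as in the other scripts:

# Sketch of the proposed workaround; assumes mock_snakemake from scripts/_helpers
# builds a snakemake-like object from the Snakefile, as in the other scripts.
from _helpers import mock_snakemake

snakemake = mock_snakemake("retrieve_databundle_light")
config = snakemake.config  # merged config.default.yaml, bundle_config.yaml and config.yaml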

Comment on lines +24 to +34
def console_markdown(markdown, lvl=1):

    console = Console()
    for i in range(lvl):
        markdown = textwrap.dedent(markdown)
    md = Markdown(markdown)

    return console.print(md)


def console_table(dataframe, table_kw={}):
Member

I guess this whole part (up to line 146) is planned to be moved into the _cli file, right? [I recognise that it's a draft, so apologies for a potentially premature comment!]

Contributor (Author)

I've intentionally placed it inside _cli because it's command-line-related. Or should I place it somewhere else, like in _helper?
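
For illustration of the helpers discussed above, a self-contained sketch of what a console_table-style function could look like; this is not the PR's exact implementation, and the example data is made up:

# Illustrative sketch of a console_table-style helper, not the exact code in this PR.
import pandas as pd
from rich.console import Console
from rich.table import Table


def console_table(dataframe, table_kw={}):
    console = Console()
    table = Table(**table_kw)
    for column in dataframe.columns:
        table.add_column(str(column))
    for _, row in dataframe.iterrows():
        table.add_row(*(str(value) for value in row))
    return console.print(table)


# Example usage with made-up bundle data:
df = pd.DataFrame(
    {"bundle": ["bundle_data_earth", "bundle_hydrobasins"], "status": ["missing", "missing"]}
)
console_table(df, table_kw={"title": "Databundle Checklist"})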

scripts/_cli.py Outdated
Comment on lines 136 to 143
Options:

- **check**: update the Databundle Checklist to see if the file is included
- **all**: retrieve all missing databundles, namely **{", ".join(missing_bundles)}**
- **rerun**: retrieve all databundles again, namely **{", ".join(bundles_to_download)}**
- **bundle_...**: retrieve the selected databundles, can be more than one
- Press **ENTER** to end this loop
"""
Member

I love the idea of having such a report! Might it be a good idea to try to make the message more human-readable? The intended audience of this script is people who are starting to use the model and are not aware of our jargon 🙂 Though, that is just a detail which will probably be more relevant when polishing the implementation.

@ekatef (Member) commented Mar 5, 2025

Notes:

  • Since cutouts all have the same name, the current implementation cannot differentiate whether an existing cutout is the correct one or not. Maybe I need to validate it through metadata. Suggestions needed.
    An "optimal" cutout is being selected by get_best_bundles_by_category. This function seeks the smallest cutout for a requested list of countries. So, it should still be accounted for.

When a cutout is being used, we are also checking that it covers the whole requested area with check_cutout_match. Btw, you have made me recognise that it would be good to add this check also in other scripts where a cutout is being used. Though, that's another task.

  • Should I expand this feature to other retrievable data such as download_osm_data and retrieve_cost_data?

Not sure there are any substantial issues with retrieval of OSM and cost data. Also, the toolset we are using for that is quite different, so working on these parts is definitely worth dedicated PRs.

On my side, I'd say that it would be great to further improve the way the console output is managed. We still have a bit too much irrelevant info and warnings in the console, which should be cleaned up, while it may also be a good idea to add some colors to the meaningful console output (e.g. solution status, the objective values). But that is also a bit of a different story.

@virio-andreyana (Contributor, Author)

An "optimal" cutout is being selected by get_best_bundles_by_category. This function seeks for the smallest cutout for a requested list of countries. So, it should be still accounted for.

When a cutout is being used, we are also checking that it covers the whole requested area with check_cutout_match. Btw, you have made me recognise that it would be good to add this check also in others scripts where a cutout is being used. Though, it's another task.

Hmm, get_best_bundles_by_category is good at selecting which cutout to download, but purely based on the file path it cannot check whether the existing cutout is the appropriate one.

On the other hand, check_cutout_match achieves this, but it requires a geojson shape to compare against to begin with, which only becomes available a few workflow steps after retrieve_databundle_light. Unless you want build_shapes to take place before retrieve_databundle_light, it couldn't work.
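
As a purely hypothetical sketch of the metadata-based validation mentioned above, assuming the cutout is a netCDF file with x (longitude) and y (latitude) coordinates and using a made-up bounding box instead of the shapes from build_shapes:

# Hypothetical sketch: check whether an existing cutout covers a requested
# bounding box by reading its coordinate extents; path and numbers are illustrative.
import xarray as xr


def cutout_covers_bbox(cutout_path, x_min, y_min, x_max, y_max):
    with xr.open_dataset(cutout_path) as ds:
        return (
            float(ds["x"].min()) <= x_min
            and float(ds["x"].max()) >= x_max
            and float(ds["y"].min()) <= y_min
            and float(ds["y"].max()) >= y_max
        )


# Example call with a rough, made-up bounding box:
# cutout_covers_bbox("cutouts/cutout-2013-era5.nc", 2.5, 4.0, 15.0, 14.0)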

@virio-andreyana virio-andreyana marked this pull request as ready for review March 5, 2025 14:55
Development: successfully merging this pull request may close the issue "Make data retrieval more robust".