
Command-Line Interface for Retrieving Data #1366

Open · wants to merge 5 commits into base: main
Conversation

@virio-andreyana (Contributor) commented Mar 4, 2025

Closes #1361

Changes proposed in this Pull Request

Users now have an alternative option to retrieve_databundle_light: downloading the databundles through a command-line interface (CLI). To do that, run:

python scripts/_cli.py

[screenshot of the CLI output]

In this example, some data is missing in bundle_data_earth and bundle_hydrobasins.

  • both databundles can be added simply by writing all
  • individual databundles can be selected for download
  • databundles with direct, zenodo or gdrive sources can be downloaded manually using the given URL

Once an option is selected, the bundle list is passed to retrieve_databundle_light to retrieve those specific files. There is no need to interact with snakemake for this.

Using the Python package rich, the command-line output gets a good-looking visualization that is also suitable for computer clusters. The rich package has to be installed first for this to work.
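For illustration only, a minimal sketch of such a rich-based selection prompt; the names used here (e.g. missing_bundles, selection) are placeholders and not the actual implementation proposed in this PR:

# Illustrative sketch of a rich-based selection loop; names are placeholders,
# not the code proposed in this PR.
from rich.console import Console
from rich.markdown import Markdown
from rich.prompt import Prompt

console = Console()
missing_bundles = ["bundle_data_earth", "bundle_hydrobasins"]  # example data

console.print(Markdown("**Missing databundles:** " + ", ".join(missing_bundles)))

choice = Prompt.ask("Type a databundle name, 'all', or press ENTER to quit", default="")
if choice == "all":
    selection = missing_bundles
elif choice:
    selection = [choice]
else:
    selection = []

# The selected bundles would then be handed over to retrieve_databundle_light.
console.print(f"Selected: {selection}")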

Notes:

  • Since cutouts all have the same name, the current implementation cannot differentiate whether an existing cutout is the correct one or not. Maybe I need to validate it through metadata. Suggestions needed.
  • Should I expand this feature to other retrievable data such as download_osm_data and retrieve_cost_data?

Checklist

  • I consent to the release of this PR's code under the AGPLv3 license and non-code contributions under CC0-1.0 and CC-BY-4.0.
  • I tested my contribution locally and it seems to work fine.
  • Code and workflow changes are sufficiently documented.
  • Newly introduced dependencies are added to envs/environment.yaml and doc/requirements.txt.
  • Changes in configuration options are added in all of config.default.yaml and config.tutorial.yaml.
  • Add a test config or line additions to test/ (note tests are changing the config.tutorial.yaml)
  • Changes in configuration options are also documented in doc/configtables/*.csv and line references are adjusted in doc/configuration.rst and doc/tutorial.rst.
  • A note for the release notes doc/release_notes.rst is amended in the format of previous release notes, including reference to the requested PR.

@ekatef (Member) left a comment


Thanks a lot @virio-andreyana! As discussed, that is a really crucial part and addressing it would very much improve usability. Personally, I like the concept you suggest. Added some comments on the implementation.

Given that the PR addresses a major usability issue, it would be great to ensure that we don't miss anything important. @energyLS @hazemakhalek @davide-f, do you have any comments on the solution proposed in this PR?

Comment on lines +88 to +90
# Command Line Interface
- rich

Member

Just to be sure we understand everything properly, a couple of technical questions:

  1. by any chance, do you know whether rich keeps its major magic capabilities under all popular operating systems? [Also, it would be great if it could also work in the VSCode terminal]
  2. can it also be installed with conda, not only with pip? Pip seems to be the recommended installation approach, but it can lead to problems when creating the virtual environment.

Contributor (Author)

  1. I've tested it on Linux, VSCode, Windows Terminal and cmd, and even the coloring scheme is still there.
  2. I don't really know how much difference using conda or pip makes. You can change that based on your preference. What's important is that it has minimal dependencies, so it won't impact other packages.

Comment on lines +1 to +5
# -*- coding: utf-8 -*-
# SPDX-FileCopyrightText: PyPSA-Earth and PyPSA-Eur Authors
#
# SPDX-License-Identifier: AGPL-3.0-or-later
import textwrap
Member

Completely support a modular approach: we have a single helpers file to store technical functions, but it's definitely worth using a more differentiated approach. I'd suggest though to keep helpers as a prefix for consistency. What do you think?

scripts/_cli.py Outdated
Comment on lines 13 to 22
configfile = ["config.default.yaml", "configs/bundle_config.yaml", "config.yaml"]

config = {}
for c in configfile:
    if os.path.isfile(c):
        with open(c) as file:
            config_append = yaml.safe_load(file)

        config.update(config_append)

Member

In this fragment we are duplicating the configuration management done in the Snakefile: those lines create a global variable config by consecutively merging the dictionaries from the yaml files. Can we use this variable directly here as well?

Contributor (Author)

The workaround is using the _helper function mock_snakemake("retrieve_databundle_light"). That way you can get the config from Snakefile. But will there be any consequences from this?
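
For context, a minimal sketch of that workaround, assuming mock_snakemake lives in scripts/_helpers and returns a snakemake-like object carrying the merged config, as in the other scripts:

# Sketch of the proposed workaround; assumes mock_snakemake from scripts/_helpers
# builds a snakemake-like object from the Snakefile, as in the other scripts.
from _helpers import mock_snakemake

snakemake = mock_snakemake("retrieve_databundle_light")
config = snakemake.config  # merged config.default.yaml, bundle_config.yaml and config.yaml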

Comment on lines +24 to +34
def console_markdown(markdown, lvl=1):

    console = Console()
    for i in range(lvl):
        markdown = textwrap.dedent(markdown)
    md = Markdown(markdown)

    return console.print(md)


def console_table(dataframe, table_kw={}):
Member

I guess this whole part (up to line 146) is planned to be moved into the _cli file, right? [I recognise that it's a draft, so apologies for a potentially premature comment!]

Contributor (Author)

I've intentionally placed it inside _cli because it's command-line-related. Or should I place it somewhere else, like in _helper?
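
For illustration of the helpers discussed above, a self-contained sketch of what a console_table-style function could look like; this is not the PR's exact implementation, and the example data is made up:

# Illustrative sketch of a console_table-style helper, not the exact code in this PR.
import pandas as pd
from rich.console import Console
from rich.table import Table


def console_table(dataframe, table_kw={}):
    console = Console()
    table = Table(**table_kw)
    for column in dataframe.columns:
        table.add_column(str(column))
    for _, row in dataframe.iterrows():
        table.add_row(*(str(value) for value in row))
    return console.print(table)


# Example usage with made-up bundle data:
df = pd.DataFrame(
    {"bundle": ["bundle_data_earth", "bundle_hydrobasins"], "status": ["missing", "missing"]}
)
console_table(df, table_kw={"title": "Databundle Checklist"})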

scripts/_cli.py Outdated
Comment on lines 136 to 143
Options:

- **check**: update the Databundle Checklist to see if the file is included
- **all**: retrieve all missing databundles, namely **{", ".join(missing_bundles)}**
- **rerun**: retrieve all databundles again, namely **{", ".join(bundles_to_download)}**
- **bundle_...**: retrieve the selected databundles, can be more than one
- Press **ENTER** to end this loop
"""
Member

I love the idea of having such a report! Might it be a good idea to try to make the message more human-readable? The intended audience of this script is people who are starting to use the model and are not aware of our jargon 🙂 Though, that is just a detail which will probably be more relevant when polishing the implementation.

@ekatef (Member) commented Mar 5, 2025

Notes:

  • Since cutouts all have the same name, the current implementation cannot differentiate whether an existing cutout is the correct one or not. Maybe I need to validate it through metadata. Suggestions needed.
    An "optimal" cutout is being selected by get_best_bundles_by_category. This function seeks the smallest cutout for a requested list of countries. So, it should still be accounted for.

When a cutout is being used, we are also checking that it covers the whole requested area with check_cutout_match. Btw, you have made me recognise that it would be good to add this check also in other scripts where a cutout is being used. Though, that's another task.

  • Should I expand this feature to other retrievable data such as download_osm_data and retrieve_cost_data?

Not sure there are any substantial issues with retrieval of OSM and cost data. Also, the toolset we are using for that is quite different, so working on these parts is definitely worth dedicated PRs.

On my side, I'd say that it would be great to further improve the way the console output is managed. We still have a bit too much irrelevant info and warnings in the console, which should be cleaned up, while it may also be a good idea to add some colors to the meaningful console output (e.g. solution status, the objective values). But that is also a bit of a different story.

@virio-andreyana (Contributor, Author)

An "optimal" cutout is being selected by get_best_bundles_by_category. This function seeks for the smallest cutout for a requested list of countries. So, it should be still accounted for.

When a cutout is being used, we are also checking that it covers the whole requested area with check_cutout_match. Btw, you have made me recognise that it would be good to add this check also in others scripts where a cutout is being used. Though, it's another task.

Hmm, get_best_bundles_by_category is good at selecting which cutout to download, but purely based on the file path it cannot check whether the existing cutout is the appropriate one.

On the other hand, check_cutout_match achieves this, but it requires a geojson shape to compare against to begin with, which only becomes available a few workflow steps after retrieve_databundle_light. Unless you want build_shapes to take place before retrieve_databundle_light, it couldn't work.
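
As a purely hypothetical sketch of the metadata-based validation mentioned above, assuming the cutout is a netCDF file with x (longitude) and y (latitude) coordinates and using a made-up bounding box instead of the shapes from build_shapes:

# Hypothetical sketch: check whether an existing cutout covers a requested
# bounding box by reading its coordinate extents; path and numbers are illustrative.
import xarray as xr


def cutout_covers_bbox(cutout_path, x_min, y_min, x_max, y_max):
    with xr.open_dataset(cutout_path) as ds:
        return (
            float(ds["x"].min()) <= x_min
            and float(ds["x"].max()) >= x_max
            and float(ds["y"].min()) <= y_min
            and float(ds["y"].max()) >= y_max
        )


# Example call with a rough, made-up bounding box:
# cutout_covers_bbox("cutouts/cutout-2013-era5.nc", 2.5, 4.0, 15.0, 14.0)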

@virio-andreyana virio-andreyana marked this pull request as ready for review March 5, 2025 14:55
Development: successfully merging this pull request may close the issue "Make data retrieval more robust".