Add wrapper for csg #286

Status: Open, wants to merge 4 commits into base `main`
1 change: 1 addition & 0 deletions changelog/286.improvement.md
@@ -0,0 +1 @@
Added a wrapper for the csg `compose` function to handle input data preparation (removing data that is not needed in the process) and output data handling (setting coords and metadata).
3 changes: 3 additions & 0 deletions docs/source/api/index.rst
@@ -63,6 +63,8 @@ source priorities and matching algorithms.
csg.StrategyUnableToProcess
csg.SubstitutionStrategy
csg.compose
csg.create_composite_source
csg.set_priority_coords


.. currentmodule:: xarray
@@ -96,6 +98,7 @@ Methods
DataArray.pr.add_aggregates_coordinates
DataArray.pr.any
DataArray.pr.combine_first
DataArray.pr.convert
DataArray.pr.convert_to_gwp
DataArray.pr.convert_to_gwp_like
DataArray.pr.convert_to_mass
65 changes: 63 additions & 2 deletions docs/source/usage/csg.md
@@ -30,6 +30,8 @@ When no missing information is left in the result timeseries, the algorithm terminates.
It also terminates if all source timeseries are used, even if missing information is
left.

## The `compose` function

The core function to use is the {py:func}`primap2.csg.compose` function.
It needs the following input:

@@ -111,8 +113,7 @@ priority_definition = primap2.csg.PriorityDefinition(
```

```{code-cell} ipython3
# Currently, there is only one strategy implemented, so we use
# the empty selector {}, which matches everything, to configure
# We use the empty selector {}, which matches everything, to configure
# to use the substitution strategy for all timeseries.
strategy_definition = primap2.csg.StrategyDefinition(
strategies=[({}, primap2.csg.SubstitutionStrategy())]
@@ -125,6 +126,7 @@ result_ds = primap2.csg.compose(
priority_definition=priority_definition,
strategy_definition=strategy_definition,
progress_bar=None, # The animated progress bar is useless in the generated documentation

)

result_ds
@@ -162,3 +164,62 @@ category 1 "lowpop" was preferred.
For category 0, the initial timeseries did not contain NaNs, so no filling was needed.
For category 1, there was information missing in the initial timeseries, so the
lower-priority timeseries was used to fill the holes.
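
A rough way to verify this yourself (a sketch; the exact variable names depend on the
input data used above) is to count the NaN values that remain in the composed result:

```{code-cell} ipython3
# Count remaining NaN values per data variable in the composed result.
# After filling, only gaps which no source could cover should remain.
{name: int(da.isnull().sum()) for name, da in result_ds.data_vars.items()}
```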

## The `create_composite_source` wrapper function

The {py:func}`primap2.csg.compose` function creates a composite time series according to
the given priorities and strategies, but it does not take care of pre- and postprocessing
of the data. It will carry along unnecessary data, and the resulting dataset will be missing
the priority coordinates. The {py:func}`primap2.csg.create_composite_source` function takes care
of these steps: it prepares the input data and completes the output data into a primap2 dataset
with all desired dimensions and metadata.

The function takes the same inputs as {py:func}`primap2.csg.compose`, plus additional input to
define the pre- and postprocessing:

* **result_prio_coords** Defines the values for the priority coordinates in the output dataset. As the
  priority coordinates differ between the input sources, there is no canonical value
  for the result, so it has to be defined explicitly.
* **metadata** Sets metadata values such as title and references.

```{code-cell} ipython3
result_prio_coords = {
"source": {"value": "PRIMAP-test"},
"scenario": {"value": "HISTORY", "terminology": "PRIMAP"},
}
metadata = {"references": "test-data", "contact": "[email protected]"}

```

* **limit_coords** Optional parameter to remove data for coordinate values not needed for the
  composition from the input data. The time coordinate is treated separately.
* **time_range** Optional parameter to limit the time coverage of the input data. Currently,
  only a `(year_from, year_to)` tuple is supported.


```{code-cell} ipython3
limit_coords = {'area (ISO3)': ['COL', 'ARG', 'MEX']}
time_range = ("2000", "2010")

```
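
Internally, the wrapper currently translates the `time_range` tuple into a yearly pandas
date index and selects those time steps from the input data, roughly equivalent to:

```{code-cell} ipython3
import pandas as pd

# What time_range=("2000", "2010") amounts to inside the wrapper:
# a year-start date index from 2000 to 2010, both years included.
pd.date_range("2000", "2010", freq="YS", inclusive="both")
```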

```{code-cell} ipython3
complete_result_ds = primap2.csg.create_composite_source(
input_ds,
priority_definition=priority_definition,
strategy_definition=strategy_definition,
result_prio_coords=result_prio_coords,
limit_coords=limit_coords,
time_range=time_range,
metadata=metadata,
progress_bar=None,
)

complete_result_ds
```
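
To check the postprocessing, you can inspect the result with plain xarray: the priority
coordinates defined in `result_prio_coords` show up as coordinates, and the `metadata`
values end up in the dataset attributes.

```{code-cell} ipython3
# Priority coordinates set by the wrapper (e.g. "source", "scenario (PRIMAP)")
print(complete_result_ds.coords)

# Metadata such as "references" and "contact" is stored in the attrs
print(complete_result_ds.attrs)
```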


## Filling strategies

Currently, the following filling strategies are implemented; they can be combined, as sketched below:

* Global least squares matching: {py:class}`primap2.csg.GlobalLSStrategy`
* Straight substitution: {py:class}`primap2.csg.SubstitutionStrategy`
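
As a sketch (not taken from the example above), both strategies can be listed for the same
selector, so that least squares matching is tried first and plain substitution is used as a
fallback for timeseries it cannot process:

```{code-cell} ipython3
# Sketch: try global least squares matching first; if it raises
# StrategyUnableToProcess (e.g. no overlap between the timeseries),
# fall back to straight substitution.
fallback_strategy_definition = primap2.csg.StrategyDefinition(
    strategies=[
        ({}, primap2.csg.GlobalLSStrategy()),
        ({}, primap2.csg.SubstitutionStrategy()),
    ]
)
fallback_strategy_definition
```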
3 changes: 3 additions & 0 deletions primap2/csg/__init__.py
@@ -13,6 +13,7 @@
from ._strategies.exceptions import StrategyUnableToProcess
from ._strategies.global_least_squares import GlobalLSStrategy
from ._strategies.substitution import SubstitutionStrategy
from ._wrapper import create_composite_source, set_priority_coords

__all__ = [
"compose",
@@ -21,4 +22,6 @@
"SubstitutionStrategy",
"StrategyUnableToProcess",
"GlobalLSStrategy",
"create_composite_source",
"set_priority_coords",
]
132 changes: 132 additions & 0 deletions primap2/csg/_wrapper.py
@@ -0,0 +1,132 @@
import pandas as pd
import tqdm
import xarray as xr

from ._compose import compose
from ._models import PriorityDefinition, StrategyDefinition


def set_priority_coords(
ds: xr.Dataset,
dims: dict[str, dict[str, str]],
) -> xr.Dataset:
"""Set values for priority coordinates in output dataset

coords: Dictionary
Format is 'name': {'value': value, 'terminology': terminology}
terminology is optional

"""
for dim in dims.keys():
if "terminology" in dims[dim].keys():
terminology = dims[dim]["terminology"]
else:
terminology = None
ds = ds.pr.expand_dims(dim=dim, coord_value=dims[dim]["value"], terminology=terminology)

return ds
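# Usage sketch (illustration only, not part of this module): add fixed
# "source" and "scenario" coordinates to a composed dataset; "terminology"
# is optional and only given for the scenario dimension here.
#
#     result_ds = set_priority_coords(
#         result_ds,
#         {
#             "source": {"value": "PRIMAP-test"},
#             "scenario": {"value": "HISTORY", "terminology": "PRIMAP"},
#         },
#     )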


def create_composite_source(
input_ds: xr.Dataset,
priority_definition: PriorityDefinition,
strategy_definition: StrategyDefinition,
result_prio_coords: dict[str, dict[str, str]],
limit_coords: dict[str, str | list[str]] | None = None,
time_range: tuple[str, str] | None = None,
metadata: dict[str, str] | None = None,
progress_bar: type[tqdm.tqdm] | None = tqdm.tqdm,
) -> xr.Dataset:
"""Create a composite data source

This is a wrapper around `primap2.csg.compose` that prepares the input data and sets result
values for the priority coordinates.


Parameters
----------
input_ds
Dataset containing all input data
priority_definition
Defines the priorities to select timeseries from the input data. Priorities
are formed by a list of selections and are used "from left to right", where the
first matching selection has the highest priority. Each selection has to specify
values for all priority dimensions (so that exactly one timeseries is selected
from the input data), but can also specify other dimensions. That way it is,
e.g., possible to define a different priority for a specific country by listing
it early (i.e. with high priority) before the more general rules which should
be applied for all other countries.
You can also specify the "entity" or "variable" in the selection, which will
limit the rule to a specific entity or variable, respectively. For each
DataArray in the input_data Dataset, the variable is its name, the entity is
the value of the key `entity` in its attrs.
strategy_definition
Defines the filling strategies to be used when filling timeseries with other
timeseries. Again, the priority is defined by a list of selections and
corresponding strategies which are used "from left to right". Selections can use
any dimension and don't have to apply to only one timeseries. For example, to
define a default strategy which should be used for all timeseries unless
something else is configured, configure an empty selection as the last
(rightmost) entry.
You can also specify the "entity" or "variable" in the selection, which will
limit the rule to a specific entity or variable, respectively. For each
DataArray in the input_data Dataset, the variable is its name, the entity is
the value of the key `entity` in its attrs.
    result_prio_coords
        Defines the values for the priority coordinates in the output dataset. As the
        priority coordinates differ between the input sources, there is no canonical
        value for the result, so it has to be defined explicitly.
    limit_coords
        Optional parameter to remove data for coordinate values not needed for the
        composition from the input data. The time coordinate is treated separately.
    time_range
        Optional parameter to limit the time coverage of the input data. Currently,
        only a (year_from, year_to) tuple is supported.
    metadata
        Set metadata values such as title and references.
progress_bar
By default, show progress bars using the tqdm package during the
operation. If None, don't show any progress bars. You can supply a class
compatible to tqdm.tqdm's protocol if you want to customize the progress bar.

Returns
-------
xr.Dataset with composed data according to the given priority and strategy
definitions

"""

    # limit input data to the values needed for the composition
    if limit_coords is not None:
        if "variable" in limit_coords:
            # select the requested variables, then filter the remaining coordinates
            # without mutating the dict passed in by the caller
            variables = limit_coords["variable"]
            other_coords = {k: v for k, v in limit_coords.items() if k != "variable"}
            input_ds = input_ds[variables].pr.loc[other_coords]
        else:
            input_ds = input_ds.pr.loc[limit_coords]

# set time range according to input
if time_range is not None:
input_ds = input_ds.pr.loc[
{"time": pd.date_range(time_range[0], time_range[1], freq="YS", inclusive="both")}
Review comments on this line:

Member: I don't really like the very limited time range functionality. Maybe just leave it out, or explicitly call it start_year and end_year, so people don't expect anything more fancy? The problem is otherwise that the time_range name is already taken in the public API, and when we need the possibility to filter for months or exclude the year 2020 or something, it will be awkward to include the new functionality. With the names start_year and end_year, we can then add a time_range parameter which takes a pd.date_range object directly, or whatever works for us.

Contributor (author): I thought we could make it more versatile later. But I think I can do that now, because I've spent the last two weeks fighting with time ranges in xarray, so it should be a quick fix.

Contributor (author): I would limit the option to slices and a datetime index. When using the datetime index, users would have to make sure the values actually exist. For the composite source generator, the only use case I can think of is a range, though. But it's good if it's more versatile than just start and end year, and if e.g. the interval can be specified.
]

# run compose
result_ds = compose(
input_data=input_ds,
priority_definition=priority_definition,
strategy_definition=strategy_definition,
progress_bar=progress_bar,
)

# set priority coordinates
result_ds = set_priority_coords(result_ds, result_prio_coords)

    if metadata is not None:
        result_ds.attrs.update(metadata)

result_ds.pr.ensure_valid()

return result_ds