Add wrapper for csg #286
Open: JGuetschow wants to merge 4 commits into main from csg_regression_test
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Added a wrapper for the csg `compose` function to handle input data preparation (remove data which is not needed in the process) and output data handling (set coords and metadata) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,6 +30,8 @@ When no missing information is left in the result timeseries, the algorithm term | |
It also terminates if all source timeseries are used, even if missing information is | ||
left. | ||
|
||
## The `compose` function | ||
|
||
The core function to use is the {py:func}`primap2.csg.compose` function. | ||
It needs the following input: | ||
|
||
|
@@ -111,8 +113,7 @@ priority_definition = primap2.csg.PriorityDefinition( | |
``` | ||
|
||
```{code-cell} ipython3 | ||
# Currently, there is only one strategy implemented, so we use | ||
# the empty selector {}, which matches everything, to configure | ||
# We use the empty selector {}, which matches everything, to configure | ||
# to use the substitution strategy for all timeseries. | ||
strategy_definition = primap2.csg.StrategyDefinition( | ||
strategies=[({}, primap2.csg.SubstitutionStrategy())] | ||
|
@@ -125,6 +126,7 @@ result_ds = primap2.csg.compose( | |
priority_definition=priority_definition, | ||
strategy_definition=strategy_definition, | ||
progress_bar=None, # The animated progress bar is useless in the generated documentation | ||
|
||
) | ||
|
||
result_ds | ||
|
@@ -162,3 +164,62 @@ category 1 "lowpop" was preferred. | |
For category 0, the initial timeseries did not contain NaNs, so no filling was needed. | ||
For category 1, there was information missing in the initial timeseries, so the | ||
lower-priority timeseries was used to fill the holes. | ||
|
||
## The `create_composite_source` wrapper function | ||
|
||
The {py:func}`primap2.csg.compose` function creates a composite time series according to | ||
the given priorities and strategies, but it does not take care of pre- and postprocessing | ||
of the data. It carries along unnecessary data, and the resulting dataset lacks the | ||
priority coordinates. The {py:func}`primap2.csg.create_composite_source` function takes care | ||
of these steps: it prepares the input data and completes the output data to a primap2 dataset | ||
with all desired dimensions and metadata. | ||
|
||
The function takes the same inputs as {py:func}`primap2.csg.compose` with additional input to | ||
define pre- and postprocessing: | ||
|
||
* **result_prio_coords** Defines the values for the priority coordinates in the output dataset. As the | ||
priority coordinates differ for all input sources, there is no canonical value | ||
for the result, and it has to be defined explicitly. | ||
* **metadata** Sets metadata values such as title and references. | ||
|
||
```{code-cell} ipython3 | ||
result_prio_coords = { | ||
"source": {"value": "PRIMAP-test"}, | ||
"scenario": {"value": "HISTORY", "terminology": "PRIMAP"}, | ||
} | ||
metadata = {"references": "test-data", "contact": "[email protected]"} | ||
|
||
``` | ||
|
||
* **limit_coords** Optional parameter to remove data for coordinate values not needed for the | ||
composition from the input data. The time coordinate is treated separately. | ||
* **time_range** Optional parameter to limit the time coverage of the input data. Currently, | ||
only (year_from, year_to) is supported. | ||
|
||
|
||
```{code-cell} ipython3 | ||
limit_coords = {'area (ISO3)': ['COL', 'ARG', 'MEX']} | ||
time_range = ("2000", "2010") | ||
|
||
``` | ||
|
||
```{code-cell} ipython3 | ||
complete_result_ds = primap2.csg.create_composite_source( | ||
input_ds, | ||
priority_definition=priority_definition, | ||
strategy_definition=strategy_definition, | ||
result_prio_coords=result_prio_coords, | ||
limit_coords=limit_coords, | ||
time_range=time_range, | ||
metadata=metadata, | ||
progress_bar=None, | ||
) | ||
|
||
complete_result_ds | ||
``` | ||
|
||
|
||
## Filling strategies | ||
Currently, the following filling strategies are implemented: | ||
* Global least squares matching: {py:func}`primap2.csg.GlobalLSStrategy` | ||
* Straight substitution: {py:func}`primap2.csg.SubstitutionStrategy` |
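
The idea behind straight substitution can be illustrated with plain pandas. This is only a sketch of the concept, not the primap2 implementation: the two series and their values are made up, and `Series.combine_first` stands in for the strategy. Values of the higher-priority series are kept, and only its NaNs are filled from the lower-priority series.

```python
import numpy as np
import pandas as pd

years = pd.date_range("2000", "2004", freq="YS")

# Hypothetical high-priority series with gaps and a complete lower-priority one.
high = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0], index=years)
low = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=years)

# Keep every value of the high-priority series and fill only its NaNs.
filled = high.combine_first(low)
print(filled.tolist())  # [1.0, 20.0, 3.0, 40.0, 5.0]
```

A strategy such as global least squares matching would presumably adjust the lower-priority values before filling; that step is not shown here.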
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,132 @@ | ||
import pandas as pd | ||
import tqdm | ||
import xarray as xr | ||
|
||
from ._compose import compose | ||
from ._models import PriorityDefinition, StrategyDefinition | ||
|
||
|
||
def set_priority_coords( | ||
    ds: xr.Dataset, | ||
    dims: dict[str, dict[str, str]], | ||
) -> xr.Dataset: | ||
    """Set values for priority coordinates in output dataset | ||
|
||
    dims: Dictionary | ||
        Format is 'name': {'value': value, 'terminology': terminology} | ||
        terminology is optional | ||
|
||
    """ | ||
    for dim in dims.keys(): | ||
        if "terminology" in dims[dim].keys(): | ||
            terminology = dims[dim]["terminology"] | ||
        else: | ||
            terminology = None | ||
        ds = ds.pr.expand_dims(dim=dim, coord_value=dims[dim]["value"], terminology=terminology) | ||
|
||
    return ds | ||
|
||
|
||
def create_composite_source( | ||
    input_ds: xr.Dataset, | ||
    priority_definition: PriorityDefinition, | ||
    strategy_definition: StrategyDefinition, | ||
    result_prio_coords: dict[str, dict[str, str]], | ||
    limit_coords: dict[str, str | list[str]] | None = None, | ||
    time_range: tuple[str, str] | None = None, | ||
    metadata: dict[str, str] | None = None, | ||
    progress_bar: type[tqdm.tqdm] | None = tqdm.tqdm, | ||
) -> xr.Dataset: | ||
    """Create a composite data source | ||
|
||
    This is a wrapper around `primap2.csg.compose` that prepares the input data and sets result | ||
    values for the priority coordinates. | ||
|
||
|
||
    Parameters | ||
    ---------- | ||
    input_ds | ||
        Dataset containing all input data | ||
    priority_definition | ||
        Defines the priorities to select timeseries from the input data. Priorities | ||
        are formed by a list of selections and are used "from left to right", where the | ||
        first matching selection has the highest priority. Each selection has to specify | ||
        values for all priority dimensions (so that exactly one timeseries is selected | ||
        from the input data), but can also specify other dimensions. That way it is, | ||
        e.g., possible to define a different priority for a specific country by listing | ||
        it early (i.e. with high priority) before the more general rules which should | ||
        be applied for all other countries. | ||
        You can also specify the "entity" or "variable" in the selection, which will | ||
        limit the rule to a specific entity or variable, respectively. For each | ||
        DataArray in the input_ds Dataset, the variable is its name, the entity is | ||
        the value of the key `entity` in its attrs. | ||
    strategy_definition | ||
        Defines the filling strategies to be used when filling timeseries with other | ||
        timeseries. Again, the priority is defined by a list of selections and | ||
        corresponding strategies which are used "from left to right". Selections can use | ||
        any dimension and don't have to apply to only one timeseries. For example, to | ||
        define a default strategy which should be used for all timeseries unless | ||
        something else is configured, configure an empty selection as the last | ||
        (rightmost) entry. | ||
        You can also specify the "entity" or "variable" in the selection, which will | ||
        limit the rule to a specific entity or variable, respectively. For each | ||
        DataArray in the input_ds Dataset, the variable is its name, the entity is | ||
        the value of the key `entity` in its attrs. | ||
    result_prio_coords | ||
        Defines the values for the priority coordinates in the output dataset. As the | ||
        priority coordinates differ for all input sources, there is no canonical value | ||
        for the result, and it has to be defined explicitly. | ||
    limit_coords | ||
        Optional parameter to remove data for coordinate values not needed for the | ||
        composition from the input data. The time coordinate is treated separately. | ||
    time_range | ||
        Optional parameter to limit the time coverage of the input data. Currently, | ||
        only (year_from, year_to) is supported. | ||
    metadata | ||
        Set metadata values such as title and references | ||
    progress_bar | ||
        By default, show progress bars using the tqdm package during the | ||
        operation. If None, don't show any progress bars. You can supply a class | ||
        compatible to tqdm.tqdm's protocol if you want to customize the progress bar. | ||
|
||
    Returns | ||
    ------- | ||
        xr.Dataset with composed data according to the given priority and strategy | ||
        definitions | ||
|
||
    """ | ||
|
||
    # limit input data to these values | ||
    if limit_coords is not None: | ||
        # work on a copy so the caller's dict is not modified | ||
        limit_coords = dict(limit_coords) | ||
        if "variable" in limit_coords.keys(): | ||
            variables = limit_coords.pop("variable") | ||
            input_ds = input_ds[variables].pr.loc[limit_coords] | ||
|
||
        else: | ||
            input_ds = input_ds.pr.loc[limit_coords] | ||
|
||
    # set time range according to input | ||
    if time_range is not None: | ||
        input_ds = input_ds.pr.loc[ | ||
            {"time": pd.date_range(time_range[0], time_range[1], freq="YS", inclusive="both")} | ||
        ] | ||
|
||
    # run compose | ||
    result_ds = compose( | ||
        input_data=input_ds, | ||
        priority_definition=priority_definition, | ||
        strategy_definition=strategy_definition, | ||
        progress_bar=progress_bar, | ||
    ) | ||
|
||
    # set priority coordinates | ||
    result_ds = set_priority_coords(result_ds, result_prio_coords) | ||
|
||
    if metadata is not None: | ||
        for key in metadata.keys(): | ||
            result_ds.attrs[key] = metadata[key] | ||
|
||
    result_ds.pr.ensure_valid() | ||
|
||
    return result_ds |
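
The wrapper separates an optional `variable` entry from the remaining coordinate selection in `limit_coords`. One way to do this while leaving the caller's dictionary unchanged is sketched below. The helper name `split_variable_selection` and the example values are hypothetical and not part of primap2; the sketch uses plain dictionaries only.

```python
from collections.abc import Mapping


def split_variable_selection(
    limit_coords: Mapping[str, object],
) -> tuple[object, dict[str, object]]:
    """Return (variables, remaining selection) without touching the input mapping."""
    remaining = dict(limit_coords)  # shallow copy, the input stays as it was
    variables = remaining.pop("variable", None)  # None if no variable was given
    return variables, remaining


# Hypothetical selection; the variable names are illustrative only.
selection = {"variable": ["CO2", "CH4"], "area (ISO3)": ["COL", "ARG"]}
variables, coords = split_variable_selection(selection)
```

Because the copy is shallow, the same `limit_coords` dictionary can be reused for a second call without the `variable` entry having disappeared.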
I don't really like the very limited time range functionality. Maybe just leave it out or explicitly call it `start_year` and `end_year`, so people don't expect anything more fancy? The problem is otherwise that the `time_range` name is already taken in the public API, and when we need the possibility to filter for months or exclude the year 2020 or something, it will be awkward to include the new functionality. With the names `start_year` and `end_year`, we can then add a `time_range` parameter which takes a pd.date_range object directly or whatever works for us.
I thought we could make it more versatile later. But I think I can do that now because I've spent the last two weeks fighting with time ranges in xarray, so it should be a quick fix.
I would limit the options to a slice and a datetime index. When using the datetime index, users would have to make sure the values actually exist.
For the composite source generator, the only use case I can think of is a range, though. But it's good if it's more versatile than just start and end year and, e.g., the interval can be specified.
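
The difference between the two selection styles can be sketched with plain pandas. This is an illustration under assumptions: a pandas `Series` with made-up yearly data stands in for the xarray dataset, and the actual behaviour of the primap2 `pr.loc` accessor may differ. A slice tolerates bounds outside the data, while an explicit datetime index requires every requested label to exist.

```python
import pandas as pd

# Yearly toy series covering 2000-2005 only (hypothetical data).
years = pd.date_range("2000", "2005", freq="YS")
series = pd.Series(range(len(years)), index=years)

# Slice selection: bounds outside the data are tolerated.
by_slice = series.loc["1998":"2003"]

# Explicit index selection: every requested label has to exist.
wanted = pd.date_range("1998", "2003", freq="YS")
try:
    series.loc[wanted]
    missing_raises = False
except KeyError:
    missing_raises = True

# Restricting the requested labels to existing ones avoids the error.
by_index = series.loc[wanted.intersection(series.index)]
```

With an explicit index, the interval is also under the caller's control, for example by changing `freq`, which is the extra flexibility mentioned above.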