Selection of input data along time coordinate fails #68

Open
observingClouds opened this issue Feb 12, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@observingClouds
Contributor

Thanks @matschreiner for your great work in #55. I just tried this and ran into an issue when selecting along the time dimension.

What I did

First, I provided a start and end datetime in my config file, which resulted in:

E       dataclass_wizard.errors.ParseError: Failure parsing field `start` in class `Range`. Expected a type [<class 'str'>, <class 'int'>, <class 'float'>], got datetime.
E         value: datetime.datetime(2022, 4, 1, 0, 0)
E         error: Object was not in any of Union types
E         tag_key: '__tag__'
E         json_object: '{"start": "2022-04-01T00:00:00", "end": "2022-04-01T03:00:00"}'

Second, I tried providing the time as a string, but this resulted in:

    def check_point_in_dataset(coord, point, ds):
        """
        check that the requested point is in the data.
        """
        if point is not None and point not in ds[coord].values:
>           raise ValueError(
                f"Provided value for coordinate {coord} ({point}) is not in the data."
            )
E           ValueError: Provided value for coordinate time (2022-04-10 00:00:00) is not in the data.

The second issue stems from check_point_in_dataset(), which does not convert between types: the config provides a str while the dataset holds datetime values, so the check fails even when the requested time is available:

>>> import xarray as xr
>>> ds = xr.open_zarr("https://object-store.os-api.cci1.ecmwf.int/mllam-testdata/danra_cropped/v0.2.0/pressure_levels.zarr")
>>> ds.sel({'time': slice("2022-04-01T00:00:00","2022-04-01T03:00:00")}).time
Out[8]: 
<xarray.DataArray 'time' (time: 2)> Size: 16B
array(['2022-04-01T00:00:00.000000000', '2022-04-01T03:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 16B 2022-04-01 2022-04-01T03:00:00
Attributes:
    standard_name:  time
>>> "2022-04-01T00:00:00" in ds['time'].values
False
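
The mismatch can be reproduced without the dataset. The following is a minimal sketch using only the standard library; `contains_time` is a hypothetical helper, not part of mllam_data_prep, and the plain list stands in for `ds['time'].values`. It shows that a string never compares equal to a datetime, and that converting the requested point first makes the membership test behave as expected.

```python
from datetime import datetime


def contains_time(point, times):
    """Return True if `point` (str or datetime) matches an entry in `times`.

    Hypothetical helper for illustration: ISO 8601 strings are converted
    to datetime before the membership test.
    """
    if isinstance(point, str):
        point = datetime.fromisoformat(point)
    return point in times


# Stand-in for the time coordinate of the dataset (3-hourly steps).
times = [datetime(2022, 4, 1, 0), datetime(2022, 4, 1, 3)]

# A raw string is never equal to a datetime, so this is False.
naive_result = "2022-04-01T00:00:00" in times

# After conversion the same point is found.
converted_result = contains_time("2022-04-01T00:00:00", times)
```

The same idea would apply to the real coordinate values, although leaving the selection and its error handling to xarray may make such a check unnecessary.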

Also, is there a reason why check_point_in_dataset() is only called when a coordinate is named time? Do we need this check at all? Doesn't xarray already raise a good error message?

What I expected
I expected both attempts to work.

@matschreiner
Contributor

@observingClouds I'll have a look at this next week.
I agree that we should not check whether the exact point is in the dataset and should let xarray handle slicing and errors altogether.
@leifdenby would you be okay with removing this check? I think we had a little discussion about it.

@matschreiner
Contributor

@observingClouds Could you help me reproduce the first error?
It looks like a datetime object is somehow passed to dataclass_wizard. I have no experience with this, but how does it happen, given that you can only write strings or ints/floats in the YAML file?

I tried adding datetime as an accepted type for Range's start and end points and wrote a test that instantiates the range with datetime objects. The test passed, but it looks like the type is not actually checked.

If I can reproduce the error I'll write a test that I can work from.
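
One possible explanation for the passing test, offered as an assumption rather than a confirmed diagnosis: standard Python dataclasses do not validate field types when instantiated directly, so a type check would only come into play when a config is parsed from a dict or YAML. The sketch below uses a hypothetical `Range` stand-in, not the class from mllam_data_prep, to show that direct construction accepts a datetime even though the annotation does not list it.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Union


@dataclass
class Range:
    """Hypothetical stand-in for a coordinate range in the config."""

    start: Union[str, int, float]
    end: Union[str, int, float]
    step: Optional[Union[str, int, float]] = None


# Type annotations are not enforced at construction time, so this succeeds
# even though datetime is not part of the annotated Union.
r = Range(start=datetime(2022, 4, 1, 0), end=datetime(2022, 4, 1, 3))
start_is_datetime = isinstance(r.start, datetime)
```

If this is what happens, a test that goes through the parsing path (loading from a dict or YAML string) would be needed to exercise the type check.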

@observingClouds
Contributor Author

Sure, you can also write the time without quotes:

datetime type

    coord_ranges:
      time:
        start: 2022-04-10T00:00:00
        end: 2022-04-11T00:00:00

vs.

string type

    coord_ranges:
      time:
        start: "2022-04-10T00:00:00"
        end: "2022-04-11T00:00:00"
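
Because the two spellings can reach the config parser as different Python types, one possible workaround is to normalise the values before they are parsed. The sketch below is an assumption about how this could be done, not the project's actual code; `normalise_endpoint` is a hypothetical helper that turns datetime objects into ISO 8601 strings and leaves strings and numbers untouched.

```python
from datetime import datetime


def normalise_endpoint(value):
    """Convert datetime endpoints to ISO 8601 strings; pass others through."""
    if isinstance(value, datetime):
        return value.isoformat()
    return value


# Values as they might arrive after loading the two YAML variants above.
unquoted = {"start": datetime(2022, 4, 10), "end": datetime(2022, 4, 11)}
quoted = {"start": "2022-04-10T00:00:00", "end": "2022-04-11T00:00:00"}

# After normalisation both variants carry identical string values.
normalised = {key: normalise_endpoint(val) for key, val in unquoted.items()}
```

With this normalisation both variants would present the same values to the parser; the alternative discussed in this thread is to accept datetime as a valid type for the range endpoints.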

@matschreiner
Contributor

@observingClouds
Hm, I have created these two tests:

def test_can_load_config_with_datetime_object_in_time_range():
    fp = "tests/resources/sliced_example.danra.yaml"
    mdp.Config.from_yaml_file(fp)

def test_can_load_config_with_datetime_string_in_time_range():
    fp = "tests/resources/sliced_example_with_datetime_strings.danra.yaml"
    mdp.Config.from_yaml_file(fp)

Where I have

    coord_ranges:
      time:
        start: 1990-09-03T00:00
        end: 1990-09-09T00:00
        step: PT3H

in one and

    coord_ranges:
      time:
        start: "1990-09-03T00:00"
        end: "1990-09-09T00:00"
        step: "PT3H"

in the other, but I can't seem to reproduce your error; both tests pass.

@matschreiner matschreiner mentioned this issue Feb 14, 2025
20 tasks
@observingClouds
Contributor Author

Alright, here is the script that produced the error for me:

import yaml

import mllam_data_prep as mdp

with open("example.danra.yaml", "r") as file:
    BASE_CONFIG = file.read()

HEIGHT_LEVEL_TEST_SECTION = """\
inputs:
  danra_height_levels:
    path: https://object-store.os-api.cci1.ecmwf.int/mllam-testdata/danra_cropped/v0.2.0/height_levels.zarr
    dims: [time, x, y, altitude]
    variables:
      u:
        altitude:
          values: [100, 50,]
          units: m
      v:
        altitude:
          values: [100, 50, ]
          units: m
    dim_mapping:
      time:
        method: rename
        dim: time
      state_feature:
        method: stack_variables_by_var_name
        dims: [altitude]
        name_format: "{var_name}{altitude}m"
      grid_index:
        method: stack
        dims: [x, y]
    coord_ranges:
      time:
        start: 2022-04-01T00:00:00
        end: 2022-04-01T03:00:00
    target_output_variable: state
"""


def update_config(config: str, update: str):
    """
    Update provided config.

    Parameters
    ----------
    config: str
        String with config in yaml format
    update: str
        String with the update in yaml format

    Returns
    -------
    config: Config
        Updated config
    """
    original_config = mdp.Config.from_yaml(config)
    update = yaml.safe_load(update)
    modified_config = original_config.to_dict()
    modified_config.update(update)
    modified_config = mdp.Config.from_dict(modified_config)

    return modified_config


config = update_config(BASE_CONFIG, HEIGHT_LEVEL_TEST_SECTION)
ds = mdp.create_dataset(config=config)
print(any(ds.isnull().any().compute().to_array()))
# nan_in_ds = any(ds.isnull().any().to_array())
