Selection of input data along time coordinate fails #68

observingClouds · 2025-02-12T13:44:09Z

Thanks @matschreiner for your great work in #55. I just tried this and ran into an issue when selecting along the time dimension.

What I did

First, I provided a start and end datetime in my config file, which resulted in:

E       dataclass_wizard.errors.ParseError: Failure parsing field `start` in class `Range`. Expected a type [<class 'str'>, <class 'int'>, <class 'float'>], got datetime.
E         value: datetime.datetime(2022, 4, 1, 0, 0)
E         error: Object was not in any of Union types
E         tag_key: '__tag__'
E         json_object: '{"start": "2022-04-01T00:00:00", "end": "2022-04-01T03:00:00"}'

Second, I tried providing the time as a string, but this resulted in:

    def check_point_in_dataset(coord, point, ds):
        """
        check that the requested point is in the data.
        """
        if point is not None and point not in ds[coord].values:
>           raise ValueError(
                f"Provided value for coordinate {coord} ({point}) is not in the data."
            )
E           ValueError: Provided value for coordinate time (2022-04-10 00:00:00) is not in the data.

The second issue stems from check_point_in_dataset(), which does no type conversion: it compares the str provided in the config against the datetime values in the dataset and therefore fails even when the requested time is available:

>>> import xarray as xr
>>> ds = xr.open_zarr("https://object-store.os-api.cci1.ecmwf.int/mllam-testdata/danra_cropped/v0.2.0/pressure_levels.zarr")
>>> ds.sel({'time': slice("2022-04-01T00:00:00","2022-04-01T03:00:00")}).time
Out[8]: 
<xarray.DataArray 'time' (time: 2)> Size: 16B
array(['2022-04-01T00:00:00.000000000', '2022-04-01T03:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 16B 2022-04-01 2022-04-01T03:00:00
Attributes:
    standard_name:  time
>>> "2022-04-01T00:00:00" in ds['time'].values
False
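The `False` above comes from comparing a Python `str` against a `datetime64[ns]` array. A minimal sketch of a time-aware membership check, assuming the requested point is coerced with `np.datetime64` before comparing; `point_in_coord` is a hypothetical helper for illustration and not part of mllam-data-prep:

```python
import datetime

import numpy as np


def point_in_coord(values, point):
    """Hypothetical helper: is `point` one of `values`, after dtype coercion?"""
    values = np.asarray(values)
    if np.issubdtype(values.dtype, np.datetime64):
        # Accepts ISO strings and datetime.datetime alike, and matches the
        # coordinate's resolution (e.g. datetime64[ns]) before comparing.
        point = np.datetime64(point).astype(values.dtype)
    return bool((values == point).any())


times = np.array(
    ["2022-04-01T00:00:00", "2022-04-01T03:00:00"], dtype="datetime64[ns]"
)
in_as_str = point_in_coord(times, "2022-04-01T00:00:00")
in_as_datetime = point_in_coord(times, datetime.datetime(2022, 4, 1, 3))
missing = point_in_coord(times, "2022-04-01T01:00:00")
```

With this coercion the string and the `datetime` form of an existing time step are both found, while a time that is not in the coordinate is still reported as missing.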

Also, is there a reason why we call check_point_in_dataset() only when a coordinate is named time? Do we need this check at all? Doesn't xarray already raise a good error message?

What I expected
I expected both attempts to work.


matschreiner · 2025-02-12T15:22:44Z

@observingClouds I'll have a look at this next week.
I agree that we should not check whether the exact point is in the dataset, and should let xarray handle slicing and errors altogether.
@leifdenby would you be okay with removing this check? I think we had a little discussion about it.
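
For reference, a minimal sketch of what leaving the selection to xarray could look like. It uses a small synthetic dataset; the variable `u` and the four time steps are made up for illustration and are not taken from the DANRA test data:

```python
import datetime

import numpy as np
import xarray as xr

# Synthetic 3-hourly time axis, standing in for the real dataset.
times = np.array(
    ["2022-04-01T00:00", "2022-04-01T03:00", "2022-04-01T06:00", "2022-04-01T09:00"],
    dtype="datetime64[ns]",
)
ds = xr.Dataset({"u": ("time", np.arange(4.0))}, coords={"time": times})

# Label-based slicing accepts both ISO strings and datetime objects.
by_str = ds.sel(time=slice("2022-04-01T00:00:00", "2022-04-01T03:00:00"))
by_datetime = ds.sel(
    time=slice(datetime.datetime(2022, 4, 1, 0), datetime.datetime(2022, 4, 1, 3))
)

# A range entirely outside the data does not raise; it selects nothing.
out_of_range = ds.sel(time=slice("2023-01-01", "2023-01-02"))

n_by_str = by_str.sizes["time"]
n_by_datetime = by_datetime.sizes["time"]
n_out_of_range = out_of_range.sizes["time"]
```

Note that in this sketch the out-of-range slice yields an empty selection rather than an error, so slice-based selection on its own would not report a missing time range.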

matschreiner · 2025-02-13T18:49:43Z

@observingClouds Could you help me reproduce the first error?
It looks like a datetime object is somehow passed to dataclass_wizard. I have no experience with this, but how does it happen, since you can only write strings or ints/floats in the yaml file?

I tried adding datetime as an accepted datatype of Range's start and end points, and then I wrote a test that the range should instantiate using datetime objects, and it passed. But it looks like it doesn't check the type.

If I can reproduce the error I'll write a test that I can work from.

observingClouds · 2025-02-13T19:07:02Z

Sure, you can also write the time without quotes:

datetime type

    coord_ranges:
      time:
        start: 2022-04-10T00:00:00
        end: 2022-04-11T00:00:00

vs.

string type

    coord_ranges:
      time:
        start: "2022-04-10T00:00:00"
        end: "2022-04-11T00:00:00"
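
The difference between the two spellings can be checked directly, assuming the YAML is loaded with PyYAML's `yaml.safe_load` (as in a plain Python session; which loader `mllam_data_prep` uses internally is not shown here):

```python
import datetime

import yaml

# Unquoted ISO timestamp with seconds: resolved to a datetime object.
unquoted = yaml.safe_load("start: 2022-04-10T00:00:00")["start"]

# Quoted timestamp: stays a plain string.
quoted = yaml.safe_load('start: "2022-04-10T00:00:00"')["start"]

unquoted_is_datetime = isinstance(unquoted, datetime.datetime)
quoted_is_str = isinstance(quoted, str)
```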

matschreiner · 2025-02-14T12:31:04Z

@observingClouds
Hm, I have created these two tests:

def test_can_load_config_with_datetime_object_in_time_range():
    fp = "tests/resources/sliced_example.danra.yaml"
    mdp.Config.from_yaml_file(fp)

def test_can_load_config_with_datetime_string_in_time_range():
    fp = "tests/resources/sliced_example_with_datetime_strings.danra.yaml"
    mdp.Config.from_yaml_file(fp)

Where I have

    coord_ranges:
      time:
        start: 1990-09-03T00:00
        end: 1990-09-09T00:00
        step: PT3H

in one and

    coord_ranges:
      time:
        start: "1990-09-03T00:00"
        end: "1990-09-09T00:00"
        step: "PT3H"

in the other, but I can't seem to reproduce your error; both tests pass.
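
One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the files are parsed with PyYAML, an unquoted timestamp is only turned into a datetime when it includes seconds. A value such as `1990-09-03T00:00` would then be read as a string, so both test files would end up with string values. A short check of that behaviour:

```python
import yaml

# Timestamp with seconds, as in the failing config.
with_seconds = yaml.safe_load("start: 2022-04-01T00:00:00")["start"]

# Timestamp without seconds, as in the test resource above.
without_seconds = yaml.safe_load("start: 1990-09-03T00:00")["start"]

with_seconds_type = type(with_seconds).__name__
without_seconds_type = type(without_seconds).__name__
```

If this holds for the loader in use, writing the test timestamps as `1990-09-03T00:00:00` (unquoted, with seconds) might be enough to reproduce the error.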

observingClouds · 2025-02-15T15:57:42Z

Alright, here is the script that produced the error for me:

import mllam_data_prep as mdp
import pytest
import yaml

with open("example.danra.yaml", "r") as file:
    BASE_CONFIG = file.read()

HEIGHT_LEVEL_TEST_SECTION = """\
inputs:
  danra_height_levels:
    path: https://object-store.os-api.cci1.ecmwf.int/mllam-testdata/danra_cropped/v0.2.0/height_levels.zarr
    dims: [time, x, y, altitude]
    variables:
      u:
        altitude:
          values: [100, 50,]
          units: m
      v:
        altitude:
          values: [100, 50, ]
          units: m
    dim_mapping:
      time:
        method: rename
        dim: time
      state_feature:
        method: stack_variables_by_var_name
        dims: [altitude]
        name_format: "{var_name}{altitude}m"
      grid_index:
        method: stack
        dims: [x, y]
    coord_ranges:
      time:
        start: 2022-04-01T00:00:00
        end: 2022-04-01T03:00:00
    target_output_variable: state
"""


def update_config(config: str, update: str):
    """
    Update provided config.

    Parameters
    ----------
    config: str
        String with config in yaml format
    update: str
        String with the update in yaml format

    Returns
    -------
    config: Config
        Updated config
    """
    original_config = mdp.Config.from_yaml(config)
    update = yaml.safe_load(update)
    modified_config = original_config.to_dict()
    modified_config.update(update)
    modified_config = mdp.Config.from_dict(modified_config)

    return modified_config


config = update_config(BASE_CONFIG, HEIGHT_LEVEL_TEST_SECTION)
ds = mdp.create_dataset(config=config)
print(any(ds.isnull().any().compute().to_array()))
# nan_in_ds = any(ds.isnull().any().to_array())
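
A minimal sketch of a possible workaround for the script above, assuming the aim is to hand `Config.from_dict` only strings until datetime values are accepted: convert any datetime objects in the loaded dict to ISO 8601 strings first. `stringify_datetimes` is a hypothetical helper, not part of mllam-data-prep, and it only sidesteps the parsing error, not the check in check_point_in_dataset():

```python
import datetime


def stringify_datetimes(obj):
    """Hypothetical helper: recursively replace date/datetime values by ISO strings."""
    if isinstance(obj, dict):
        return {key: stringify_datetimes(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [stringify_datetimes(value) for value in obj]
    if isinstance(obj, (datetime.datetime, datetime.date)):
        return obj.isoformat()
    return obj


# Shape mirrors the coord_ranges section of the config above.
loaded = {
    "coord_ranges": {
        "time": {
            "start": datetime.datetime(2022, 4, 1, 0, 0),
            "end": datetime.datetime(2022, 4, 1, 3, 0),
        }
    }
}
cleaned = stringify_datetimes(loaded)
```

In the script, this would be applied to the result of `yaml.safe_load(update)` before calling `modified_config.update(update)`.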

observingClouds assigned matschreiner Feb 12, 2025

observingClouds added the bug Something isn't working label Feb 12, 2025

observingClouds mentioned this issue Feb 12, 2025

Provided step is ignored in coordinate selection #69

Open

observingClouds mentioned this issue Feb 13, 2025

Pr/60 ealerskans/mllam-data-prep#2

Merged

matschreiner mentioned this issue Feb 14, 2025

Fix selection #70

Open

20 tasks
