Dynamic time selection (#2055)

### Pull Request Checklist: - [x] This PR addresses an already opened issue (for bug fixes / features) - This PR fixes #1987 - [x] Tests for the changes have been added (for bug fixes / features) - [x] (If applicable) Documentation has been added / updated (for bug fixes / features) - [x] CHANGELOG.rst has been updated (with summary of main changes) - [x] Link to issue (:issue:`number`) and pull request (:pull:`number`) has been added ### What kind of change does this PR introduce? The PR improves the `select_time` function providing the possibility to select time based on `xr.DataArray` of days of year. - Move `_get_doys` outside `select_time` - Create `mask_between_doys` function - Edit `select_time` to work with the new workflow to select time with doys ### Other information: I've run pytest but I had an error (ERROR tests/test_sdba/test_adjustment.py::TestLoci::test_reduce_dims) and a failed (FAILED tests/test_modules.py::test_encoding - AttributeError: module '_locale' has no attribute 'nl_langinfo') but I don't know how to fix this.
Ouranosinc · Feb 17, 2025 · 7cec9a4 · 7cec9a4
2 parents 84e9751 + a96ad8d
commit 7cec9a4
Show file tree

Hide file tree

Showing 10 changed files with 293 additions and 77 deletions.
diff --git a/.zenodo.json b/.zenodo.json
@@ -153,6 +153,11 @@
       "name": "Lehner, Sebastian",
       "affiliation": "GeoSphere Austria, Vienna, Austria",
       "orcid": "0000-0002-7562-8172"
+    },
+    {
+      "name": "Hamon, Baptiste",
+      "affiliation": "University of Canterbury, Christchurch, New Zealand",
+      "orcid": "0009-0007-4530-9772"
     }
   ],
   "keywords": [

diff --git a/AUTHORS.rst b/AUTHORS.rst
@@ -48,3 +48,4 @@ Contributors
 * Adrien Lamarche `@LamAdr <https://github.com/LamAdr>`_
 * Faisal Mahmood <[email protected]> <[email protected]> `@faimahsho <https://github.com/faimahsho>`_
 * Sebastian Lehner <[email protected]> `@seblehner <https://github.com/seblehner>`_
+* Baptiste Hamon <[email protected]> `@baptistehamon <https://github.com/baptistehamon>`_
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -4,7 +4,7 @@ Changelog
 
 v0.55.0 (unreleased)
 --------------------
-Contributors to this version: Juliette Lavoie (:user:`juliettelavoie`), Trevor James Smith (:user:`Zeitsperre`), Sascha Hofmann (:user:`saschahofmann`), Pascal Bourgault (:user:`aulemahal`), Éric Dupuis (:user:`coxipi`).
+Contributors to this version: Juliette Lavoie (:user:`juliettelavoie`), Trevor James Smith (:user:`Zeitsperre`), Sascha Hofmann (:user:`saschahofmann`), Pascal Bourgault (:user:`aulemahal`), Éric Dupuis (:user:`coxipi`), Baptiste Hamon (:user:`baptistehamon`).
 
 Breaking changes
 ^^^^^^^^^^^^^^^^
@@ -29,16 +29,18 @@ New features and enhancements
 * ``xclim.testing.helpers.test_timeseries`` now accepts a `calendar` argument that is forwarded to ``xr.cftime_range``. (:pull:`2019`).
 * New ``xclim.indices.fao_allen98``, exporting the FAO-56 Penman-Monteith equation for potential evapotranspiration (:issue:`2004`, :pull:`2067`).
 * Missing values method "pct" and "at_least_n" now accept a new "subfreq" option that allows to compute the missing mask in two steps. When given, the algorithm is applied at this "subfreq" resampling frequency first and then the result is resampled at the target indicator frequency. In the output, a period is invalid if any of its subgroup where flagged as invalid by the chosen method. (:pull:`2058`, :issue:`1820`).
-* `rv_continuous` functions can now be given directly as the `dist` argument in ``standardized_precipitation_index`` and ``standardized_precipitation_evapotranspiration_index`` indicators. This includes `lmoments3` distribution when specifying `method="PWM"`. (:issue:`2043`, :pull:`2045`).
+* ``rv_continuous`` functions can now be given directly as the ``dist`` argument in ``standardized_precipitation_index`` and ``standardized_precipitation_evapotranspiration_index`` indicators. This includes `lmoments3` distribution when specifying ``method="PWM"``. (:issue:`2043`, :pull:`2045`).
+* Time selection in ``xclim.core.calendar.select_time`` and the ``**indexer`` argument of indicators now supports day-of-year bounds given as DataArrays with spatial and/or temporal dimensions. (:issue:`1987`, :pull:`2055`).
 
 Internal changes
 ^^^^^^^^^^^^^^^^
 * `sphinx-codeautolink` and `pygments` have been temporarily pinned due to breaking API changes. (:pull:`2030`).
 * Adjusted the ``TestOfficialYaml`` test to use a dynamic method for finding the installed location of `xclim`. (:pull:`2028`).
 * Adjusted two tests for better handling when running in Windows environments. (:pull:`2057`).
-* Refactor of the ``xclim.core.missing`` module, usage of the ``Missing`` objects has been broken. (:pull:`2058`, :issue:`1820`, :issue:`2000`).
+* Refactor of the ``xclim.core.missing`` module, usage of the ``Missing`` objects has been broken. (:pull:`2058`, :pull:`2055`, :pull:`2076`, :issue:`1820`, :issue:`2000`).
     - Objects are initialized with their options and then called with the data, input frequency, target frequency and indexer.
     - Subclasses receive non-resampled DataArray in their ``is_missing`` methods.
+    - Subclasses receive the array of valid timesteps ``valid`` instead of ``null``, the invalid ones.
     - ``MissingWMO`` now uses ``xclim.indices.helpers.resample_map`` which should greatly improve performance in a dask context.
 * There is now a warning stating that `fitkwargs` are not employed when using the `lmoments3` distribution. One exception is the use of `'floc'` which is allowed with the gamma distribution. `'floc'` is used to shift the distribution before computing fitting parameters with the `lmoments3` distribution since ``loc=0`` is always assumed in the library. (:issue:`2043`, :pull:`2045`).
 

diff --git a/docs/notebooks/customize.ipynb b/docs/notebooks/customize.ipynb
@@ -157,7 +157,7 @@
     "\n",
     "Finally, it is possible for advanced users to register their own methods. Xclim's missing methods are in fact class-based. To create a custom missing class, one should implement a subclass of `xclim.core.checks.MissingBase` and override at least the `is_missing` method. This method should take the following arguments:\n",
     "\n",
-    "- `null`, a `DataArray` of the mask of invalid values in the input data array (with the same time coordinate as the raw data).\n",
+    "- `valid`, a `DataArray` of the mask of valid values in the input data array (with the same time coordinate as the raw data).\n",
     "- `count`, `DataArray` of the number of days in each resampled periods\n",
     "- `freq`, the resampling frequency.\n",
     "\n",
@@ -185,8 +185,9 @@
     "    def __init__(self, max_n: int = 5):\n",
     "        super().__init__(max_n=max_n)\n",
     "\n",
-    "    def is_missing(self, null, count, freq):\n",
+    "    def is_missing(self, valid, count, freq):\n",
     "        \"\"\"Return a boolean mask where True values are for elements that are considered missing and masked on the output.\"\"\"\n",
+    "        null = ~valid\n",
     "        return (\n",
     "            null.resample(time=freq).map(longest_run, dim=\"time\")\n",
     "            >= self.options[\"max_n\"]\n",

diff --git a/src/xclim/core/calendar.py b/src/xclim/core/calendar.py
@@ -1037,7 +1037,7 @@ def doy_to_days_since(
     # 2cases:
     # val is a day in the same year as its index : da - offset
     # val is a day in the next year : da + doy_max - offset
-    out = xr.where(dac > base_doy, dac, dac + doy_max) - start_doy
+    out = xr.where(dac >= base_doy, dac, dac + doy_max) - start_doy
     out.attrs.update(da.attrs)
     if start is not None:
         out.attrs.update(units=f"days after {start}")
@@ -1117,12 +1117,136 @@ def days_since_to_doy(
     return out.convert_calendar(base_calendar).rename(da.name)
 
 
+def _get_doys(start: int, end: int, inclusive: tuple[bool, bool]):
+    """
+    Get the day of year list from start to end.
+
+    Parameters
+    ----------
+    start : int
+        Start day of year.
+    end : int
+        End day of year.
+    inclusive : 2-tuple of booleans
+        Whether the bounds should be inclusive or not.
+
+    Returns
+    -------
+    np.ndarray
+        Array of day of year between the start and end.
+    """
+    if start <= end:
+        doys = np.arange(start, end + 1)
+    else:
+        doys = np.concatenate((np.arange(start, 367), np.arange(0, end + 1)))
+    if not inclusive[0]:
+        doys = doys[1:]
+    if not inclusive[1]:
+        doys = doys[:-1]
+    return doys
+
+
+def mask_between_doys(
+    da: xr.DataArray,
+    doy_bounds: tuple[int | xr.DataArray, int | xr.DataArray],
+    include_bounds: tuple[bool, bool] = [True, True],
+) -> xr.DataArray | xr.Dataset:
+    """
+    Mask the data outside the day of year bounds.
+
+    Parameters
+    ----------
+    da : xr.DataArray or xr.Dataset
+        Input data. It must have a time coordinate.
+    doy_bounds : 2-tuple of integers or DataArray
+        The bounds as (start, end) of the period of interest expressed in day-of-year, integers going from
+        1 (January 1st) to 365 or 366 (December 31st).
+        If DataArrays are passed, they must have the same coordinates on the dimensions they share.
+        They may have a time dimension, in which case the masking is done independently for each period defined by the coordinate,
+        which means the time coordinate must have an inferable frequency (see :py:func:`xr.infer_freq`).
+        Timesteps of the input not appearing in the time coordinate of the bounds are masked as "outside the bounds".
+        Missing values (nan) in the start and end bounds default to 1 and 366 respectively in the non-temporal case
+        and to open bounds (the start and end of the period) in the temporal case.
+    include_bounds : 2-tuple of booleans
+        Whether the bounds of `doy_bounds` should be inclusive or not.
+
+    Returns
+    -------
+    xr.DataArray
+        Boolean array with the same time coordinate as `da` and any other dimension present on the bounds.
+        True value inside the period of interest and False outside.
+    """
+    if isinstance(doy_bounds[0], int) and isinstance(doy_bounds[1], int):  # Simple case
+        mask = da.time.dt.dayofyear.isin(_get_doys(*doy_bounds, include_bounds))
+    else:
+        start, end = doy_bounds
+        # convert ints to DataArrays
+        if isinstance(start, int):
+            start = xr.full_like(end, start)
+        elif isinstance(end, int):
+            end = xr.full_like(start, end)
+        # Ensure they both have the same dims
+        # align join='exact' will fail on common but different coords, broadcast will add missing coords
+        start, end = xr.broadcast(*xr.align(start, end, join="exact"))
+
+        if not include_bounds[0]:
+            start += 1
+        if not include_bounds[1]:
+            end -= 1
+
+        if "time" in start.dims:
+            freq = xr.infer_freq(start.time)
+            # Convert the doy bounds to a duration since the beginning of each period defined in the bound's time coordinate
+            # Also ensures the bounds share the sime time calendar as the input
+            # Any missing value is replaced with the min/max of possible values
+            calkws = dict(
+                calendar=da.time.dt.calendar, use_cftime=(da.time.dtype == "O")
+            )
+            start = doy_to_days_since(start.convert_calendar(**calkws)).fillna(0)
+            end = doy_to_days_since(end.convert_calendar(**calkws)).fillna(366)
+
+            out = []
+            # For each period, mask the days since between start and end
+            for base_time, indexes in da.resample(time=freq).groups.items():
+                group = da.isel(time=indexes)
+
+                if base_time in start.time:
+                    start_d = start.sel(time=base_time)
+                    end_d = end.sel(time=base_time)
+
+                    # select days between start and end for group
+                    days = (group.time - base_time).dt.days
+                    days = days.where(days >= 0)
+                    mask = (days >= start_d) & (days <= end_d)
+                else:  # This group has no defined bounds : put False in the mask
+                    # Array with the same shape as the "mask" in the other case : broadcast of time and bounds dims
+                    template = xr.broadcast(
+                        group.time.dt.day, start.isel(time=0, drop=True)
+                    )[0]
+                    mask = xr.full_like(template, False, dtype="bool")
+                out.append(mask)
+            mask = xr.concat(out, dim="time")
+        else:  # Only "Spatial" dims, we can't constrain as in days since, so there are two cases
+            doys = da.time.dt.dayofyear  # for readability
+            # Any missing value is replaced with the min/max of possible values
+            start = start.fillna(1)
+            end = end.fillna(366)
+            mask = xr.where(
+                start <= end,
+                # case 1 : start <= end, ROI is within a calendar year
+                (doys >= start) & (doys <= end),
+                # case 2 : start >  end, ROI crosses the new year
+                ~((doys > end) & (doys < start)),
+            )
+    return mask
+
+
 def select_time(
     da: xr.DataArray | xr.Dataset,
     drop: bool = False,
     season: str | Sequence[str] | None = None,
     month: int | Sequence[int] | None = None,
-    doy_bounds: tuple[int, int] | None = None,
+    doy_bounds: tuple[int | xr.DataArray, int | xr.DataArray] | None = None,
     date_bounds: tuple[str, str] | None = None,
     include_bounds: bool | tuple[bool, bool] = True,
 ) -> DataType:
@@ -1138,14 +1262,16 @@ def select_time(
     da : xr.DataArray or xr.Dataset
         Input data.
     drop : bool
-        Whether to drop elements outside the period of interest or to simply mask them (default).
+        Whether to drop elements outside the period of interest (True) or to simply mask them (False, default).
+        This option is incompatible with passing array-like doy_bounds.
     season : str or sequence of str, optional
         One or more of 'DJF', 'MAM', 'JJA' and 'SON'.
     month : int or sequence of int, optional
         Sequence of month numbers (January = 1 ... December = 12).
-    doy_bounds : 2-tuple of int, optional
+    doy_bounds : 2-tuple of int or xr.DataArray, optional
         The bounds as (start, end) of the period of interest expressed in day-of-year, integers going from
-        1 (January 1st) to 365 or 366 (December 31st).
+        1 (January 1st) to 365 or 366 (December 31st). If a combination of int and xr.DataArray is given,
+        the int day-of-year corresponds to the year of the xr.DataArray.
         If calendar awareness is needed, consider using ``date_bounds`` instead.
     date_bounds : 2-tuple of str, optional
         The bounds as (start, end) of the period of interest expressed as dates in the month-day (%m-%d) format.
@@ -1187,17 +1313,6 @@ def select_time(
     if N == 0:
         return da
 
-    def _get_doys(_start, _end, _inclusive):
-        if _start <= _end:
-            _doys = np.arange(_start, _end + 1)
-        else:
-            _doys = np.concatenate((np.arange(_start, 367), np.arange(0, _end + 1)))
-        if not _inclusive[0]:
-            _doys = _doys[1:]
-        if not _inclusive[1]:
-            _doys = _doys[:-1]
-        return _doys
-
     if isinstance(include_bounds, bool):
         include_bounds = (include_bounds, include_bounds)
 
@@ -1212,7 +1327,15 @@ def _get_doys(_start, _end, _inclusive):
         mask = da.time.dt.month.isin(month)
 
     elif doy_bounds is not None:
-        mask = da.time.dt.dayofyear.isin(_get_doys(*doy_bounds, include_bounds))
+        if (
+            not (isinstance(doy_bounds[0], int) and isinstance(doy_bounds[1], int))
+            and drop
+        ):
+            # At least one of those is an array, this drop won't work
+            raise ValueError(
+                "Passing array-like doy bounds is incompatible with drop=True."
+            )
+        mask = mask_between_doys(da, doy_bounds, include_bounds)
 
     elif date_bounds is not None:
         # This one is a bit trickier.