Dealing with netcdf time:units with unexpected characters #296

durack1 · 2022-10-08T03:32:21Z

I've found a delightful edge case that is a little hard to believe. It involves a netcdf time:units that includes a character outside of the [0-9,-] range. If it's not obvious from the below, the issue is that the time:units = "days since 20O1-1-1" whereas this should be time:units = "days since 2001-1-1" (so replacing the rogue "O" (oooh), with the numeral "0" zero).

The file is a 297MiB file downloadable from here

Below is the example reproducing the error:

Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xarray as xr
>>> xr.__version__
'2022.6.0'
>>> import cftime
>>> import cftime as cft
>>> cft.__version__
'1.6.1'
>>> ds = xr.open_dataset("/p/css03/esgf_publish/cmip3/ipcc/data3/sresa2/ice/mo/sic/ingv_echam4/run1/sic_O1.nc")
Traceback (most recent call last):
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 270, in decode_cf_datetime
    dates = _decode_datetime_with_pandas(flat_num_dates, units, calendar)
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 207, in _decode_datetime_with_pandas
    raise OutOfBoundsDatetime(
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Cannot decode times from a non-standard calendar, '360_day', using pandas.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 180, in _decode_cf_datetime_dtype
    result = decode_cf_datetime(example_value, units, calendar, use_cftime)
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 272, in decode_cf_datetime
    dates = _decode_datetime_with_cftime(
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 201, in _decode_datetime_with_cftime
    cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
  File "src/cftime/_cftime.pyx", line 549, in cftime._cftime.num2date
  File "src/cftime/_cftime.pyx", line 107, in cftime._cftime._dateparse
  File "src/cftime/_cftime.pyx", line 750, in cftime._cftime._parse_date
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/backends/api.py", line 531, in open_dataset
    backend_ds = backend.open_dataset(
  File "~mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 569, in open_dataset
    ds = store_entrypoint.open_dataset(
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/backends/store.py", line 29, in open_dataset
    vars, attrs, coord_names = conventions.decode_cf_variables(
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/conventions.py", line 521, in decode_cf_variables
    new_vars[k] = decode_cf_variable(
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/conventions.py", line 369, in decode_cf_variable
    var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 682, in decode
    dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
  File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 190, in _decode_cf_datetime_dtype
    raise ValueError(msg)
ValueError: unable to decode time units 'days since 20O1-1-1' with "calendar '360_day'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.

I wonder if a regex check would be useful to implement? This problem tripped me up for a while, and it was not at all obvious that an incorrect character (which looks almost identical, depending on fonts) was the root cause. Testing for a datestring that matches regex r"(?:[0-9][0-9])?[0-9][0-9]-(?:[0-1])?[0-9]-(?:[0-3])?[0-9]" could be a useful test to catch such a fringe case - and point out the issue obviously in the error message. It seems in the CF Conventions docs that there is little leeway in this format, so using "/" or alternative MM-DD-YYYY formats to the standard [YYY]Y-[M]M-[D]D HH:MM:SS.ss [-]0:00

And just because pydata/xarray#7144

The text was updated successfully, but these errors were encountered:

jswhit · 2022-10-11T17:33:10Z

A PR that catches this, and adds more leeway in the formatting (consistent with CF) would be welcome.

durack1 · 2022-10-11T20:59:12Z

@jswhit I am a little confused how the regex following the CF standards at https://github.com/Unidata/cftime/blob/master/src/cftime/_cftime.pyx#L45-L47 isn't catching this, it's an out-of-bounds character

I'm unfamiliar with this library, so would need some pointers to get started

jswhit · 2022-10-11T21:08:59Z

Sorry, can't be of much help - as the comment says this regex was lifted from http://delete.me.uk/2005/03/iso8601.html but apparently that link no longer exists. I know almost nothing about regexes.

davidhassell · 2022-10-12T09:46:51Z

Hi - I've looked at this regex in the (dim and distant) past - happy to have take a look here, if that's OK.

The quick test below shows the regex passing in the "letter O" case, but has too many Nones in its matched groups. I'll have a look at whether the regex could/should be updated, or if a check in cpdef _parse_date(datestring) (https://github.com/Unidata/cftime/blob/v1.6.2rel/src/cftime/_cftime.pyx#L750), which is where the regex is used, might be an alternative.

>>> import re
>>> ISO8601_REGEX = re.compile(r"(?P<year>[+-]?[0-9]+)(-(?P<month>[0-9]{1,2})(-(?P<day>[0-9]{1,2})"
                           r"(((?P<separator1>.)(?P<hour>[0-9]{1,2}):(?P<minute>[0-9]{1,2})(:(?P<second>[0-9]{1,2})(\.(?P<fraction>[0-9]+))?)?)?"
                           r"((?P<separator2>.?)(?P<timezone>Z|(([-+])([0-9]{2})((:([0-9]{2}))|([0-9]{2}))?)))?)?)?)?"
                           )  # From https://github.com/Unidata/cftime/blob/v1.6.2rel/src/cftime/_cftime.pyx#L45-L48
>>> ISO8601_REGEX.match('2001-01-01').groups()  # All numbers
('2001',
 '-01-01',
 '01',
 '-01',
 '01',
 '',
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None)
>>> ISO8601_REGEX.match('20O1-01-01').groups()  # Letter O
('20',
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None)

durack1 · 2022-10-12T17:28:03Z

Thanks @davidhassell! I hit some weird duplication issue in more simple regex I was using, it validated correctly, but then seemed to cycle over things again and again - https://regex101.com/ was really helpful in pointing that out to me

durack1 changed the title ~~Dealing with highly unusual netcdf time:units with unexpected characters~~ Dealing with netcdf time:units with unexpected characters Oct 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dealing with netcdf time:units with unexpected characters #296

Dealing with netcdf time:units with unexpected characters #296

durack1 commented Oct 8, 2022 •

edited

Loading

jswhit commented Oct 11, 2022

durack1 commented Oct 11, 2022

jswhit commented Oct 11, 2022

davidhassell commented Oct 12, 2022

durack1 commented Oct 12, 2022

Dealing with netcdf time:units with unexpected characters #296

Dealing with netcdf time:units with unexpected characters #296

Comments

durack1 commented Oct 8, 2022 • edited Loading

jswhit commented Oct 11, 2022

durack1 commented Oct 11, 2022

jswhit commented Oct 11, 2022

davidhassell commented Oct 12, 2022

durack1 commented Oct 12, 2022

durack1 commented Oct 8, 2022 •

edited

Loading