
Relax nanosecond datetime restriction in CF time decoding #9618

Open
wants to merge 68 commits into base: main

Conversation

kmuehlbauer
Contributor

@kmuehlbauer kmuehlbauer commented Oct 13, 2024

This is another attempt to resolve #7493. This goes a step further than #9580.

The idea of this PR is to automatically infer the needed resolutions for decoding/encoding and only keep the constraints pandas imposes ("s" - lowest resolution, "ns" - highest resolution). There is still the idea of a default resolution, but this should only take precedence if it doesn't clash with the automatic inference. This can be discussed, though. Update: As a first try I've implemented a time_unit kwarg to set a default resolution on decode, which overrides the inferred resolution only towards higher resolution (e.g. 's' -> 'ns').

For sanity checking, and also for my own good, I've created a documentation page on time-coding in the internal dev section. Any suggestions (especially grammar) or ideas for enhancements are much appreciated.

There still might be room for consolidation of functions/methods (mostly in coding/times.py), but I have to leave it alone for some days. I went down that rabbit hole and need to relax, too 😬.

Looking forward to getting your insights here, @spencerkclark, @ChrisBarker-NOAA, @pydata/xarray.

Todo:

  • floating point handling
  • Handling in Variable constructor
  • update decoding tests to iterate over time_units (where appropriate)
  • ...
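The inference-plus-clamping idea described above could be sketched roughly like this. This is a hypothetical helper for illustration, not the PR's actual implementation; the function name `clamp_resolution` is invented here:

```python
# Hypothetical sketch: clamp an inferred datetime resolution into the
# range pandas supports ("s" coarsest .. "ns" finest), letting a default
# resolution override the inference only towards a finer unit.
_UNITS = ["s", "ms", "us", "ns"]  # coarsest to finest, pandas-supported

def clamp_resolution(inferred: str, default: str = "ns") -> str:
    """Return the finer of the inferred and default resolutions,
    restricted to pandas-supported units."""
    if inferred not in _UNITS:
        # anything coarser than "s" (e.g. "D" or "h") falls back to "s"
        inferred = "s"
    # the default takes precedence only when it is finer than the inference
    return max(inferred, default, key=_UNITS.index)

print(clamp_resolution("s", "ns"))   # -> ns
print(clamp_resolution("us", "s"))   # -> us
```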

@kmuehlbauer
Contributor Author

Nice, mypy 1.12 is out and breaks our typing, 😭.

@TomNicholas
Member

Nice, mypy 1.12 is out and breaks our typing, 😭

Can we pin it in the CI temporarily?

@TomNicholas TomNicholas mentioned this pull request Oct 14, 2024
@kmuehlbauer
Contributor Author

Can we pin it in the CI temporarily?

Yes, 1.11.2 was the last version.

@kmuehlbauer kmuehlbauer marked this pull request as ready for review October 14, 2024 18:05
@kmuehlbauer
Contributor Author

This is now ready for a first round of review. I think it is already in quite a usable state.

But no rush, this should be thoroughly tested.

@spencerkclark
Member

Sounds good @kmuehlbauer! I’ll try and take an initial look this weekend.

@ChrisBarker-NOAA

create a nice example how to handle these difficulties?

Sure -- where would be a good home for that?

@kmuehlbauer
Contributor Author

Not sure, but https://docs.xarray.dev/en/stable/user-guide/time-series.html could have a dedicated floating point date section.

@kmuehlbauer kmuehlbauer changed the title Relax nanosecond datetime restriction in CF time coding Relax nanosecond datetime restriction in CF time decoding Nov 21, 2024
@kmuehlbauer
Contributor Author

I've added a time_unit kwarg to decode_cf and the subsequent functionality.

But instead of adding that kwarg, we could slightly overload decode_times to take one of "s", "ms", "us", "ns", with "ns" as default.

This would have the positive effect that we wouldn't need the additional kwarg and wouldn't have to distribute it through the backends.

  • decode_times=None - directs to decode_times=True
  • decode_times=False - no decoding
  • decode_times=True - decode times with default value ("ns")
  • decode_times="s" - decode times to at least "s"
  • decode_times="ms" - decode times to at least "ms"
  • decode_times="us" - decode times to at least "us"
  • decode_times="ns" - decode times to "ns"

We could guard decode_times=None and decode_times=True with a DeprecationWarning and announce our new default in the warning message (e.g. "us").

This methodology would be fully backwards compatible. It advertises the change via DeprecationWarning in normal operation and also if issues appear in the decoding steps.

If this is something which makes sense @shoyer, @dcherian, @spencerkclark, I'd add the needed changes to this PR.
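The proposed mapping of decode_times values could be sketched as follows. The function name `resolve_decode_times` and the deprecation wording are illustrative only, not xarray API:

```python
import warnings

VALID_UNITS = ("s", "ms", "us", "ns")

def resolve_decode_times(decode_times):
    """Return (should_decode, target_unit) under the proposed overloading
    of decode_times. Sketch only; names are not part of xarray."""
    if decode_times is False:
        return False, None
    if decode_times is None or decode_times is True:
        # guard the old defaults with a DeprecationWarning, as proposed
        warnings.warn(
            "decode_times=True/None currently decodes to 'ns'; a coarser "
            "default (e.g. 'us') may be used in a future version.",
            DeprecationWarning,
        )
        return True, "ns"  # current default resolution
    if decode_times in VALID_UNITS:
        return True, decode_times  # decode to at least this resolution
    raise ValueError(
        f"decode_times must be bool, None or one of {VALID_UNITS}"
    )
```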

@dcherian
Contributor

Alternatively, we could make small progress on #4490 and have

from xarray.coding import DatetimeCoder

ds = xr.open_mfdataset(..., decode_times=DatetimeCoder(units="ms"))

In the long term, it seems nice to have the default use the "natural" units i.e. "h" for units="hours since ..." and apparently even "M" for units=months since ... (!!)

https://numpy.org/doc/stable/reference/arrays.datetime.html#basic-datetimes
The date units are years (‘Y’), months (‘M’), weeks (‘W’), and days (‘D’), while the time units are hours (‘h’), minutes (‘m’), seconds (‘s’), milliseconds (‘ms’),

@kmuehlbauer
Contributor Author

Alternatively, we could make small progress on #4490 and have

from xarray.coding import DatetimeCoder

ds = xr.open_mfdataset(..., decode_times=DatetimeCoder(units="ms"))

This took a while to sink in 😉 Yes, that's a neat move. I'll incorporate this suggestion.

In the long term, it seems nice to have the default use the "natural" units i.e. "h" for units="hours since ..." and apparently even "M" for units=months since ... (!!)

As long as we use pd.Timestamp for parsing the time unit specification (e.g. seconds since 1992-10-8 15:15:42.5 -6:00), we can only do this for the units pd.Timestamp supports (‘s’, ‘ms’, ‘us’, and ‘ns’). We could add some code that checks whether the data can be represented in "days" or "hours" (as given in the time unit specification) and convert after the parsing. Not sure how much is involved. And this won't work for indexes, as those are restricted to (‘s’, ‘ms’, ‘us’, and ‘ns’).
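A small illustration of that restriction, assuming pandas >= 2.0 (where `Timestamp.as_unit` and non-nanosecond resolutions exist):

```python
import numpy as np
import pandas as pd

# pd.Timestamp only supports the "s", "ms", "us" and "ns" resolutions, so
# a reference date with fractional seconds needs at least "ms":
ref = pd.Timestamp("1992-10-08 15:15:42.5-06:00").as_unit("ms")
print(ref, ref.unit)

# numpy itself also allows coarser units like "D" or "h" ...
day = np.datetime64("2024-01-01", "D")
print(day.dtype)  # datetime64[D]

# ... but a pandas index converts them to one of the supported resolutions:
idx = pd.DatetimeIndex(np.array(["2024-01-01"], dtype="datetime64[D]"))
print(idx.dtype)
```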

@spencerkclark
Member

+1 for fewer arguments to open_dataset, however that is achieved!

Indeed for those practical reasons I do not think it is worth trying to match the on-disk units of integer data any more closely. Second precision already allows for a time span of roughly +/- 290 billion years (many times older than the Earth), which I think is plenty for most applications :).

Monthly or yearly units are also somewhat awkward to deal with due to their different (albeit often violated) definition in the CF conventions.
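A quick back-of-the-envelope check of the second-precision span mentioned above (int64 seconds on either side of the epoch):

```python
# 64-bit signed seconds cover roughly +/- 292 billion years:
seconds_per_year = 365.25 * 24 * 3600
span_years = 2**63 / seconds_per_year
print(f"{span_years:.3e}")  # on the order of 2.9e11 years either side
```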

@qq492947833

qq492947833 commented Nov 24, 2024

Same question: some CMIP6 datasets extend to the year 2300, but currently xarray can only read dates before 2262. So if you can fix this problem, it will be very helpful to us. Thanks a lot!
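For context, the 2262 limit comes from pandas' nanosecond resolution; a coarser resolution easily covers 2300:

```python
import numpy as np
import pandas as pd

# At nanosecond resolution, the representable range ends in April 2262:
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# At second resolution, the same 64-bit range reaches far beyond 2300:
print(np.datetime64("2300-01-01", "s"))  # 2300-01-01T00:00:00
```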

@ChrisBarker-NOAA

In the long term, it seems nice to have the default use the "natural" units i.e. "h" for units="hours since ..." and apparently even "M" for units=months since ... (!!)

While appealing, I think this is not a good idea. A couple of points:

NOTE: I got a bit lost in the discussion, you all may have already come to these same conclusions, but I thought I'd capture it here in one big post ;-)

"months" and "years" are NOT recommended by CF, as they are not clearly defined timespans (though UDUNITS does have a definition for them, which is the average -- e.g. 365.25 days to the year, or thereabouts).

As for days (or even hours):

  1. most of the time, these are used with a floating point type anyway.

  2. if an integer type -- that does not mean that the unit matches the required / desired precision.

  • One of the limitations of the CF encoding of time is that it's inherently a continuum (to the precision used), and that doesn't change with the units.
  • For instance, a user might want daily data (maybe a daily average temperature) or monthly, or ... -- but there is no way to actually express that directly with a CF time [*]. That is:
    unit: "days since 01-01-2024"
    values: [0, 1, 2, 3, 4]

looks like it's expressing Jan 1, 2, 3, 4, ....

But what that actually means is:

01-01-2024T00:00:00
01-02-2024T00:00:00
01-03-2024T00:00:00
01-04-2024T00:00:00
...

That is, the zeroth time of each day -- i.e. a specific point in the time continuum.

And this maps to what all the datetime objects (that I know of) do too (python datetime.date notwithstanding -- and I don't think it does months).

If a user does want a way to express the "day", they might do:

unit: "days since 01-01-2024T12:00:00"
values: [0, 1, 2, 3, 4]

That is, noon of each day.

But then we can't use days as the unit with the fixed epoch of numpy datetime64.

Anyway, all this to say -- I don't think that there is ever a use case for using numpy datetime units longer than a second, certainly not by assuming something from the units of the time.

Using seconds as a default for any encoding of seconds or longer seems reasonable to me, though. But is there any real loss to using milliseconds?

[*] The way to express, e.g. a daily average, is to use "cell bounds", specifically defining the bounds of the average.
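The midnight-vs-noon point above can be shown with a minimal numpy sketch (plain datetime64 arithmetic, not xarray's CF decoding):

```python
import numpy as np

# "days since 2024-01-01" with integer values [0..4] decodes to the
# midnight *instants* of Jan 1-5, not to whole days:
values = np.arange(5)
midnights = np.datetime64("2024-01-01T00:00:00") + values * np.timedelta64(1, "D")
print(midnights[1])  # 2024-01-02T00:00:00

# Shifting the epoch to noon expresses "the noon of each day" instead:
noons = np.datetime64("2024-01-01T12:00:00") + values * np.timedelta64(1, "D")
print(noons[0])  # 2024-01-01T12:00:00
```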

@kmuehlbauer
Contributor Author

Thanks @ChrisBarker-NOAA, I think we should move all these valuable comments in this PR into the docs somehow. I can take a look when this one is finalized.

@rabernat
Contributor

Properly supporting datetime intervals (rather than just instants) feels like it would solve so many semantic problems. We've been discussing that for years. I hope that it's now feasible post custom indexes refactor. But that's probably off topic for this thread...

@ChrisBarker-NOAA

Properly supporting datetime intervals (rather than just instants) feels like it would solve so many semantic problem

Absolutely -- but yes, a whole other topic :-)
