REF: Change _NULL_DESCRIPTION[datetime] to use NaT sentinel #47887

mroeschke · 2022-07-28T20:44:22Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.

xref data-apis/dataframe-api#74 (comment)

cc @vnlitvinov @jorisvandenbossche

jreback · 2022-07-30T00:45:14Z

no tests needed updating?

mroeschke · 2022-08-01T17:57:34Z

> no tests needed updating?

Looks like there was no existing test so added one

jorisvandenbossche · 2022-08-01T21:01:05Z

pandas/core/interchange/column.py

@@ -38,7 +38,7 @@

 _NULL_DESCRIPTION = {
    DtypeKind.FLOAT: (ColumnNullType.USE_NAN, None),
-    DtypeKind.DATETIME: (ColumnNullType.USE_NAN, None),
+    DtypeKind.DATETIME: (ColumnNullType.USE_SENTINEL, pd.NaT),


Not fully sure how this value will get used in practice, but should the sentinel be described in terms of the raw storage type (which is a 64-bit integer)?

cc @rgommers

Fair point. Since this value is public in Column.describe_null which states

Value : if kind is "sentinel value", the actual value.

I would interpret "actual" as the implementation value (pd.NaT), but I can also see how this could be the int64 value too.

Yeah, I think you could also say that pd.NaT is the public facing value, while the actual implementation value is the int, depending on how you interpret those terms.

So therefore I wanted to check it with actual code: what's the value in the actual buffer, and does it equate with the sentinel?
However, that doesn't seem easy (how should one interpret the Buffer of a Datetime column as a generic array?), and this might also be under-specified (which is more an issue for https://github.com/data-apis/dataframe-api).
It seems the specification doesn't explicitly mention how the buffer of a Datetime column should be interpreted, except for the usage of the Arrow string format, which implicitly indicates that it's indeed a signed integer (numpy's datetime64 is also basically an integer array, and I assume all involved packages use the same logic, so this might seem self-evident, but there are other ways to store datetimes).

I tried converting the buffer to a numpy array through DLPack, but that raises an error for a datetime column (only supports int/float). If we see the buffer as a datetime64 array, then pd.NaT is probably correct, if we see it as an int64 array, maybe the integer value is more correct.

Anyway, this is more something to clarify on the Data API side, so doesn't need to block this PR
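To make the two readings of the buffer concrete, here is a minimal sketch using only NumPy (not the interchange API itself). It assumes the datetime data is stored the way `datetime64[ns]` stores it, and the names `values`, `raw`, and `NAT_AS_INT` are illustrative only.

```python
import numpy as np

# A datetime64[ns] array with one missing value.
values = np.array(["2022-01-01", "NaT"], dtype="datetime64[ns]")

# Reinterpret the same memory as signed 64-bit integers (no copy is made).
raw = values.view("int64")

# NumPy stores NaT as the smallest int64 value.
NAT_AS_INT = np.iinfo(np.int64).min

# Seen as integers, nulls are found by comparing against the integer sentinel.
is_null_from_ints = raw == NAT_AS_INT

# Seen as datetimes, nulls are found with the dedicated NaT check.
is_null_from_datetimes = np.isnat(values)
```

Both masks flag the same position, which is why the question is mostly about which of the two spellings `describe_null` should report.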

It is going to be used to compare to the data buffer from the same column, so it looks to me like the only useful thing is the actual implementation value, not pd.NaT.

> but there are other ways to store datetimes).

That probably does not matter, because whatever the format for the data buffer is, this sentinel value should always be the exact same thing.
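As a rough illustration of a consumer comparing the sentinel against the data buffer, the sketch below uses only the standard library. It assumes the buffer holds native signed 64-bit integers and that a missing value is stored as the smallest int64; `null_mask` is a hypothetical helper, not part of pandas or the protocol.

```python
import struct

# Assumption: missing datetimes are stored as the smallest int64 value.
INT64_MIN = -(2**63)


def null_mask(raw_buffer, sentinel):
    """Return one bool per element, True where the element equals the sentinel."""
    # View the raw bytes as native signed 64-bit integers without copying.
    values = memoryview(raw_buffer).cast("q")
    return [v == sentinel for v in values]


# Two timestamps in nanoseconds since the epoch; the second one is missing.
raw = struct.pack("=2q", 1_640_995_200_000_000_000, INT64_MIN)
mask = null_mask(raw, INT64_MIN)
```

The comparison only works when the sentinel is expressed in the same type as the buffer elements, which is the argument for reporting the integer rather than `pd.NaT`.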

@honno do you have a test for dataframes with a datetime column already? If not, can you add one? EDIT: not roundtripping twice, but just to another library and testing there that the values and nulls are correct.

> @honno do you have a test for dataframes with a datetime column already? If not, can you add one?

Generating dataframes with datetime columns was supported, but it seemed like every adopter had intentionally left it as a TODO in their interchange implementations (e.g. I'd get NotImplementedError whenever trying to interchange them), so I disabled them. I'll write a tracker issue on the test suite's repo to see where we're at.

> > but there are other ways to store datetimes).
>
> That probably does not matter, because whatever the format for the data buffer is, this sentinel value should always be the exact same thing.

Yes, that's certainly true for the sentinel. But I was thinking more broadly than the original sentinel discussion, about how to interpret the buffer of a Datetime column in general. Because the Arrow string format is used to describe the concrete type, people can assume it follows the definitions of Arrow (thus int64 or int32), but maybe we should mention that more explicitly in the specification.
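The int64/int32 reading can be sketched as a small lookup, assuming the format strings follow the Arrow C data interface (`tdD`, `tdm`, and `ts{s,m,u,n}:timezone`). The helper `describe_storage` is hypothetical and only covers dates and timestamps.

```python
# Assumption: format strings follow the Arrow C data interface.
_DATE_STORAGE = {
    "tdD": ("int32", "days"),          # date32
    "tdm": ("int64", "milliseconds"),  # date64
}
_TIMESTAMP_UNITS = {
    "s": "seconds",
    "m": "milliseconds",
    "u": "microseconds",
    "n": "nanoseconds",
}


def describe_storage(format_str):
    """Return (storage type, unit, timezone or None) for a temporal format string."""
    if format_str in _DATE_STORAGE:
        storage, unit = _DATE_STORAGE[format_str]
        return (storage, unit, None)
    # Timestamps look like "tsn:" or "tsu:UTC": unit letter, colon, optional timezone.
    if (
        len(format_str) >= 4
        and format_str.startswith("ts")
        and format_str[2] in _TIMESTAMP_UNITS
        and format_str[3] == ":"
    ):
        timezone = format_str[4:] or None
        return ("int64", _TIMESTAMP_UNITS[format_str[2]], timezone)
    raise ValueError(f"not a date or timestamp format string: {format_str!r}")
```

For example, `describe_storage("tsn:")` gives `("int64", "nanoseconds", None)`, which matches how a timezone-naive nanosecond column would be stored.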

jorisvandenbossche · 2022-08-02T08:27:24Z

pandas/tests/interchange/test_impl.py

+    assert col.size == 2
+    assert col.null_count == 1
+    assert col.dtype[0] == DtypeKind.DATETIME
+    assert col.describe_null == (ColumnNullType.USE_SENTINEL, pd.NaT)


Can you also test the roundtrip here to ensure we correctly handle it in from_dataframe? (unless that is already tested elsewhere for NaT?)

(Although we currently ignore the describe_null indication for datetime anyway, which we should maybe update to raise an error if it is anything else?)

Sure, added the roundtrip test for this dataframe with pd.NaT.

Currently, it looks like we raise NotImplementedError if the null type is not USE_SENTINEL, USE_BITMASK, USE_BYTEMASK, NON_NULLABLE, or USE_NAN.
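The dispatch described here might look roughly like the following sketch. It is not the pandas implementation: `ColumnNullType` is a local stand-in that mirrors the protocol's enum values, and `null_positions` with its parameters is a hypothetical helper working on plain Python sequences.

```python
import enum


class ColumnNullType(enum.IntEnum):
    # Local stand-in mirroring the interchange protocol's enum values.
    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4


def null_positions(null_kind, null_value, data, validity=None):
    """Return one bool per element of `data`, True where the element is missing.

    `null_value` is the second item of describe_null: the sentinel itself, or
    the mask value (0 or 1) that marks a missing entry.
    """
    if null_kind == ColumnNullType.NON_NULLABLE:
        return [False] * len(data)
    if null_kind == ColumnNullType.USE_NAN:
        return [v != v for v in data]  # NaN is the only value unequal to itself
    if null_kind == ColumnNullType.USE_SENTINEL:
        return [v == null_value for v in data]
    if null_kind == ColumnNullType.USE_BYTEMASK:
        return [validity[i] == null_value for i in range(len(data))]
    if null_kind == ColumnNullType.USE_BITMASK:
        # Assumption: bits are packed least-significant bit first, as in Arrow.
        return [
            ((validity[i >> 3] >> (i & 7)) & 1) == null_value
            for i in range(len(data))
        ]
    raise NotImplementedError(f"unsupported null kind: {null_kind!r}")
```

Any kind outside the five known ones falls through to the final `raise`, which is the behavior the comment above describes.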

mroeschke · 2022-08-09T05:02:38Z

Thanks for the feedback everyone. For the pandas implementation, I changed our sentinel to use the integer representation of pd.NaT

…ev#47887) * REF: Change _NULL_DESCRIPTION[datetime] to use NaT sentinel * Add test * Add roundtrip test and change to use iNaT

REF: Change _NULL_DESCRIPTION[datetime] to use NaT sentinel

c687363

mroeschke added the Interchange Dataframe Interchange Protocol label Jul 28, 2022

mroeschke added this to the 1.5 milestone Jul 28, 2022

Merge remote-tracking branch 'upstream/main' into exchange/nat

18c249f

mroeschke added 2 commits August 1, 2022 10:33

Merge remote-tracking branch 'upstream/main' into exchange/nat

ff04fee

Add test

3ad429e

jorisvandenbossche reviewed Aug 1, 2022


jorisvandenbossche reviewed Aug 2, 2022


mroeschke mentioned this pull request Aug 8, 2022

Declare enums explicitly, fix type hints data-apis/dataframe-api#74

Merged

mroeschke added 2 commits August 8, 2022 21:40

Merge remote-tracking branch 'upstream/main' into exchange/nat

1b5be67

Add roundtrip test and change to use iNaT

c8075c0

jorisvandenbossche approved these changes Aug 10, 2022


mroeschke merged commit 5b42542 into pandas-dev:main Aug 10, 2022

mroeschke deleted the exchange/nat branch August 10, 2022 18:33

honno mentioned this pull request Sep 5, 2022

BUG: Interchange Column.size is a property, not a method #48392

Closed

3 tasks


jorisvandenbossche Aug 1, 2022

jorisvandenbossche Aug 1, 2022

mroeschke Aug 1, 2022 (edited)

jorisvandenbossche Aug 2, 2022

rgommers Aug 2, 2022

rgommers Aug 2, 2022 (edited)

honno Aug 2, 2022

jorisvandenbossche Aug 2, 2022

jorisvandenbossche Aug 2, 2022

mroeschke Aug 9, 2022
