-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: Change _NULL_DESCRIPTION[datetime] to use NaT sentinel #47887
Conversation
no tests needed updating? |
Looks like there was no existing test so added one |
pandas/core/interchange/column.py
Outdated
@@ -38,7 +38,7 @@ | |||
|
|||
_NULL_DESCRIPTION = { | |||
DtypeKind.FLOAT: (ColumnNullType.USE_NAN, None), | |||
DtypeKind.DATETIME: (ColumnNullType.USE_NAN, None), | |||
DtypeKind.DATETIME: (ColumnNullType.USE_SENTINEL, pd.NaT), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not fully sure how this value will get used in practice, but should the sentinel being described in terms of the raw storage type? (which is a 64bit integer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @rgommers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. Since this value is public in Column.describe_null
which states
Value : if kind is "sentinel value", the actual value.
I would interpret "actual" as the implementation value (pd.NaT
), but I can also see how this could be the int64 value too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think you could also say that pd.NaT is the public facing value, while the actual implementation value is the int, depending on how you interpret those terms.
So therefore I wanted to check it with actual code: what's the value in the actual buffer, and does it equate with the sentinel?
However, that doesn't seem easy (how should one interpret the Buffer of a Datetime column as a generic array?), and this might also be under-specified (which is more an issue for https://github.com/data-apis/dataframe-api).
It seems the specification doesn't explicitly mention how the buffer of a Datetime column should be interpreted, except for the usage of the Arrow string format, so which implicitly indicates that it's indeed a signed integer (numpy's datetime64 is also basically an integer array, and I assume all involved packages use the same logic, so this might seem self-evident, but there are other ways to store datetimes).
I tried converting the buffer to a numpy array through DLPack, but that raises an error for a datetime column (only supports int/float). If we see the buffer as a datetime64 array, then pd.NaT is probably correct, if we see it as an int64 array, maybe the integer value is more correct.
Anyway, this is more something to clarify on the Data API side, so doesn't need to block this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is going to be used to compare to the data buffer from the same column, so it looks to me like the only useful thing is the actual implementation value, not pd.NaT
.
but there are other ways to store datetimes).
That probably does not matter, because whatever the format for the data buffer is, this sentinel value should always be the exact same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@honno do you have a test for dataframes with a datetime column already? If not, can you add one? EDIT: not roundtripping twice, but just to another library and testing there that the values and nulls are correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@honno do you have a test for dataframes with a datetime column already? If not, can you add one?
Generating dataframes with datetime columns was supported, but it seemed like every adopter had intentionally left it as a TODO in their interchange implementations (e.g. I'd get NotImplementedError
whenever trying to interchange them), so I disabled them. I'll write a tracker issue on the test suite's repo to see where we're at.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but there are other ways to store datetimes).
That probably does not matter, because whatever the format for the data buffer is, this sentinel value should always be the exact same thing.
Yes, that's certainly true for the sentinel. But I was thinking more broadly than the original sentinel discussion, in general about how to interpret the buffer of a Datetime column. Because the Arrow string format is used to describe the concrete type, people can assume it follows the the definitions of Arrow (thus int64 or int32), but maybe we should more explicitly mention that in the specification.
assert col.size == 2 | ||
assert col.null_count == 1 | ||
assert col.dtype[0] == DtypeKind.DATETIME | ||
assert col.describe_null == (ColumnNullType.USE_SENTINEL, pd.NaT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also test the roundtrip here to ensure we correctly handle it in from_dataframe? (unless that is already tested elsewhere for NaT?)
(although we currently ignore the describe_null indication for datetime anyway .. Which we maybe should update to error if it is anything else?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, added the roundtrip test for this dataframe with pd.NaT
.
Currently, looks like we raise NotImplementedError
if the the null type is not USE_SENTINEL, USE_BITMASK, USE_BYTEMASK, NON_NULLABLE, or USE_NAN
Thanks for the feedback everyone. For the pandas implementation, I changed our sentinel to use the integer representation of |
…ev#47887) * REF: Change _NULL_DESCRIPTION[datetime] to use NaT sentinel * Add test * Add roundtrip test and change to use iNaT
xref data-apis/dataframe-api#74 (comment)
cc @vnlitvinov @jorisvandenbossche