Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Change _NULL_DESCRIPTION[datetime] to use NaT sentinel #47887

Merged
merged 6 commits into from
Aug 10, 2022

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke added the Interchange Dataframe Interchange Protocol label Jul 28, 2022
@mroeschke mroeschke added this to the 1.5 milestone Jul 28, 2022
@jreback
Copy link
Contributor

jreback commented Jul 30, 2022

no tests needed updating?

@mroeschke
Copy link
Member Author

no tests needed updating?

Looks like there was no existing test so added one

@@ -38,7 +38,7 @@

_NULL_DESCRIPTION = {
DtypeKind.FLOAT: (ColumnNullType.USE_NAN, None),
DtypeKind.DATETIME: (ColumnNullType.USE_NAN, None),
DtypeKind.DATETIME: (ColumnNullType.USE_SENTINEL, pd.NaT),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fully sure how this value will get used in practice, but should the sentinel being described in terms of the raw storage type? (which is a 64bit integer)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@mroeschke mroeschke Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. Since this value is public in Column.describe_null which states

Value : if kind is "sentinel value", the actual value.

I would interpret "actual" as the implementation value (pd.NaT), but I can also see how this could be the int64 value too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think you could also say that pd.NaT is the public facing value, while the actual implementation value is the int, depending on how you interpret those terms.

So therefore I wanted to check it with actual code: what's the value in the actual buffer, and does it equate with the sentinel?
However, that doesn't seem easy (how should one interpret the Buffer of a Datetime column as a generic array?), and this might also be under-specified (which is more an issue for https://github.com/data-apis/dataframe-api).
It seems the specification doesn't explicitly mention how the buffer of a Datetime column should be interpreted, except for the usage of the Arrow string format, so which implicitly indicates that it's indeed a signed integer (numpy's datetime64 is also basically an integer array, and I assume all involved packages use the same logic, so this might seem self-evident, but there are other ways to store datetimes).

I tried converting the buffer to a numpy array through DLPack, but that raises an error for a datetime column (only supports int/float). If we see the buffer as a datetime64 array, then pd.NaT is probably correct, if we see it as an int64 array, maybe the integer value is more correct.

Anyway, this is more something to clarify on the Data API side, so doesn't need to block this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is going to be used to compare to the data buffer from the same column, so it looks to me like the only useful thing is the actual implementation value, not pd.NaT.

but there are other ways to store datetimes).

That probably does not matter, because whatever the format for the data buffer is, this sentinel value should always be the exact same thing.

Copy link
Contributor

@rgommers rgommers Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@honno do you have a test for dataframes with a datetime column already? If not, can you add one? EDIT: not roundtripping twice, but just to another library and testing there that the values and nulls are correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@honno do you have a test for dataframes with a datetime column already? If not, can you add one?

Generating dataframes with datetime columns was supported, but it seemed like every adopter had intentionally left it as a TODO in their interchange implementations (e.g. I'd get NotImplementedError whenever trying to interchange them), so I disabled them. I'll write a tracker issue on the test suite's repo to see where we're at.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but there are other ways to store datetimes).

That probably does not matter, because whatever the format for the data buffer is, this sentinel value should always be the exact same thing.

Yes, that's certainly true for the sentinel. But I was thinking more broadly than the original sentinel discussion, in general about how to interpret the buffer of a Datetime column. Because the Arrow string format is used to describe the concrete type, people can assume it follows the the definitions of Arrow (thus int64 or int32), but maybe we should more explicitly mention that in the specification.

assert col.size == 2
assert col.null_count == 1
assert col.dtype[0] == DtypeKind.DATETIME
assert col.describe_null == (ColumnNullType.USE_SENTINEL, pd.NaT)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also test the roundtrip here to ensure we correctly handle it in from_dataframe? (unless that is already tested elsewhere for NaT?)

(although we currently ignore the describe_null indication for datetime anyway .. Which we maybe should update to error if it is anything else?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, added the roundtrip test for this dataframe with pd.NaT.

Currently, looks like we raise NotImplementedError if the the null type is not USE_SENTINEL, USE_BITMASK, USE_BYTEMASK, NON_NULLABLE, or USE_NAN

@mroeschke
Copy link
Member Author

Thanks for the feedback everyone. For the pandas implementation, I changed our sentinel to use the integer representation of pd.NaT

@mroeschke mroeschke merged commit 5b42542 into pandas-dev:main Aug 10, 2022
@mroeschke mroeschke deleted the exchange/nat branch August 10, 2022 18:33
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
…ev#47887)

* REF: Change _NULL_DESCRIPTION[datetime] to use NaT sentinel

* Add test

* Add roundtrip test and change to use iNaT
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Interchange Dataframe Interchange Protocol
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants