Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Arrow Dictionary indices to not prefer Unsigned #17327

Open
karthikeyann opened this issue Nov 14, 2024 · 2 comments · May be fixed by #17390
Open

[BUG] Arrow Dictionary indices to not prefer Unsigned #17327

karthikeyann opened this issue Nov 14, 2024 · 2 comments · May be fixed by #17390
Assignees
Labels
bug Something isn't working

Comments

@karthikeyann
Copy link
Contributor

karthikeyann commented Nov 14, 2024

Describe the bug
During converting data between arrow and cudf,
from_arrow converts dictionary indices to unsigned types. libcudf dictionary indices become unsigned int even if arrow has signed int.

auto const dict_indices_type = [&schema]() -> data_type {
// cudf dictionary requires an unsigned type for the indices,
// since it is invalid for an arrow dictionary to contain negative
// indices, we can safely use the unsigned equivalent without having
// to modify the buffers.
switch (schema->storage_type) {
case NANOARROW_TYPE_INT8:
case NANOARROW_TYPE_UINT8: return data_type(type_id::UINT8);
case NANOARROW_TYPE_INT16:
case NANOARROW_TYPE_UINT16: return data_type(type_id::UINT16);
case NANOARROW_TYPE_INT32:
case NANOARROW_TYPE_UINT32: return data_type(type_id::UINT32);

But this causes issues with round tripping to exact types.
https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout:~:text=Since%20unsigned%20integers,by%20an%20application.
Also the arrow docs prefers to use signed types.

Since unsigned integers can be more difficult to work with in some cases (e.g. in the JVM), we recommend preferring signed integers over unsigned integers for representing dictionary indices. Additionally, we recommend avoiding using 64-bit unsigned integer indices unless they are required by an application.

Starting a discussion here to update this preference.

Additional context
Velox does not support unsigned indices in dictionary column. while round tripping cudf table, this issue was found.

@karthikeyann karthikeyann added the bug Something isn't working label Nov 14, 2024
@davidwendt
Copy link
Contributor

I believe we could change libcudf dictionary to support signed indices only without breaking anyone honestly.
Or we could consider dropping dictionary support (throwing an exception) in the arrow interop since we cannot guarantee the type even works appropriately with all of libcudf at this point.

@karthikeyann
Copy link
Contributor Author

@davidwendt discussed and decided to switch dictionary column to signed indices.

@davidwendt davidwendt self-assigned this Nov 19, 2024
@davidwendt davidwendt linked a pull request Nov 20, 2024 that will close this issue
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants