-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove ScalarValue::Dictionary #12488
Remove ScalarValue::Dictionary #12488
Conversation
`ScalarValue` should be a container for a single nullable logical type and should not be concerned by various physical encodings used in arrays. It doesn't involve arrays even as part of internal representation (like arrow `Scalar` does). Remove `Dictionary` option from the `ScalarValue`. Note that `ScalarValue` doesn't have mapping for RLE arrays, while there are very similar type to dictionaries. Having `ScalarValue` follow complexity of various physical representations poses a burden on UDF implementers. They need to handle all equivalent scalar values.
This won't pass the test yet, but creating the PR to have a conversation first. cc @alamb @comphead especially if we go with @notfilippo's #11513 , we need to have an answer to: what is a type? what isn't a type? what's logical and what's physical. Do we need scalar value to represent "would be dictionary but is single value" (being removed here)? To me those are properties of physical representation of series of values, that are not attributes of a single value, so ScalarValue doesn't have to be concerned about them. I think ScalarValue reflects some of them because we expect it to define type, and we expect type to sometimes define physical representation. This is blurry. Side note: |
Just posting it here for reference as this PR overlaps with some of the work in this other PR: #11978. I'm currently in the process of resolving merge conflicts accumulated during my vacation time. |
This is awesome @notfilippo! I wasn't aware of that work! In any case, how can i help with this work? |
Thanks @findepi
Definitely as it would make it easier to review. I opened the PR just to make sure that the same approach would be applicable as we remove more types (which i presume was very successful as we were able to identify some key issues that need attention). Once we reach consensus I was planning splitting my work into smaller PRs.
Would really appreciate to get more eyes on my PR :) |
From my point of view, there is pretty broad consensus that having ScalarValue mirror the physical representation (e.g. The challenge is, as @findepi implies, figuring out how to make the changes incrementally (both to keep the change possible to review as well as to give downstream consumers time to absorb the change) |
BTW I think the most important potential concern for this change is that arrays will be cast at runtime (e.g. when comparing a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @findepi and @notfilippo . TLDIR is I think this is absolutely the right direction.
Sorry for the delay in reviewing.
To me those are properties of physical representation of series of values, that are not attributes of a single value, so ScalarValue doesn't have to be concerned about them.
I agree
I think ScalarValue reflects some of them because we expect it to define type, and we expect type to sometimes define physical representation. This is blurry.
I also agree it is blurry. I believe that DataFusion uses the Arrow type system for logical types initially for convenience -- at that time there was only Dictionary
(REE, StringView were added later). So while it was annoying to handle one special case (Dictionary), it was manageable compared to all the other stuff going on
In my opinion if DataFusion was being implemented again starting today, it would not use Arrow::DataType directly for its logical type system but instead would have a more generic LogicalDataType
similarly to what is proposed by @notfilippo
As you show in this PR, even the one special case for Dictionary
is quite substantial. Also our experience adding support for StringView involved many changes that have nothing to do with the physical representation (#12503 from @my-vegetable-has-exploded just this morning is a good example)
if we want scalar value to represent all possible aspects of array representation then arrow's builtin Scalar seems to be ready for that.
Yes, I agree. The reason that DataFusion doesn't use the arrow Datum yet is twofold from my perspective:
ScalarValue
predatesDatum
soDatum
wasn't an optionDatum
(single row array) is quite a bit less efficient thanScalarValue
(e.g. a single row StringArray will have several allocations for buffers, offsets, etc)
@findepi Note that ScalarValue in #11978 is not the final state of #11513, it is a step toward the #11513. I think we could have LogicalType without any Arrow's DataType contained in it in the future, and it would be the ideal type what you mentioned. |
for function execution that sounds like a dead end. in my free time i am trying to cook something for function calls. |
@notfilippo @alamb do you see any value in this PR, or should we just close it? |
I am not sure about the code in the PR, but the idea seems like a step in the right direction (after #12536) |
ScalarValue
should be a container for a single nullable logical type and should not be concerned by various physical encodings used in arrays. It doesn't involve arrays even as part of internal representation (like arrowScalar
does).Remove
Dictionary
option from theScalarValue
. Note thatScalarValue
doesn't have mapping for RLE arrays, while there are very similar type to dictionaries.Having
ScalarValue
follow complexity of various physical representations poses a burden on UDF implementers. They need to handle all equivalent scalar values.