Add ScalarValue::eq_array optimized comparison function #844
Conversation
@@ -277,6 +277,31 @@ impl std::hash::Hash for ScalarValue {
    }
}

// return the index into the dictionary values for array@index as well
// as a reference to the dictionary values array. Returns None for the
// index if the array is NULL at index
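For reference, here is a minimal sketch of a helper with the behavior described in that comment, assuming a recent arrow-rs `DictionaryArray` API; the function name and exact signature are illustrative, not necessarily the code in this diff:

```rust
use arrow::array::{Array, ArrayRef, DictionaryArray};
use arrow::datatypes::{ArrowDictionaryKeyType, ArrowNativeType};

// Illustrative sketch: return the dictionary's values array together with the
// resolved index for `array[index]`, or `None` for the index when that row is NULL.
fn get_dict_value<K: ArrowDictionaryKeyType>(
    array: &ArrayRef,
    index: usize,
) -> (&ArrayRef, Option<usize>) {
    let dict = array
        .as_any()
        .downcast_ref::<DictionaryArray<K>>()
        .expect("dictionary array");
    let keys = dict.keys();
    if !keys.is_valid(index) {
        // NULL entry: there is no index into the values array
        (dict.values(), None)
    } else {
        let values_index = keys.value(index).to_usize().expect("key fits in usize");
        (dict.values(), Some(values_index))
    }
}
```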
apache/arrow-rs#672 proposes adding this upstream in arrow
I think this properly handles null values now in the DictionaryArray, whereas my initial version did not
@@ -973,22 +1011,106 @@ impl ScalarValue {
        })
    }

    fn try_from_dict_array<K: ArrowDictionaryKeyType>(
refactored into get_dict_value
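A rough sketch of what that refactor could look like, building on the `get_dict_value` helper sketched earlier in this thread and DataFusion's existing `ScalarValue::try_from_array` / `TryFrom<&DataType>` conversions; the body below is illustrative, not the exact PR code:

```rust
use std::convert::TryFrom;

use arrow::array::{Array, ArrayRef};
use arrow::datatypes::ArrowDictionaryKeyType;
use datafusion::error::Result;
use datafusion::scalar::ScalarValue;

// Illustrative only: resolve the row's dictionary key, then build the scalar
// from the shared values array; a NULL row becomes a typed NULL scalar.
fn try_from_dict_array<K: ArrowDictionaryKeyType>(
    array: &ArrayRef,
    index: usize,
) -> Result<ScalarValue> {
    // `get_dict_value` is the helper sketched above
    let (values, values_index) = get_dict_value::<K>(array, index);
    match values_index {
        Some(values_index) => ScalarValue::try_from_array(values, values_index),
        // assumes ScalarValue: TryFrom<&DataType> builds a typed NULL scalar
        None => ScalarValue::try_from(values.data_type()),
    }
}
```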
Could it make sense to use the compute kernels here?
I am not sure @jorgecarleitao -- in this case there is only a single row (potentially multiple columns) being compared, so I am not sure if calling into the kernels would help at all
Ok, maybe I am misunderstanding, sorry, it has been a while. If I recall, we will need to perform this comparison for many rows. The implementation here compares a single row at a time. The suggestion to use the kernels was to use a vectorized comparison, which leverages aligned memory, no bound checks, and no type checking (i.e. no per item downcast). Sorry I do not have any code :/, was just a comment hinting at the opportunity to vectorize the operation.
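As a concrete illustration of the vectorized alternative being described (a generic arrow-rs example, assuming a version that still ships the `eq_scalar` comparison kernel; this is not code from this PR):

```rust
use arrow::array::Int32Array;
use arrow::compute::kernels::comparison::eq_scalar;

fn main() {
    let array = Int32Array::from(vec![Some(1), None, Some(3), Some(1)]);
    // one kernel call compares every row against the scalar and returns a BooleanArray
    let mask = eq_scalar(&array, 1).expect("comparison kernel");
    assert!(mask.value(0));
    assert!(!mask.value(2));
    assert!(mask.value(3));
}
```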
Yes, I think that is accurate. I would love to figure out how to use a vectorized approach. In #808 we have a hash table mapping (effectively):

for (hash, index) in ... {
    // look up entry in hash table for row `index`
    // check if the values at `[ArrayRefs]` @ `index` are the same as in the entry in the table (what this PR's code, `array_eq`, is used for)
    // ...
}

In order to vectorize this calculation, I think we would have to somehow vectorize both the lookup in the table as well as the comparison. I suppose we could potentially copy the existing keys out of the hash entries (so they are contiguous) to do a vectorized comparison, but my intuition is that the cost of this copy would outweigh any gains due to vectorization.
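To make the role of the per-row comparison concrete, here is a hedged sketch of the collision check described above, assuming the `eq_array` signature proposed in this PR (`fn eq_array(&self, array: &ArrayRef, index: usize) -> bool`); the surrounding hash-table code and the `group_matches` name are illustrative only:

```rust
use arrow::array::ArrayRef;
use datafusion::scalar::ScalarValue;

// Illustrative: confirm that row `index` of each group-by column matches the
// ScalarValues already stored in a hash table entry (a hash collision check).
fn group_matches(group_values: &[ScalarValue], columns: &[ArrayRef], index: usize) -> bool {
    group_values
        .iter()
        .zip(columns.iter())
        .all(|(scalar, array)| scalar.eq_array(array, index))
}
```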
I agree vectorizing that part can be hard. I think it means somehow delaying the collision handling and doing it for the full batch instead.
I think it's pretty hard, as @alamb mentions, to vectorize this part, as it also depends on the hashtable data structure (checking for collisions on insert). I think a fully vectorized algorithm should build the table and do collision handling in different loops.
I will also add some comments to the function explaining that it has a narrow use case and that the compute kernels should be preferred if at all possible.
LGTM also. Great work, @alamb
Force-pushed from 832803c to a2eec53
($array:expr, $index:expr, $ARRAYTYPE:ident, $VALUE:expr) => {{
    let array = $array.as_any().downcast_ref::<$ARRAYTYPE>().unwrap();
    let is_valid = array.is_valid($index);
    match $VALUE {
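For context, a hedged reconstruction of how the rest of this macro arm plausibly looks (the macro name and exact match arms are illustrative, not copied from the diff): a row equals a non-NULL scalar only when the slot is valid and the values match, and equals a NULL scalar only when the slot is itself null.

```rust
// Illustrative sketch of the full macro arm discussed above.
macro_rules! eq_array_primitive {
    ($array:expr, $index:expr, $ARRAYTYPE:ident, $VALUE:expr) => {{
        let array = $array.as_any().downcast_ref::<$ARRAYTYPE>().unwrap();
        let is_valid = array.is_valid($index);
        match $VALUE {
            // non-NULL scalar: row must be valid and hold an equal value
            Some(v) => is_valid && &array.value($index) == v,
            // NULL scalar: only a NULL row matches
            None => !is_valid,
        }
    }};
}
```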
can we avoid the match if `!is_valid`? Would that make any difference to performance?
I think it could if there was a specific `eq_array` implementation for non-null arrays. On most kernels / code, this has a non-negligible impact on performance.
The code path in the hash aggregate could then check whether the array contains 0 nulls and choose a different implementation if this is the case.
I think at this moment it might not have that much of an impact; maybe for the "easier" hash-aggregates with only a few groups it might have a higher relative impact.
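A minimal sketch of that idea, using a hypothetical per-row primitive comparison (the function name and concrete types are illustrative, not PR code): check the column's null count once and branch into a validity-free path when possible.

```rust
use arrow::array::{Array, Int32Array};

// Illustrative: compare row `index` against an optional scalar, skipping the
// per-row validity check entirely when the array is known to contain no nulls.
fn eq_row_to_scalar(array: &Int32Array, index: usize, scalar: Option<i32>) -> bool {
    if array.null_count() == 0 {
        // fast path: every slot is valid, so a NULL scalar can never match
        match scalar {
            Some(v) => array.value(index) == v,
            None => false,
        }
    } else {
        let is_valid = array.is_valid(index);
        match scalar {
            Some(v) => is_valid && array.value(index) == v,
            None => !is_valid,
        }
    }
}
```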
Filed #850 to track this suggestion
Force-pushed from a2eec53 to 49aa351
GitHub Actions appears to be having issues (https://www.githubstatus.com/), so no checks have run on this PR
🎉 Thanks @Dandandan
Which issue does this PR close?
Re #790 / part of #808 which can be reviewed independently
Rationale for this change
Comparing `ArrayRef` to `ScalarValue` is in the performance critical section and thus should be optimized. Creating `ScalarValue`s from the `ArrayRef`s is too slow and results in copying.
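To make the rationale concrete, here is a hypothetical before/after using the API added here (the `eq_array` signature is assumed from this PR; everything else is standard arrow / DataFusion usage):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, StringArray};
use datafusion::error::Result;
use datafusion::scalar::ScalarValue;

fn main() -> Result<()> {
    let array: ArrayRef = Arc::new(StringArray::from(vec![Some("a"), None, Some("b")]));
    let scalar = ScalarValue::Utf8(Some("b".to_string()));

    // slow path: copy row 2 out of the array into an owned ScalarValue, then compare
    let copied = ScalarValue::try_from_array(&array, 2)?;
    assert_eq!(copied, scalar);

    // fast path: compare the scalar directly against row 2, no ScalarValue is built
    assert!(scalar.eq_array(&array, 2));
    Ok(())
}
```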
What changes are included in this PR?
- Add `eq_array` function to `ScalarValue`
- `ScalarValue::try_from_array` for dictionary arrays

Are there any user-facing changes?
No