Add ScalarValue::eq_array optimized comparison function #844
Conversation
@@ -277,6 +277,31 @@ impl std::hash::Hash for ScalarValue {
    }
}

// return the index into the dictionary values for array@index as well
// as a reference to the dictionary values array. Returns None for the
// index if the array is NULL at index
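For reference, here is a minimal sketch of a helper with the behavior described in that comment, assuming a recent arrow-rs `DictionaryArray` API; the function name and exact signature are illustrative, not necessarily the code in this diff:

```rust
use arrow::array::{Array, ArrayRef, DictionaryArray};
use arrow::datatypes::{ArrowDictionaryKeyType, ArrowNativeType};

// Illustrative sketch: return the dictionary's values array together with the
// resolved index for `array[index]`, or `None` for the index when that row is NULL.
fn get_dict_value<K: ArrowDictionaryKeyType>(
    array: &ArrayRef,
    index: usize,
) -> (&ArrayRef, Option<usize>) {
    let dict = array
        .as_any()
        .downcast_ref::<DictionaryArray<K>>()
        .expect("dictionary array");
    let keys = dict.keys();
    if !keys.is_valid(index) {
        // NULL entry: there is no index into the values array
        (dict.values(), None)
    } else {
        let values_index = keys.value(index).to_usize().expect("key fits in usize");
        (dict.values(), Some(values_index))
    }
}
```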
apache/arrow-rs#672 proposes adding this upstream in arrow
I think this properly handles null values now in the DictionaryArray, whereas my initial version did not
@@ -973,22 +1011,106 @@ impl ScalarValue {
        })
    }

    fn try_from_dict_array<K: ArrowDictionaryKeyType>(
refactored into get_dict_value
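A rough sketch of what that refactor could look like, building on the `get_dict_value` helper sketched earlier in this thread and DataFusion's existing `ScalarValue::try_from_array` / `TryFrom<&DataType>` conversions; the body below is illustrative, not the exact PR code:

```rust
use std::convert::TryFrom;

use arrow::array::{Array, ArrayRef};
use arrow::datatypes::ArrowDictionaryKeyType;
use datafusion::error::Result;
use datafusion::scalar::ScalarValue;

// Illustrative only: resolve the row's dictionary key, then build the scalar
// from the shared values array; a NULL row becomes a typed NULL scalar.
fn try_from_dict_array<K: ArrowDictionaryKeyType>(
    array: &ArrayRef,
    index: usize,
) -> Result<ScalarValue> {
    // `get_dict_value` is the helper sketched above
    let (values, values_index) = get_dict_value::<K>(array, index);
    match values_index {
        Some(values_index) => ScalarValue::try_from_array(values, values_index),
        // assumes ScalarValue: TryFrom<&DataType> builds a typed NULL scalar
        None => ScalarValue::try_from(values.data_type()),
    }
}
```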
Could it make sense to use the compute kernels here?
I am not sure @jorgecarleitao -- in this case there is only a single row (potentially multiple columns) being compared, so I am not sure if calling into the kernels would help at all
Ok, maybe I am misunderstanding, sorry, it has been a while. If I recall, we will need to perform this comparison for many rows. The implementation here compares a single row at a time. The suggestion to use the kernels was to use a vectorized comparison, which leverages aligned memory, no bound checks, and no type checking (i.e. no per item downcast). Sorry I do not have any code :/, was just a comment hinting at the opportunity to vectorize the operation.
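As a concrete illustration of the vectorized alternative being described (a generic arrow-rs example, assuming a version that still ships the `eq_scalar` comparison kernel; this is not code from this PR):

```rust
use arrow::array::Int32Array;
use arrow::compute::kernels::comparison::eq_scalar;

fn main() {
    let array = Int32Array::from(vec![Some(1), None, Some(3), Some(1)]);
    // one kernel call compares every row against the scalar and returns a BooleanArray
    let mask = eq_scalar(&array, 1).expect("comparison kernel");
    assert!(mask.value(0));
    assert!(!mask.value(2));
    assert!(mask.value(3));
}
```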
Yes, I think that is accurate. I would love to figure out how to use a vectorized approach. In #808 we have a hash table mapping (effectively):

for (hash, index) in ... {
    // look up entry in hash table for row `index`
    // check if the values at `[ArrayRefs]` @ `index` are the same as in the entry in the table (what this PR's code, `array_eq`, is used for)
    // ...
}

In order to vectorize this calculation, I think we would have to somehow vectorize both the lookup in the table as well as the comparison. I suppose we could potentially copy the existing keys out of the hash entries (so they are contiguous) to do a vectorized comparison, but my intuition is that the cost of this copy would outweigh any gains due to vectorization.
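To make the role of the per-row comparison concrete, here is a hedged sketch of the collision check described above, assuming the `eq_array` signature proposed in this PR (`fn eq_array(&self, array: &ArrayRef, index: usize) -> bool`); the surrounding hash-table code and the `group_matches` name are illustrative only:

```rust
use arrow::array::ArrayRef;
use datafusion::scalar::ScalarValue;

// Illustrative: confirm that row `index` of each group-by column matches the
// ScalarValues already stored in a hash table entry (a hash collision check).
fn group_matches(group_values: &[ScalarValue], columns: &[ArrayRef], index: usize) -> bool {
    group_values
        .iter()
        .zip(columns.iter())
        .all(|(scalar, array)| scalar.eq_array(array, index))
}
```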
I agree vectorizing that part can be hard. I think it means somehow delaying the collision handling and doing it for the full batch instead.
I think it's pretty hard, as @alamb mentions, to vectorize this part, as it also depends on the hashtable data structure (checking for collisions on insert). I think a fully vectorized algorithm should build the table and do collision handling in different loops.
I will also add some comments to the function explaining that it has a narrow use case and that the compute kernels should be preferred if at all possible.
LGTM also. Great work, @alamb
Force-pushed from 832803c to a2eec53
($array:expr, $index:expr, $ARRAYTYPE:ident, $VALUE:expr) => {{
    let array = $array.as_any().downcast_ref::<$ARRAYTYPE>().unwrap();
    let is_valid = array.is_valid($index);
    match $VALUE {
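For context, a hedged reconstruction of how the rest of this macro arm plausibly looks (the macro name and exact match arms are illustrative, not copied from the diff): a row equals a non-NULL scalar only when the slot is valid and the values match, and equals a NULL scalar only when the slot is itself null.

```rust
// Illustrative sketch of the full macro arm discussed above.
macro_rules! eq_array_primitive {
    ($array:expr, $index:expr, $ARRAYTYPE:ident, $VALUE:expr) => {{
        let array = $array.as_any().downcast_ref::<$ARRAYTYPE>().unwrap();
        let is_valid = array.is_valid($index);
        match $VALUE {
            // non-NULL scalar: row must be valid and hold an equal value
            Some(v) => is_valid && &array.value($index) == v,
            // NULL scalar: only a NULL row matches
            None => !is_valid,
        }
    }};
}
```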
can we avoid the match if `!is_valid`? Would that make any difference to performance?
I think it could if there was a specific `eq_array` implementation for non-null arrays. On most kernels / code, this has a non-negligible impact on performance.
The code path in the hash aggregate could then check whether the array contains 0 nulls and choose a different implementation if this is the case.
I think at this moment it might not have that much of an impact; maybe for the "easier" hash-aggregates with only a few groups it might have a higher relative impact.
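A minimal sketch of that idea, using a hypothetical per-row primitive comparison (the function name and concrete types are illustrative, not PR code): check the column's null count once and branch into a validity-free path when possible.

```rust
use arrow::array::{Array, Int32Array};

// Illustrative: compare row `index` against an optional scalar, skipping the
// per-row validity check entirely when the array is known to contain no nulls.
fn eq_row_to_scalar(array: &Int32Array, index: usize, scalar: Option<i32>) -> bool {
    if array.null_count() == 0 {
        // fast path: every slot is valid, so a NULL scalar can never match
        match scalar {
            Some(v) => array.value(index) == v,
            None => false,
        }
    } else {
        let is_valid = array.is_valid(index);
        match scalar {
            Some(v) => is_valid && array.value(index) == v,
            None => !is_valid,
        }
    }
}
```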
Filed #850 to track this suggestion
Force-pushed from a2eec53 to 49aa351
GitHub Actions appears to be having issues (https://www.githubstatus.com/), so no checks have run on this PR
🎉 Thanks @Dandandan
Which issue does this PR close?
Re #790 / part of #808 which can be reviewed independently
Rationale for this change
Comparing `ArrayRef` to `ScalarValue` is in the performance critical section and thus should be optimized. Creating `ScalarValue`s from the `ArrayRef`s is too slow and results in copying.
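To make the rationale concrete, here is a hypothetical before/after using the API added here (the `eq_array` signature is assumed from this PR; everything else is standard arrow / DataFusion usage):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, StringArray};
use datafusion::error::Result;
use datafusion::scalar::ScalarValue;

fn main() -> Result<()> {
    let array: ArrayRef = Arc::new(StringArray::from(vec![Some("a"), None, Some("b")]));
    let scalar = ScalarValue::Utf8(Some("b".to_string()));

    // slow path: copy row 2 out of the array into an owned ScalarValue, then compare
    let copied = ScalarValue::try_from_array(&array, 2)?;
    assert_eq!(copied, scalar);

    // fast path: compare the scalar directly against row 2, no ScalarValue is built
    assert!(scalar.eq_array(&array, 2));
    Ok(())
}
```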
What changes are included in this PR?
- Add `eq_array` function to `ScalarValue`
- `ScalarValue::try_from_array` for dictionary arrays

Are there any user-facing changes?
No