Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ScalarValue::eq_array optimized comparison function #844

Merged
merged 4 commits into from
Aug 11, 2021

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Aug 9, 2021

Which issue does this PR close?

Re #790 / part of #808 which can be reviewed independently

Rationale for this change

  1. For the group by hash algorithm in Rework GroupByHash to for faster performance and support grouping by nulls #808, being able to compare values in [ArrayRef] to [ScalarValue] is in the performance critical section and thus should be optimized. Creating ScalarValues from the ArrayRefs is too slow and results in copying.

What changes are included in this PR?

  1. Add a specialized eq_array function to ScalarValue
  2. Test for same
  3. A bug fix for null handling in ScalarValue::try_from_array for dictionary arrays

Are there any user-facing changes?

No

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Aug 9, 2021
@@ -277,6 +277,31 @@ impl std::hash::Hash for ScalarValue {
}
}

// return the index into the dictionary values for array@index as well
// as a reference to the dictionary values array. Returns None for the
// index if the array is NULL at index
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apache/arrow-rs#672 proposes adding this upstream in arrow

I think this properly handles null values now in the DictionaryArray, whereas y initial version did not

@@ -973,22 +1011,106 @@ impl ScalarValue {
})
}

fn try_from_dict_array<K: ArrowDictionaryKeyType>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored into get_dict_value

@jorgecarleitao
Copy link
Member

Could it make sense to use arrow::compute::kernels::comparison::eq_utf8_scalar, simd_compare_op and the like? Not sure there is an implementation for the dictionary array, but, for the remaining, it seems that this is the case.

@alamb
Copy link
Contributor Author

alamb commented Aug 9, 2021

Could it make sense to use arrow::compute::kernels::comparison::eq_utf8_scalar, simd_compare_op and the like? Not sure there is an implementation for the dictionary array, but, for the remaining, it seems that this is the case.

I am not sure @jorgecarleitao -- in this case there is only a single row (potentially multiple columns) being compared so I am not sure if calling into the kernels would help at all

@jorgecarleitao
Copy link
Member

Ok, maybe I am misunderstanding, sorry, it has been a while.

If I recall, we will need to perform N x M comparisons where N is the number of rows in the batch and M the distinct number of items in a group, around here, roughly represented in for (row, hash) in batch_hashes.into_iter().enumerate() and the inner group_values.iter()....all(op).

The implementation array_eq will promote an non-vectorized approach where each operation requires a downcast and some conversions, i.e. it needs to check type (downcast), 2 bound checks (.is_valid and .value) and works on non-aligned memory (i.e. not all comparisons are done at once).

The suggestion to use the kernels to use a vectorized comparison, which leverages an aligned memory, no bound checks, and no type checking (i.e. no per item downcast). Sorry I do not have any code :/, was just a comment hinting to the opportunity to vectorize the operation.

@alamb
Copy link
Contributor Author

alamb commented Aug 9, 2021

If I recall, we will need to perform N x M comparisons where N is the number of rows in the batch and M the distinct number of items in a group, around here, roughly represented in for (row, hash) in batch_hashes.into_iter().enumerate() and the inner group_values.iter()....all(op).

Yes, I think that is accurate. I would love to figure out how to use a vectorized approach

In #808 we have a hash table mapping (effectively) [ScalarValue] -> (aggregate data) and we have input [ArrayRef]. The hash calculation for each input row is vectorized (so that is good), and then there is a loop that looks like

for (hash, index) in ... {
  // look up entry in hash table for row `index`
  // check if the values at `[ArrayRefs]` @ `index` are the same as in the entry in the table (what this PR's code, `array_eq`, is used for)
  // ...
}

In order to vectorize this calculation, I think we would have to somehow vectorize both the lookup in the table as well as the comparison.

I suppose we could potentially copy the existing keys out of the hash entries (so they are contiguous) to do a vectorized comparison but my intuition is that the cost of this copy would outweigh any gains due to vectorization

@houqp houqp added the performance Make DataFusion faster label Aug 9, 2021
@Dandandan
Copy link
Contributor

I agree vectorizing that part can be hard I think it means somehow delaying the collision handling and doing it for the full batch instead.
That might require implementing a different hash table data structure or ignoring the collisions in the first place.
This is a good improvement over what we have.

@Dandandan
Copy link
Contributor

I think it's pretty hard as @alamb mentions to vectorize this part, as it also depends on the hashtable data structure (check collision on insert). I think a fully vectorized algorithm should build the table and do collision handling in different loops.

@alamb
Copy link
Contributor Author

alamb commented Aug 10, 2021

I will also add some comments to the function explaining that this function has a narrow usecase and that the compute kernels should be preferred if at all possible.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM also. Great work, @alamb

($array:expr, $index:expr, $ARRAYTYPE:ident, $VALUE:expr) => {{
let array = $array.as_any().downcast_ref::<$ARRAYTYPE>().unwrap();
let is_valid = array.is_valid($index);
match $VALUE {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid the match if !is_valid? Would that make any difference to performance?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could if there was a specific eq_array implementation for non-null arrays.
On most kernels / code, this has a non-negligible impact on performance.
The code path in the hash aggregate could then check whether the array contains 0 nulls and choose a different implementation if this is the case.
I think at this moment it might not have that much of an impact, maybe for the "easier" hash-aggregates with only few groups at might have a higher relative impact.

Copy link
Contributor Author

@alamb alamb Aug 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #850 to track this suggestion

@alamb
Copy link
Contributor Author

alamb commented Aug 10, 2021

Github actions appears to be having issues https://www.githubstatus.com/ so no checks have run on this PR

Screen Shot 2021-08-10 at 5 27 03 PM

@Dandandan Dandandan merged commit a4f6282 into apache:master Aug 11, 2021
@alamb
Copy link
Contributor Author

alamb commented Aug 11, 2021

🎉 Thanks @Dandandan

@alamb alamb deleted the alamb/array_eq_scalar branch August 8, 2023 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate performance Make DataFusion faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants