-
Notifications
You must be signed in to change notification settings - Fork 833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specialize Prefix/Suffix Match for Like/ILike
between Array and Scalar for StringViewArray
#6231
Conversation
Bench from my dev machine # xinli @ arch-dev in ~/source/repos/arrow-rs on git:dev/xinli1/optimize_prefix o [11:42:12] C:130
$ uname -a
Linux arch-dev 6.10.3-zen1-2-zen #1 ZEN SMP PREEMPT_DYNAMIC Tue, 06 Aug 2024 07:47:21 +0000 x86_64 GNU/Linux
# xinli @ arch-dev in ~/source/repos/arrow-rs on git:dev/xinli1/optimize_prefix o [12:33:29] C:130
$ critcmp master optimize_prefix optimize_prefix_boxed_iter
group master optimize_prefix optimize_prefix_boxed_iter
----- ------ --------------- --------------------------
like_utf8view scalar complex 1.01 172.7±1.45ms ? ?/sec 1.00 170.3±1.47ms ? ?/sec 1.00 171.0±1.55ms ? ?/sec
like_utf8view scalar contains 1.04 129.3±6.07ms ? ?/sec 1.01 126.2±1.46ms ? ?/sec 1.00 124.9±1.44ms ? ?/sec
like_utf8view scalar ends with 1.02 37.6±0.45ms ? ?/sec 1.00 36.7±0.38ms ? ?/sec 1.02 37.6±0.39ms ? ?/sec
like_utf8view scalar equals 1.01 26.2±0.42ms ? ?/sec 1.02 26.5±0.38ms ? ?/sec 1.00 26.0±0.27ms ? ?/sec
like_utf8view scalar starts with 1.90 32.8±1.58ms ? ?/sec 1.00 17.2±0.45ms ? ?/sec 1.89 32.5±0.45ms ? ?/sec |
Notablely this PR is just for |
I read through the conditions here. Ideally, when the lhs is a scalar and rhs is an array, use |
|
I did the benchmark on fixing the msrv issue. Either boxed iter or vector has hit the performance badly.. |
2x faster on starts_with. not bad! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xinlifoobar -- this is pretty cool
I think ti would be cooler if we can figure out how to use StringView to help prefixes that are up to 12 bytes long as well
Here is an updated benchmark for the latest code. It indicates the optimizations only work on the first 4/12 bytes. Any time it reaches the buffer, the perf is down. Given the result, I suspect it won't work on complex cases like # xinli @ arch-dev in ~/source/repos/arrow-rs on git:dev/xinli1/optimize_prefix o [22:23:48]
$ critcmp master optimize_prefix
group master optimize_prefix
----- ------ ---------------
like_utf8view scalar complex 1.00 174.5±2.43ms ? ?/sec 1.01 176.0±2.07ms ? ?/sec
like_utf8view scalar contains 1.00 129.2±1.50ms ? ?/sec 1.02 131.6±5.38ms ? ?/sec
like_utf8view scalar ends with 1.01 38.0±0.68ms ? ?/sec 1.00 37.7±0.38ms ? ?/sec
like_utf8view scalar equals 1.00 26.2±0.34ms ? ?/sec 1.00 26.2±0.37ms ? ?/sec
like_utf8view scalar starts with 1.63 32.6±0.40ms ? ?/sec 1.00 20.0±0.29ms ? ?/sec
like_utf8view scalar starts with more than 4 bytes 1.07 33.8±0.38ms ? ?/sec 1.00 31.7±0.28ms ? ?/sec |
I am hoping to find time to review this in more detail tomorrow |
I got some better results on other string view predicates. Will update them in batch tonight. |
4f0f49a
to
894e797
Compare
Seems the previous result are generated by falut code. Let me do more iterations for this. |
Updated the Benchmark results for the latest version.
|
Observations Based on Changes:
|
Like/ILike
between Array and ScalarLike/ILike
between Array and Scalar for StringViewArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xinlifoobar -- other than the suffix_iter producing potentially invalid &str
I think this PR looks good to me
Some minor improvements after use &[u8] directly. like_utf8view scalar complex
----------------------------
optimized_prefix_suffix_bytes 1.00 171.9±2.78ms ? ?/sec
optimized_prefix_suffix 1.02 174.7±2.58ms ? ?/sec
master_08_19 1.07 184.5±15.76ms ? ?/sec
like_utf8view scalar contains
-----------------------------
optimized_prefix_suffix_bytes 1.00 124.7±1.66ms ? ?/sec
optimized_prefix_suffix 1.02 127.5±2.69ms ? ?/sec
master_08_19 1.05 130.7±3.52ms ? ?/sec
like_utf8view scalar ends with
------------------------------
optimized_prefix_suffix_bytes 1.00 33.6±0.52ms ? ?/sec
optimized_prefix_suffix 1.02 34.3±0.56ms ? ?/sec
master_08_19 1.12 37.8±0.54ms ? ?/sec
like_utf8view scalar equals
---------------------------
master_08_19 1.00 26.4±0.59ms ? ?/sec
optimized_prefix_suffix_bytes 1.01 26.7±0.56ms ? ?/sec
optimized_prefix_suffix 1.03 27.2±0.94ms ? ?/sec
like_utf8view scalar starts with
--------------------------------
optimized_prefix_suffix_bytes 1.00 20.4±0.27ms ? ?/sec
optimized_prefix_suffix 1.01 20.6±0.65ms ? ?/sec
master_08_19 1.61 32.8±0.55ms ? ?/sec
like_utf8view scalar starts with more than 4 bytes
--------------------------------------------------
optimized_prefix_suffix_bytes 1.00 31.8±0.35ms ? ?/sec
optimized_prefix_suffix 1.00 31.9±0.44ms ? ?/sec
master_08_19 1.07 34.0±0.53ms ? ?/sec |
Like/ILike
between Array and Scalar for StringViewArrayLike/ILike
between Array and Scalar for StringViewArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xinlifoobar -- I am running the benchmarks on this branch and will report back
// 😈 is four bytes long. | ||
test_utf8_scalar!( | ||
test_uff8_array_like_multibyte, | ||
vec![ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 it occurs to me we should also be testing with Option
s as well (aka the test data should have nulls)
let len = (*v as u32) as usize; | ||
|
||
if len < prefix_len { | ||
return &[] as &[u8]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as you mentioned above, having to return an empty slice just for the function to immediate check it again might be another potential performance improvement
What do you think about making this more general and take a function? Maybe something like the following (untested)
/// Applies function `f` to the first `prefix_len` bytes for all views
/// if the view length is less tha prefix_len func is invoked with None(T)
pub fn prefix_bytes_iter<F, T>(&self, prefix_len: usize, func: F) -> impl Iterator<Item = T>
where
F: FnMut(Option<&[u8]>) -> T
{
...
}
I am not sure this is a good idea but figured maybe it would be more general. But maybe not...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought passing a function pointer to the *_iters
was a bad decision. I did this actually in the first version of this PR, e.g.,
pub fn predicate(&self, func: F) -> Impl ArrayRef
where F: FnMut(Option<&[u8]>) -> T
{
}
# or
pub fn predicate_prefix(&self, func: F) -> Impl ArrayRef
where F: FnMut(Option<&[u8]>) -> T
{
}
This was good, but a circular on the crate dependencies was introduced, i.e.,
# past
Predicate --evaluate_array--> Array
# after
Predicate --evaluate_array--> Array --predicate--> Predicate Function --evaluate--> Array Item.
This could be solved by re-layouting the code but lots of changes there.
Also, the functions are very specialized, as they should not be. The function signature is not flexible enough to generalize all such requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense -- thank you for the explanation. Let's keep exploring this method for now
🤔 the benchmarks fail now for me like
|
This issue could be repro on the |
This would be a more complex fix than expected. The following functions, including |
Thanks @xinlifoobar -- indeed this does appear to be an issue on the master branch. I filed #6283 and will fix it shortly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for @xinlifoobar. I think this PR is looking quite nice now.
I am sorry for all the back and forth with this PR, but I think we are on good track now. Doing performance optimizations always seems to take much longer than I think/hope it should :)
I think the high level structure / idea of this PR is looking good
What I think is needed next is to run the benchmarks and see how much better this branch is than master
(and validate if bytes_iter() makes a difference, for example)
I think once we have merged #6284 and merged up to this branch we should be able to test it out.
.collect::<Vec<_>>(), | ||
) | ||
} else { | ||
BooleanArray::from_unary(array, |haystack| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking more carefully at BooleanArray::from_unary
it will use the ArrayAccessor
impl for StringViewArray
arrow-rs/arrow-array/src/array/byte_view_array.rs
Lines 547 to 557 in 7eb3c83
impl<'a, T: ByteViewType + ?Sized> ArrayAccessor for &'a GenericByteViewArray<T> { | |
type Item = &'a T::Native; | |
fn value(&self, index: usize) -> Self::Item { | |
GenericByteViewArray::value(self, index) | |
} | |
unsafe fn value_unchecked(&self, index: usize) -> Self::Item { | |
GenericByteViewArray::value_unchecked(self, index) | |
} | |
} |
It isn't clear to me that how calling bytes_iter()
would make this faster as the code for value_unchecked
is the same as butes_iter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the test should be we'll try benchmarking and see if this improves things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, I thought the only differences between bytes_iter
and ArrayAccessor
is
For bytes_iter
self.views().has_next()? -> self.views().next() -> value_unchecked()
For ArrayAccessor
index = index + 1 -> self.views.get_unchecked(idx) -> str(value_unchecked()).as_bytes()
There are merely differences between the indexing operations and iterator methods. The benchmark also indicates in %5 ranges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a PR here to refine this documentation: #6306
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another difference is that bytes_iterator()
iterates over all array slots, including those that are null
let len = (*v as u32) as usize; | ||
|
||
if len < prefix_len { | ||
return &[] as &[u8]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense -- thank you for the explanation. Let's keep exploring this method for now
New benchmark results. It looks like the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR is I think this is good to go. Thank you @xinlifoobar this is very nice
It would be nice to avoid the need for bytes_iter
if possible before release. I will explore doing so in a follow on PR
I ran the benchmarks again 👨🍳 👌
++ critcmp master dev_xinli1_optimize_prefix
group dev_xinli1_optimize_prefix master
----- -------------------------- ------
ilike_utf8 scalar complex 1.00 2.7±0.08ms ? ?/sec 1.00 2.7±0.09ms ? ?/sec
ilike_utf8 scalar contains 1.00 4.2±0.07ms ? ?/sec 1.00 4.2±0.08ms ? ?/sec
ilike_utf8 scalar ends with 1.00 1235.1±38.98µs ? ?/sec 1.00 1240.9±41.24µs ? ?/sec
ilike_utf8 scalar equals 1.00 773.0±23.01µs ? ?/sec 1.01 780.1±20.32µs ? ?/sec
ilike_utf8 scalar starts with 1.00 1132.5±25.82µs ? ?/sec 1.00 1135.2±39.42µs ? ?/sec
ilike_utf8_scalar_dyn dictionary[10] string[4]) 1.00 88.3±0.16µs ? ?/sec 1.00 88.2±0.09µs ? ?/sec
like_utf8 scalar complex 1.01 1876.9±53.81µs ? ?/sec 1.00 1859.7±27.02µs ? ?/sec
like_utf8 scalar contains 1.00 1726.5±15.40µs ? ?/sec 1.03 1774.5±19.87µs ? ?/sec
like_utf8 scalar ends with 1.03 440.5±6.68µs ? ?/sec 1.00 426.8±13.04µs ? ?/sec
like_utf8 scalar equals 1.00 90.9±0.22µs ? ?/sec 1.39 126.8±0.13µs ? ?/sec
like_utf8 scalar starts with 1.03 341.8±5.19µs ? ?/sec 1.00 333.4±3.99µs ? ?/sec
like_utf8_scalar_dyn dictionary[10] string[4]) 1.00 88.2±0.18µs ? ?/sec 1.00 88.0±0.11µs ? ?/sec
like_utf8view scalar complex 1.03 183.7±1.27ms ? ?/sec 1.00 179.0±0.62ms ? ?/sec
like_utf8view scalar contains 1.00 129.9±0.28ms ? ?/sec 1.04 135.4±0.22ms ? ?/sec
like_utf8view scalar ends with 13 bytes 1.00 43.7±0.27ms ? ?/sec 1.15 50.5±0.22ms ? ?/sec
like_utf8view scalar ends with 4 bytes 1.00 44.7±0.16ms ? ?/sec 1.22 54.4±0.21ms ? ?/sec
like_utf8view scalar ends with 6 bytes 1.00 44.7±0.23ms ? ?/sec 1.24 55.5±0.11ms ? ?/sec
like_utf8view scalar equals 1.00 32.3±0.09ms ? ?/sec 1.07 34.5±0.07ms ? ?/sec
like_utf8view scalar starts with 13 bytes 1.00 45.9±0.15ms ? ?/sec 1.00 46.1±0.30ms ? ?/sec
like_utf8view scalar starts with 4 bytes 1.00 25.2±0.11ms ? ?/sec 1.93 48.6±0.10ms ? ?/sec
like_utf8view scalar starts with 6 bytes 1.00 46.3±0.21ms ? ?/sec 1.07 49.5±0.19ms ? ?/sec
🚀 -- thanks again for sticking with this @xinlifoobar |
Which issue does this PR close?
Parts of #5951.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?