Specialize ASCII case for substr() #12444
Conversation
Thanks @2010YOUY01, this PR makes sense to me. 👍
> // A common pattern to call `substr()` is taking a small prefix of a long
> // string, such as `substr(long_str_with_1k_chars, 1, 32)`.
> // In such case the overhead of ASCII-validation may not be worth it, so
> // skip the validation for long strings for now.
Why not check only the requested string prefix for being ASCII? Could a `string_view_array.is_ascii` variant validate string prefixes of a given length while still being vectorized?
I am not quite sure if it is the same question that @findepi is asking, but I wonder if we could get back the performance loss by also using the information on the number of bytes we are requesting. For example, if the prefix length is less than 32, don't bother checking for ASCII. 🤔
I think short prefixes are likely common (looking for `http://` as a URL prefix, for example). 🤔
> Why not check only the requested string prefix for being ASCII? Could a `string_view_array.is_ascii` variant validate string prefixes of a given length while still being vectorized?
I think it's a good idea for the current situation.
However, in the long term we might use an alternative approach: do the validation when reading arrays from storage into memory, and cache this `is_ascii` property within the Arrow array (as suggested by @alamb in #12444 (review))
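A minimal sketch of this caching idea, with hypothetical types (not DataFusion's or arrow-rs's actual API): compute `is_ascii` once when the array is constructed, e.g. by the reader, and store it alongside the values, so later kernel calls can branch on the cached flag instead of re-validating the data.

```rust
// Hypothetical sketch, not the real Arrow array type: the `is_ascii` flag
// is computed once at construction time and cached with the data.
#[allow(dead_code)]
struct AsciiCachedArray {
    values: Vec<String>,
    is_ascii: bool, // computed once when the array is built
}

impl AsciiCachedArray {
    fn new(values: Vec<String>) -> Self {
        // One pass over the data at build time; kernels later just read
        // the flag instead of validating every call.
        let is_ascii = values.iter().all(|s| s.is_ascii());
        Self { values, is_ascii }
    }
}
```

A reader that already knows its data is ASCII (e.g. from file metadata) could set the flag directly and skip even this one pass.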
Thank you @2010YOUY01 and @goldmedal.
I am in general somewhat lukewarm on adding optimizations that make some queries faster and some slower, as it then becomes a tradeoff, and different users might have different tradeoffs.
It would be great to figure out how to avoid this tradeoff (I left one suggestion).
The other thing I keep thinking about is how we can avoid this `is_ascii` check at runtime (so things get faster regardless). Maybe it is time to consider starting to propagate the `is_ascii` flag on the arrays themselves.
The parquet reader, for example, knows when it has only ASCII data.
> // A common pattern to call `substr()` is taking a small prefix of a long
> // string, such as `substr(long_str_with_1k_chars, 1, 32)`.
> // In such case the overhead of ASCII-validation may not be worth it, so
> // skip the validation for long strings for now.
> I am not quite sure if it is the same question that @findepi is asking, but I wonder if we could get back the performance loss by also using the information on the number of bytes we are requesting. For example, if the prefix length is less than 32, don't bother checking for ASCII. 🤔 I think short prefixes are likely common (looking for `http://` as a URL prefix, for example). 🤔

I think it's a good idea. I think this regression is fixable in the long term (by making the ASCII check more efficient, currently especially for …)
Thank you @2010YOUY01 -- I think this PR and code are now looking quite good 👌
Thank you @goldmedal and @findepi for the review.
> // However, checking if a string is ASCII-only is relatively cheap.
> // If strings are ASCII only, use byte-based indices instead.
> //
> // A common pattern to call `substr()` is taking a small prefix of a long
👍
> let short_prefix_threshold = 32.0;
> let n_sample = 10;

> // HACK: can be simplified if function has specialized
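The quoted snippet samples a few strings to estimate the average length. A self-contained sketch of that heuristic, with the names taken from the snippet but the surrounding logic assumed (this is not the exact DataFusion code):

```rust
// Hypothetical sketch of the length heuristic: sample the first few
// strings, estimate the average length, and only run ASCII validation
// when strings look short enough for the fast path to pay off.
fn should_check_ascii(strings: &[&str]) -> bool {
    // Names `short_prefix_threshold` / `n_sample` come from the quoted
    // snippet; the values are the ones shown there.
    const SHORT_PREFIX_THRESHOLD: f64 = 32.0;
    const N_SAMPLE: usize = 10;

    if strings.is_empty() {
        return true;
    }
    let n = strings.len().min(N_SAMPLE);
    let avg_len =
        strings[..n].iter().map(|s| s.len()).sum::<usize>() as f64 / n as f64;
    // Long strings: skip validation, since scanning e.g. 1 KB of data to
    // take a 32-char prefix costs more than decoding a few characters.
    avg_len <= SHORT_PREFIX_THRESHOLD
}
```

Sampling only a prefix of the array keeps the heuristic itself O(1) per batch, at the cost of being wrong when string lengths vary wildly.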
It's a good point that this could be faster if it had a specialization for ScalarValue.
Any chance you can file a ticket for this?
Which issue does this PR close?

Part of #12306

Rationale for this change

See the issue for the background.

The function arguments `start` and `count` in `substr(s, start, count)` are character-based: if the string is UTF-8 encoded, the function has to decode `start + count` characters. However, if the given strings are all ASCII, the function can calculate byte indices in constant time.

One tricky condition for this function is taking a small prefix of a long string (e.g. `substr(long_str_with_1k_chars, 1, 20)`). In this case the ASCII-validation overhead can be greater than decoding a small number of characters. As a result, in some micro-benchmarks (taking the first 6 bytes of a 128-byte string), this PR introduced a ~5% slowdown. To avoid a big regression for similar patterns, the implementation checks the approximate string length and skips ASCII validation if strings are too long (see the code comment for more detail).
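The character-vs-byte indexing distinction above can be sketched as a simplified standalone function (not the actual DataFusion kernel; `start` is 1-based as in SQL, and `start >= 1` is assumed):

```rust
// Simplified sketch of the ASCII fast path: when the input is ASCII-only,
// character indices equal byte indices, so we can slice in O(1) instead of
// walking UTF-8 char boundaries.
fn substr_chars(s: &str, start: usize, count: usize) -> &str {
    if s.is_ascii() {
        // ASCII fast path: 1 char == 1 byte.
        let begin = (start - 1).min(s.len());
        let end = (begin + count).min(s.len());
        &s[begin..end]
    } else {
        // General UTF-8 path: collect the byte offsets of char boundaries
        // (a real kernel would iterate lazily instead of collecting).
        let offsets: Vec<usize> = s
            .char_indices()
            .map(|(i, _)| i)
            .chain(std::iter::once(s.len()))
            .collect();
        let begin = offsets.get(start - 1).copied().unwrap_or(s.len());
        let end = offsets.get(start - 1 + count).copied().unwrap_or(s.len());
        &s[begin..end]
    }
}
```

The `s.is_ascii()` call is itself a scan over the string, which is exactly the validation cost the length heuristic in this PR tries to avoid for long inputs.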
The micro-benchmark result:
- `substr_baseline` - no optimization
- `substr_before` - StringView optimization (take the substring by only modifying views, avoiding copying the whole string), introduced by #12044
- `substr_after` - StringView optimization + ASCII fast path (this PR)

What changes are included in this PR?

If the input string is ASCII-only, use the function arguments directly as byte indices to compute the substring.

Are these changes tested?

Existing sqllogictests have enough coverage for ASCII/non-ASCII/mixed test cases for the `substr()` function.

Are there any user-facing changes?