Specialize ASCII case for substr() #12444

Merged
merged 2 commits into from
Sep 17, 2024

Conversation

@2010YOUY01 2010YOUY01 commented Sep 12, 2024

Which issue does this PR close?

Part of #12306

Rationale for this change

See the issue for the background.

The start and count arguments in substr(s, start, count) are character-based, so for a UTF-8 encoded string the function has to decode start + count characters. However, if the given strings are all ASCII, the function can compute byte indices in constant time.
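
A minimal sketch of the difference, written against plain `&str` values rather than the Arrow arrays the PR actually operates on. The function names `substr_chars`, `substr_ascii`, and `substr` are hypothetical, and the sketch assumes `start >= 1` (it does not reproduce SQL semantics for zero or negative `start`).

```rust
/// Character-based path: walks the UTF-8 data to find byte offsets.
/// `start` is 1-based; this sketch assumes `start >= 1`.
fn substr_chars(s: &str, start: usize, count: usize) -> &str {
    let skip = start.saturating_sub(1);
    // Byte offset of the first requested character (or end of string).
    let begin = s.char_indices().nth(skip).map_or(s.len(), |(i, _)| i);
    let rest = &s[begin..];
    // Byte offset just past the last requested character.
    let end = rest.char_indices().nth(count).map_or(rest.len(), |(i, _)| i);
    &rest[..end]
}

/// ASCII path: one character is one byte, so the arguments are byte indices.
/// The caller must ensure `s` is ASCII-only.
fn substr_ascii(s: &str, start: usize, count: usize) -> &str {
    debug_assert!(s.is_ascii());
    let begin = start.saturating_sub(1).min(s.len());
    let end = begin.saturating_add(count).min(s.len());
    &s[begin..end]
}

/// Dispatches to the byte-index path when the string is ASCII-only.
fn substr(s: &str, start: usize, count: usize) -> &str {
    if s.is_ascii() {
        substr_ascii(s, start, count)
    } else {
        substr_chars(s, start, count)
    }
}
```

Note that `s.is_ascii()` still scans the whole string, which is the validation cost discussed in the rest of this description.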

One tricky case for this function is taking a small prefix of a long string (e.g. substr(long_str_with_1k_chars, 1, 20)). In this case the ASCII validation overhead can exceed the cost of decoding a small number of characters. As a result, in some micro-benchmarks (taking the first 6 bytes of a 128-byte string), this PR introduced a ~5% slowdown.
To avoid a big regression for similar patterns, the implementation checks the approximate string length and skips ASCII validation if the strings are too long (see the code comment for more detail).
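
The length check described above could look roughly like the following. This is a sketch under assumptions, not the PR's code: `approx_avg_len`, `should_validate_ascii`, and the `max_avg_len` parameter are hypothetical names, and sampling the first few values is a simplification chosen for illustration.

```rust
/// Estimates the average byte length from the first `n_sample` values.
/// Returns 0.0 for an empty input.
fn approx_avg_len(strings: &[&str], n_sample: usize) -> f64 {
    let sample = &strings[..strings.len().min(n_sample)];
    if sample.is_empty() {
        return 0.0;
    }
    let total: usize = sample.iter().map(|s| s.len()).sum();
    total as f64 / sample.len() as f64
}

/// Decides whether ASCII validation is likely to pay off.
/// `max_avg_len` is an assumed tuning knob: above it, validation is skipped
/// and the character-based path is used instead.
fn should_validate_ascii(strings: &[&str], n_sample: usize, max_avg_len: f64) -> bool {
    approx_avg_len(strings, n_sample) <= max_avg_len
}
```

Sampling keeps the decision cheap: the cost is bounded by `n_sample` regardless of how many rows the batch holds.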

The micro-benchmark result:
substr_baseline - No optimization
substr_before - StringView optimization (take the substring by only modifying views, avoiding a copy of the whole string), introduced by #12044
substr_after - StringView optimization + ASCII fast path (this PR)
[image: micro-benchmark results]

What changes are included in this PR?

If the input string is ASCII-only, use the function arguments directly as byte indices to compute the substring.

Are these changes tested?

Existing sqllogictests have enough coverage of ASCII/non-ASCII/mixed test cases for the substr() function.

Are there any user-facing changes?

@goldmedal goldmedal left a comment

Thanks @2010YOUY01, this PR makes sense to me. 👍

Comment on lines 224 to 227
// A common pattern to call `substr()` is taking a small prefix of a long
// string, such as `substr(long_str_with_1k_chars, 1, 32)`.
// In such case the overhead of ASCII-validation may not be worth it, so
// skip the validation for long strings for now.

Why not check only the requested string prefix for being ASCII?
Could a string_view_array.is_ascii variant validate string prefixes of a given length while still being vectorized?
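
To illustrate the idea for a single value: if the first `start - 1 + count` bytes are all ASCII, those bytes are exactly the first `start - 1 + count` characters, so the substring can be sliced by byte index without looking at the rest of the string. The helpers `prefix_is_ascii` and `substr_prefix_checked` below are hypothetical, work on plain `&str`, and assume `start >= 1`; an array-level, vectorized variant is not shown.

```rust
/// Returns true if the first `n` bytes of `s` (or all of `s`, if shorter)
/// are ASCII.
fn prefix_is_ascii(s: &str, n: usize) -> bool {
    let bytes = s.as_bytes();
    bytes[..bytes.len().min(n)].is_ascii()
}

/// Byte-index substring that validates only the bytes it needs.
/// Returns `None` when the prefix contains non-ASCII data, in which case the
/// caller would fall back to a character-based path.
fn substr_prefix_checked(s: &str, start: usize, count: usize) -> Option<&str> {
    let begin = start.saturating_sub(1);
    let end = begin.saturating_add(count);
    if !prefix_is_ascii(s, end) {
        return None;
    }
    // Every offset inside an all-ASCII prefix is a valid char boundary.
    let len = s.len();
    Some(&s[begin.min(len)..end.min(len)])
}
```

With this shape, the validation cost scales with the requested prefix rather than with the full string length.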

I am not quite sure if it is the same question that @findepi is asking, but I wonder if we could get back the performance loss by also using the information on the number of bytes we are requesting? For example, if the prefix length is less than, say, 32, don't bother checking for ASCII. 🤔

I think short prefixes are likely common (looking for http:// as a URL prefix, for example). 🤔

> Why not check only the requested string prefix for being ASCII? Could a string_view_array.is_ascii variant validate string prefixes of a given length while still being vectorized?

I think it's a good idea for the current situation.
However, in the long term we might use an alternative approach: do the validation when reading arrays from storage into memory, and cache this is_ascii property within the Arrow array (as suggested by @alamb in #12444 (review)).

@alamb alamb left a comment

Thank you @2010YOUY01 and @goldmedal

I am in general somewhat lukewarm on adding optimizations that make some queries faster and some slower (as it then becomes a tradeoff, and different users might have different tradeoffs).

It would be great to figure out how to avoid this tradeoff (I left one suggestion)

The other thing I keep thinking about is how we can avoid this 'is_ascii' check at runtime (so things get faster regardless). Maybe it is time to consider starting to propagate the is_ascii flag on the arrays themselves.

The parquet reader, for example, knows when it has only ascii data
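
One way to picture the suggestion is a column that carries its own `is_ascii` flag, either supplied by whoever produced the data or computed once on first use. The `StringColumn` type below is purely hypothetical and is not an Arrow or DataFusion API; it only sketches the caching behavior with the standard library's `OnceLock`.

```rust
use std::sync::OnceLock;

/// Hypothetical string column that remembers whether all values are ASCII.
struct StringColumn {
    values: Vec<String>,
    is_ascii: OnceLock<bool>,
}

impl StringColumn {
    /// Flag unknown: it is computed lazily on the first `is_ascii()` call.
    fn new(values: Vec<String>) -> Self {
        Self { values, is_ascii: OnceLock::new() }
    }

    /// Flag supplied by the producer (assumption: e.g. a file reader that
    /// already knows the answer). The flag is trusted, not re-validated.
    fn with_known_ascii(values: Vec<String>, is_ascii: bool) -> Self {
        let cell = OnceLock::new();
        let _ = cell.set(is_ascii);
        Self { values, is_ascii: cell }
    }

    /// Returns the cached flag, scanning the values at most once.
    fn is_ascii(&self) -> bool {
        *self
            .is_ascii
            .get_or_init(|| self.values.iter().all(|s| s.is_ascii()))
    }
}
```

A function such as substr() could then branch on `column.is_ascii()` without paying for a scan on every call, provided the flag is kept correct whenever the data changes.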

datafusion/functions/src/unicode/substr.rs (outdated)
@2010YOUY01

> I am in general somewhat lukewarm on adding optimizations that make some queries faster and some slower (as it then becomes a tradeoff, and different users might have different tradeoffs).
>
> It would be great to figure out how to avoid this tradeoff (I left one suggestion)

I think this regression is fixable in the long term by making the ASCII check more efficient (the current ASCII check, especially for StringView, is not the most efficient approach), but it's a good idea to be more conservative and skip ASCII validation for small prefixes for now.
I applied this suggestion and benchmarked again, and I think there is no noticeable ASCII check overhead:

Result:
substr_before is current main, which already has the StringView optimization to avoid copying
substr_after is this PR with the additional ASCII fast path

group                                                                              substr_after                           substr_before
-----                                                                              ------------                           -------------
LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128]               1.00     74.1±1.13µs        ? ?/sec    2.65    196.4±1.32µs        ? ?/sec
LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128]               1.00    290.6±1.16µs        ? ?/sec    2.68   779.1±17.07µs        ? ?/sec
LONGER THAN 12/substr_string [size=1024, count=64, strlen=128]                     1.00     72.9±0.25µs        ? ?/sec    2.91   212.2±13.48µs        ? ?/sec
LONGER THAN 12/substr_string [size=4096, count=64, strlen=128]                     1.00    285.0±1.72µs        ? ?/sec    2.99   852.6±67.06µs        ? ?/sec
LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128]                1.00     29.7±0.17µs        ? ?/sec    5.61   166.5±24.98µs        ? ?/sec
LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128]                1.00    117.8±0.92µs        ? ?/sec    5.29   623.4±29.53µs        ? ?/sec
SHORTER THAN 12/substr_large_string [size=1024, strlen=12]                         1.00     59.0±0.67µs        ? ?/sec    1.15     67.8±1.30µs        ? ?/sec
SHORTER THAN 12/substr_large_string [size=4096, strlen=12]                         1.00    228.5±2.10µs        ? ?/sec    1.26   289.0±25.86µs        ? ?/sec
SHORTER THAN 12/substr_string [size=1024, strlen=12]                               1.00     55.3±0.46µs        ? ?/sec    1.06     58.5±3.18µs        ? ?/sec
SHORTER THAN 12/substr_string [size=4096, strlen=12]                               1.00    214.8±1.59µs        ? ?/sec    1.04    222.4±4.55µs        ? ?/sec
SHORTER THAN 12/substr_string_view [size=1024, strlen=12]                          1.00     18.2±0.09µs        ? ?/sec    1.27     23.0±0.49µs        ? ?/sec
SHORTER THAN 12/substr_string_view [size=4096, strlen=12]                          1.00     73.5±1.79µs        ? ?/sec    1.44   105.8±11.82µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6, strlen=128]    1.00     75.9±0.40µs        ? ?/sec    1.04     78.8±3.79µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6, strlen=128]    1.00    297.4±2.70µs        ? ?/sec    1.01    299.3±8.54µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128]          1.00     77.8±0.24µs        ? ?/sec    1.07    83.4±10.36µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128]          1.04    300.9±1.48µs        ? ?/sec    1.00    289.1±3.56µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6, strlen=128]     1.06     33.3±0.63µs        ? ?/sec    1.00     31.5±0.15µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6, strlen=128]     1.00    129.8±2.23µs        ? ?/sec    1.01   130.8±13.20µs        ? ?/sec

> The other thing I keep thinking about is how we can avoid this 'is_ascii' check at runtime (so things get faster regardless). Maybe it is time to consider starting to propagate the is_ascii flag on the arrays themselves.
>
> The parquet reader, for example, knows when it has only ascii data

I think it's a good idea.
I'm curious (and this would also help justify the extra complexity): is your (InfluxDB) real workload dominated by string data? I saw somewhere that Databricks and Tableau said more than 50% of their production workload is string data, much of it a substitute for UDTs or uncleaned raw data. For such cases it should be worth the effort.

@alamb alamb left a comment

Thank you @2010YOUY01 -- I think this PR and code are now looking quite good 👌

Thank you @goldmedal and @findepi for the review

// However, checking if a string is ASCII-only is relatively cheap.
// If strings are ASCII only, use byte-based indices instead.
//
// A common pattern to call `substr()` is taking a small prefix of a long
👍

let short_prefix_threshold = 32.0;
let n_sample = 10;

// HACK: can be simplified if function has specialized
It's a good point: this could be faster if it had a specialization for ScalarValue.

Any chance you can file a ticket for this?

@alamb alamb merged commit 55707dc into apache:main Sep 17, 2024
24 checks passed