Optimize LIKE for more relaxed patterns #8050
@xumingming Thank you for working on this.
I haven't read the code yet. I understand how to handle the first 2 patterns, but I'm not sure what the approach is for implementing efficient matching for the last pattern. Would you clarify?
velox/functions/lib/Re2Functions.cpp
Outdated
while (indexInString <= end) {
  // Search the firstFixedString to find out where to start match the whole
  // pattern.
  auto it = strstr(input.begin() + indexInString, firstFixedString);
Is it safe to apply strstr to non-zero-terminated strings? I feel this might not be safe. I'm also concerned about the performance of some edge cases where we need to keep running this loop many times, shifting one char at a time. Would it be OK to not support this pattern, at least not in this PR, and focus on simple prefix and suffix patterns?
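To illustrate the safety concern: strstr keeps reading until it finds a terminating '\0', so it can run past the end of a buffer that only carries an explicit length. A minimal sketch of a bounded alternative follows; findBounded is a hypothetical helper written for this illustration, not a function from Velox, and it relies only on std::string_view::find.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of Velox): finds `needle` in the byte range
// [data, data + size), starting the search at offset `from`, without ever
// reading past the end of the range.
// Returns the offset of the first match, or std::string_view::npos.
inline size_t findBounded(
    const char* data,
    size_t size,
    std::string_view needle,
    size_t from = 0) {
  assert(data != nullptr || size == 0);
  // string_view carries an explicit length, so no terminating '\0' is needed.
  return std::string_view(data, size).find(needle, from);
}
```

This addresses only the out-of-bounds read. The performance concern about restarting the match one character at a time is independent of which search function is used.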
velox/functions/lib/Re2Functions.cpp
Outdated
        startPatternIndex)) {
  return true;
} else {
  // Not match the whole pattern, advance the cursor.
This can be very slow for some cases, like:
'aaaaaaa....aaaaa' LIKE '%a_b%'
Um... you are right, it is indeed slow for this kind of case. I will move the kRelaxedSubstring optimization into a separate PR.
@mbasmanova Removed the kRelaxedSubstring optimization from this PR and updated the description.
@xumingming Thank you for iterating. I have a question.
/// k[Relaxed]Prefix, k[Relaxed]Suffix and k[Relaxed]Substring.
std::string fixedPattern_;
/// Contains the sub-patterns for k[Relaxed]Xxx patterns.
std::vector<SubPatternMetadata> subPatterns_;
This works, but I wonder if a simpler way to model this would be to have a single string (h_e_ll_o) and a vector of pairs of start + length to indicate a list of substrings to match:
h_e_ll_o
[[0, 1], [2, 1], [4, 2], [7, 1]]
This way we do not need to create lots of small strings.
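A minimal sketch of how such a list of start + length pairs could be derived from the pattern string, assuming 0-based offsets and a pattern that contains only literals and '_' wildcards (no '%', no escape character). The function name literalRanges is hypothetical and chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: returns {start, length} pairs (0-based offsets into
// `pattern`) for each maximal run of characters other than '_'.
// Assumption: the pattern has no '%' and no escape character.
inline std::vector<std::pair<size_t, size_t>> literalRanges(
    std::string_view pattern) {
  assert(pattern.find('%') == std::string_view::npos);
  std::vector<std::pair<size_t, size_t>> ranges;
  size_t i = 0;
  while (i < pattern.size()) {
    if (pattern[i] == '_') {
      ++i; // Single-char wildcard: nothing to record, just skip it.
      continue;
    }
    const size_t start = i;
    while (i < pattern.size() && pattern[i] != '_') {
      ++i;
    }
    ranges.emplace_back(start, i - start);
  }
  return ranges;
}
```

With this layout the matcher can compare directly against slices of the one pattern string, so no per-sub-pattern std::string needs to be allocated.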
I noticed that the benchmark shows that relaxed patterns are a bit slower than fixed ones. Do you happen to know where the slowness comes from?
@mbasmanova Simplified the modeling as you suggested.
For the performance difference, I think it's due to the complexity of the relaxed pattern. Take kPrefix and kRelaxedPrefix as an example:
- For kPrefix, for every input row, we only need to compare the prefix string once.
- For kRelaxedPrefix, we need to access PatternMetadata::subPatternKinds_ and PatternMetadata::subPatternRanges_, and do more string matching.
Might need to generate a flamegraph to confirm.
velox/functions/lib/Re2Functions.cpp
Outdated
bool matchRelaxedFixedPattern(
    StringView input,
    const std::vector<SubPatternMetadata>& subPatterns,
    size_t length,
nit: perhaps, change the order of arguments to start, length (seems more conventional)
velox/functions/lib/Re2Functions.cpp
Outdated
    const std::vector<SubPatternMetadata>& subPatterns,
    size_t length,
    size_t start,
    size_t startPatternIndex = 0) {
Is this parameter used?
velox/functions/lib/Re2Functions.cpp
Outdated
auto indexInString = start;
for (auto i = startPatternIndex; i < subPatterns.size(); i++) {
  auto& subPattern = subPatterns[i];
const
velox/functions/lib/Re2Functions.cpp
Outdated
    StringView input,
    const std::vector<SubPatternMetadata>& subPatterns,
    size_t length) {
  if (input.size() < length) {
we can remove this check since it is already included in matchRelaxedFixedPattern
we can remove this function altogether I think.
velox/functions/lib/Re2Functions.cpp
Outdated
    StringView input,
    const std::vector<SubPatternMetadata>& subPatterns,
    size_t length) {
  if (input.size() < length) {
ditto
velox/functions/lib/Re2Functions.h
Outdated
std::string fixedPattern_;

/// Contains the sub-pattern kinds for k[Relaxed]Xxx patterns.
std::vector<SubPatternKind> subPatternKinds_;
Do we need these? We can only have single wildcards and fixed strings. It seems sufficient to track ranges of fixed strings, no?
We need to know which ranges require an exact match (e.g. 'abc') and which ranges only need to skip a certain number of characters (e.g. '___', wildcards). Wouldn't storing only the range info be insufficient?
I'm thinking that there are only 2 possibilities: either skip or match. Hence, we can list only the ranges to match and assume the rest is to skip.
[[1, 3], [5, 4], [10, 7]] + pattern length 20 means
make sure there are at least 20 chars, then
skip to 1
match 3 chars
skip to 5
match 4 chars
skip to 10
match 7 chars
Would that work?
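The skip-or-match idea above can be sketched as follows. This is only an illustration under stated assumptions: matchRelaxedPrefix and the LiteralRange alias are hypothetical names, `pattern` is assumed to be the fixed part of the LIKE pattern with the trailing '%' removed, and each range is a 0-based {start, length} pair into that pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>
#include <utility>
#include <vector>

using LiteralRange = std::pair<size_t, size_t>; // {start, length}

// Hypothetical sketch of a relaxed-prefix match. Only the literal ranges are
// compared; every position not covered by a range is a '_' wildcard and is
// skipped implicitly.
inline bool matchRelaxedPrefix(
    std::string_view input,
    std::string_view pattern,
    const std::vector<LiteralRange>& ranges) {
  // Make sure the input has at least as many chars as the fixed pattern.
  if (input.size() < pattern.size()) {
    return false;
  }
  for (const auto& [start, length] : ranges) {
    assert(start + length <= pattern.size());
    if (std::memcmp(input.data() + start, pattern.data() + start, length) !=
        0) {
      return false;
    }
  }
  return true;
}
```

For a relaxed suffix, the same loop would apply with every input offset shifted by input.size() - pattern.size(), so that the fixed part is aligned with the end of the input.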
velox/functions/lib/Re2Functions.cpp
Outdated
}

auto indexInString = start;
for (auto i = 0; i < patternMetadata.numSubPatterns(); i++) {
This loop can be simplified if we just keep a list of ranges for fixed strings.
@mbasmanova All comments addressed.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@xumingming The original optimization showed a speedup of ~20x, but this one is showing a much higher speedup. Is there some explanation?
@xumingming Overall looks good. Would you rebase?
velox/functions/lib/Re2Functions.cpp
Outdated
}

// Compare each literal range.
for (auto i = 0; i < patternMetadata.numLiteralRanges(); i++) {
nit: Wondering if the code can be simplified by iterating over the vector of ranges.
for (const auto& range : patternMetadata.literalRanges()) {
const auto start = range.first;
const auto length = range.second;
....memcmp...
}
velox/functions/lib/Re2Functions.cpp
Outdated
count++;
if (firstIndex == -1) {
  firstIndex = index;
  lastIndex = index;
this code is the same in both if and else branches; consider moving it outside of the if-else block
FYI, benchmark results I'm seeing
@@ -110,8 +110,20 @@ class Re2FunctionsTest : public test::FunctionBaseTest {
        std::optional<std::string>{input});
  };

  // Print the failing pattern to make it easier to locate the failed case.
@mbasmanova Sure. I plan to implement kRelaxedSubstring, and I can do some refactoring before that.
@xumingming Sounds good. Thanks.
Updated the code to resolve all the review comments and rebased onto main.
@xumingming format-check is failing. Would you take a look? Curious, would you like to write a blog post about all these optimizations for LIKE?
In this PR we optimize LIKE operations for patterns which I call kRelaxed[Prefix|Suffix] patterns, e.g.
- kRelaxedPrefix: _a_bc%%
- kRelaxedSuffix: %%_a_bc

'Relaxed' here means there are fewer restrictions than for their counterparts.

The algorithm for recognizing these relaxed patterns can be explained by an example. Say we have a pattern '___hello___%%'; it is split into 4 sub-patterns:
- [0] kSingleCharWildcard: ___
- [1] kLiteralString: hello
- [2] kSingleCharWildcard: ___
- [3] kAnyCharsWildcard: %%

Since the 'kAnyCharsWildcard' only occurs at the end of the pattern, we can determine it is a kRelaxedPrefix pattern, and then use the first 3 fixed sub-patterns to do the matching.

The benchmark result:

Before(kGeneric):
```
============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp    relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                  1.34s   747.38m
----------------------------------------------------------------------------
----------------------------------------------------------------------------
like_prefix##like_prefix                                 340.30ms      2.94
like_prefix##like_relaxed_prefix_1                       334.77ms      2.99
like_prefix##like_relaxed_prefix_2                       350.70ms      2.85
like_prefix##starts_with                                   5.35ms    187.05
like_substring##like_substring                              1.26s   790.87m
like_substring##strpos                                    20.55ms     48.67
like_suffix##like_suffix                                 957.06ms      1.04
like_suffix##like_relaxed_suffix_1                       935.90ms      1.07
like_suffix##like_relaxed_suffix_2                          1.08s   926.79m
like_suffix##ends_with                                     5.35ms    187.07
```

After(kRelaxedPrefix, kRelaxedSuffix):
```
============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp    relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                  1.48s   674.92m
----------------------------------------------------------------------------
----------------------------------------------------------------------------
like_prefix##like_prefix                                   7.05ms    141.80
like_prefix##like_relaxed_prefix_1                         9.06ms    110.36
like_prefix##like_relaxed_prefix_2                         8.55ms    116.94
like_prefix##starts_with                                   5.34ms    187.22
like_substring##like_substring                            22.47ms     44.50
like_substring##strpos                                    20.72ms     48.27
like_suffix##like_suffix                                   7.05ms    141.82
like_suffix##like_relaxed_suffix_1                         9.08ms    110.16
like_suffix##like_relaxed_suffix_2                         8.52ms    117.30
like_suffix##ends_with                                     5.35ms    187.07
```

The speedup for kRelaxedPrefix is about 40x, and the speedup for kRelaxedSuffix is about 100x.
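The recognition step in the description above can be sketched roughly as below. This is a simplified illustration, not the code from the PR: the Kind enum and classifyLikePattern are hypothetical names, the pattern is assumed to have no escape character, and patterns without any '_' are not distinguished from relaxed ones here, whereas the thread indicates Velox has separate kPrefix and kSuffix kinds for them.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Illustrative kinds only; the real enum in Velox has more values.
enum class Kind { kNoAnyCharsWildcard, kRelaxedPrefix, kRelaxedSuffix, kOther };

// Hypothetical classifier, assuming the pattern has no escape character.
// '%' matches any run of characters and '_' matches exactly one character.
inline Kind classifyLikePattern(std::string_view pattern) {
  constexpr auto npos = std::string_view::npos;
  if (pattern.find('%') == npos) {
    return Kind::kNoAnyCharsWildcard;
  }
  const size_t first = pattern.find_first_not_of('%');
  if (first == npos) {
    return Kind::kOther; // The pattern consists of '%' only.
  }
  const size_t last = pattern.find_last_not_of('%');
  // The fixed part spans [first, last]. A '%' inside it means the
  // any-chars wildcard is not confined to one end of the pattern.
  if (pattern.substr(first, last - first + 1).find('%') != npos) {
    return Kind::kOther;
  }
  const bool leading = first > 0;
  const bool trailing = last + 1 < pattern.size();
  assert(leading || trailing);
  if (leading && trailing) {
    return Kind::kOther; // '%' on both sides: a substring-style pattern.
  }
  return trailing ? Kind::kRelaxedPrefix : Kind::kRelaxedSuffix;
}
```

Once a pattern is classified this way, the per-row work reduces to a length check plus a few fixed-offset comparisons, which is consistent with the relaxed cases landing close to like_prefix and like_suffix in the "After" numbers.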
Fixed the format issue, and I'm glad to write a blog post about LIKE :)
Great. Here is an example of a PR that adds a blog post: and here is a link to the blog: https://velox-lib.io/blog/reduce-agg Looking forward to a blog post about LIKE optimizations. Thanks.
Seeing errors:
@mbasmanova It is caused by the initialization logic of
@mbasmanova merged this pull request in ec3b082.
Conbench analyzed the 1 benchmark run on commit There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details.
@xumingming James, we are seeing correctness issues in production. Here is a simple repro:
We are going to revert this change. |
Summary: See facebookincubator#8050 Original commit changeset: c307f3f93949 Original Phabricator Diff: D52312645 Differential Revision: D53190666
…#8585) Summary: See facebookincubator#8050 Original commit changeset: c307f3f93949 Original Phabricator Diff: D52312645 Reviewed By: xiaoxmeng, bikramSingh91 Differential Revision: D53190666
This pull request has been reverted by c196497.
@mbasmanova Sorry for the inconvenience, the failing patterns are
Should I re-submit a PR with the bug fixed?
@xumingming James, thank you for taking a look. Feel free to re-submit the PR with the bug fixed and tests extended to catch it.