Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize LIKE for more relaxed patterns #8050

Conversation

xumingming
Copy link
Contributor

@xumingming xumingming commented Dec 14, 2023

In this PR we optimize LIKE operations for patterns which I call them
kRelaxed[Prefix|Suffix] patterns, e.g.

  • kRelaxedPrefix: _a_bc%%
  • kRelaxedSuffix: %%_a_bc

'Relaxed' here means there is less restrictions than their counterparts.
The algorithm of recognizing these relaxed patterns can be explained by
an example, say we have a pattern ___hello___%%, it is split into 4
sub-patterns:

  • [0] kSingleCharWildcard: ___
  • [1] kLiteralString: hello
  • [2] kSingleCharWildcard: ___
  • [3] kAnyCharsWildcard: %%

Since the 'kAnyCharsWildcard' only occurs at the end of the pattern, we can
determine it is a kRelaxedPrefix pattern, and then use the first 3 fixed
sub-patterns to do the matching.

The benchmark result:

Before(kGeneric):

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   1.34s   747.38m
----------------------------------------------------------------------------
----------------------------------------------------------------------------
like_prefix##like_prefix                                  340.30ms      2.94
like_prefix##like_relaxed_prefix_1                        334.77ms      2.99
like_prefix##like_relaxed_prefix_2                        350.70ms      2.85
like_prefix##starts_with                                    5.35ms    187.05
like_substring##like_substring                               1.26s   790.87m
like_substring##strpos                                     20.55ms     48.67
like_suffix##like_suffix                                  957.06ms      1.04
like_suffix##like_relaxed_suffix_1                        935.90ms      1.07
like_suffix##like_relaxed_suffix_2                           1.08s   926.79m
like_suffix##ends_with                                      5.35ms    187.07

After(kRelaxedPrefix, kRelaxedSuffix):

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   1.48s   674.92m
----------------------------------------------------------------------------
----------------------------------------------------------------------------
like_prefix##like_prefix                                    7.05ms    141.80
like_prefix##like_relaxed_prefix_1                          9.06ms    110.36
like_prefix##like_relaxed_prefix_2                          8.55ms    116.94
like_prefix##starts_with                                    5.34ms    187.22
like_substring##like_substring                             22.47ms     44.50
like_substring##strpos                                     20.72ms     48.27
like_suffix##like_suffix                                    7.05ms    141.82
like_suffix##like_relaxed_suffix_1                          9.08ms    110.16
like_suffix##like_relaxed_suffix_2                          8.52ms    117.30
like_suffix##ends_with                                      5.35ms    187.07

The speedup for kRelaxedPrefix is about 40x, speedup for kRelaxedSuffix is about 100x.

Copy link

netlify bot commented Dec 14, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 4fa5670
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65b3707501c83b0008294b59

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 14, 2023
@xumingming xumingming force-pushed the like_opt_more_flexible_patterns_2 branch from 8292717 to 953f09b Compare December 14, 2023 15:34
@mbasmanova
Copy link
Contributor

@xumingming Thank you for working on this.

kRelaxedPrefix: _a_bc%%
kRelaxedSuffix: %%_a_bc
kRelaxedSubstring: %%_a_bc%%

I haven't read the code yet, but I understand how to handle first 2 patterns, but I'm not sure what's the approach for implementing efficient matching for the last pattern. Would you clarify?

while (indexInString <= end) {
// Search the firstFixedString to find out where to start match the whole
// pattern.
auto it = strstr(input.begin() + indexInString, firstFixedString);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to apply strstr to non-zero-terminated strings. I feel this might not be safe. I'm also concerned about the performance of some edge cases where we need to keep running this loop many time shifting one char at a time. Would it be OK to not support this pattern, at least not in this PR, and focus on simple prefix and suffix patterns?

startPatternIndex)) {
return true;
} else {
// Not match the whole pattern, advance the cursor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be very slow for some cases like...

'aaaaaaa....aaaaa` LIKE '%a_b%'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Um... you are right, it is indeed slow for this kind of cases, will move kRelaxedSubstring optimization into separate PR.

@xumingming xumingming force-pushed the like_opt_more_flexible_patterns_2 branch 3 times, most recently from 234d5bd to c278222 Compare December 15, 2023 05:45
@xumingming
Copy link
Contributor Author

@mbasmanova Removed kRelaxedSubstring optimization from this PR and updated description.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xumingming Thank you for iterating. Have a question.

velox/functions/lib/Re2Functions.h Show resolved Hide resolved
/// k[Relaxed]Prefix, k[Relaxed]Suffix and k[Relaxed]Substring.
std::string fixedPattern_;
/// Contains the sub-patterns for k[Relaxed]Xxx patterns.
std::vector<SubPatternMetadata> subPatterns_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works, but I wonder if a simpler way to model this would be to have a single string (h_e_ll_o) and a vector of pairs of start + length to indicate a list of substrings to match:

h_e_ll_o
[[1, 1], [3, 1], [5, 2], [7, 1]]

This way we do not need to create lots of small strings.

I noticed that BM shows that relaxes patterns a bit slower than fixed. Do you happen to know where the slowness comes from?

Copy link
Contributor Author

@xumingming xumingming Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova Simplied the modeling as you suggested.

For the performance diff, I think its due to the complexity of relaxed pattern, take kPrefix and kRelaxedPrefix as example:

  • For kPrefix, for every input row, we only need to compare the prefix string once.
  • For kRelaxedPrefix, we need to access PatternMetadata::subPatternKinds_, PatternMetadata::subPatternRanges_ and do more string matching.

Might need to generate a flamegraph to confirm.

bool matchRelaxedFixedPattern(
StringView input,
const std::vector<SubPatternMetadata>& subPatterns,
size_t length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps, change the order of arguments to start, length (seems more conventional)

const std::vector<SubPatternMetadata>& subPatterns,
size_t length,
size_t start,
size_t startPatternIndex = 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this parameter used?


auto indexInString = start;
for (auto i = startPatternIndex; i < subPatterns.size(); i++) {
auto& subPattern = subPatterns[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

StringView input,
const std::vector<SubPatternMetadata>& subPatterns,
size_t length) {
if (input.size() < length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove this check since it is already included in matchRelaxedFixedPattern

we can remove this function altogether I think.

StringView input,
const std::vector<SubPatternMetadata>& subPatterns,
size_t length) {
if (input.size() < length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@xumingming xumingming force-pushed the like_opt_more_flexible_patterns_2 branch from c278222 to 39199a3 Compare December 15, 2023 17:48
std::string fixedPattern_;

/// Contains the sub-pattern kinds for k[Relaxed]Xxx patterns.
std::vector<SubPatternKind> subPatternKinds_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these? We can only have single wilfcards and fixed strings. It seems sufficient to track ranges of fixed strings, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to which range need to do exact match(e.g. 'abc'), which range only need to skip a certain length of string(e.g. '___', wildcards), only store range info is not eough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that there are only 2 possibilities: either skip or match. Hence, we can list only the ranges to match and assume the rest is to skip.

[[1, 3], [5, 4], [10, 7]] + pattern length 20 means

make sure there are at least 20 chars, then
skip to 1
match 3 chars
skip to 5
match 4 chars
skip to 10
match 7 chars

Would that work?

}

auto indexInString = start;
for (auto i = 0; i < patternMetadata.numSubPatterns(); i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop can be simplified if we just keep a list of ranges for fixed strings.

@xumingming xumingming force-pushed the like_opt_more_flexible_patterns_2 branch 4 times, most recently from 0e29466 to 0eebfa4 Compare December 18, 2023 08:46
@xumingming
Copy link
Contributor Author

@mbasmanova All comments addressed.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova
Copy link
Contributor

The speedup for kRelaxedPrefix is about 40x, speedup for kRelaxedSuffix is about 100x.

@xumingming The original optimization showed speed up of ~20x, but this is one showing much higher speedup. Is there some explanation?

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xumingming Overall looks good. Would you rebase?

}

// Compare each literal range.
for (auto i = 0; i < patternMetadata.numLiteralRanges(); i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Wondering if the code can be simplified by iterating over the vector of ranges.

for (const auto& range : patternMetadata.literalRanges()) {
   const auto start = range.first;
   const auto length = range.second;
   ....memcmp...
}

count++;
if (firstIndex == -1) {
firstIndex = index;
lastIndex = index;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is the same in both if and else branches; consider moving it outside of the if-else block

velox/functions/lib/Re2Functions.cpp Show resolved Hide resolved
velox/functions/lib/Re2Functions.cpp Outdated Show resolved Hide resolved
@mbasmanova
Copy link
Contributor

Seeing lot of linter warnings. Some examples:

Screenshot 2023-12-19 at 7 45 03 PM Screenshot 2023-12-19 at 7 45 24 PM Screenshot 2023-12-19 at 7 45 56 PM Screenshot 2023-12-19 at 7 46 14 PM

@mbasmanova
Copy link
Contributor

FYI, benchmark results I'm seeing

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   1.10s   906.81m
like_prefix##like_prefix                                   12.77ms     78.32
like_prefix##like_relaxed_prefix_1                         16.32ms     61.26
like_prefix##like_relaxed_prefix_2                         13.57ms     73.69
like_prefix##starts_with                                   13.21ms     75.69
like_substring##like_substring                             28.42ms     35.19
like_substring##strpos                                     22.80ms     43.86
like_suffix##like_suffix                                   13.63ms     73.38
like_suffix##like_relaxed_suffix_1                         17.26ms     57.95
like_suffix##like_relaxed_suffix_2                         14.42ms     69.35
like_suffix##ends_with                                     13.31ms     75.13

@@ -110,8 +110,20 @@ class Re2FunctionsTest : public test::FunctionBaseTest {
std::optional<std::string>{input});
};

// Print the failing pattern to make it easier to locate the failed case.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@xumingming
Copy link
Contributor Author

@xumingming Thank you for working on this. Looks fine to me % some nits. I feel we'll have to refactor the code for readability if we decide to add more logic.

@mbasmanova Sure, I plan to implement kRelaxedSubstring, I can do some refactoring before that.

@mbasmanova
Copy link
Contributor

I plan to implement kRelaxedSubstring, I can do some refactoring before that.

@xumingming Sounds good. Thanks.

@xumingming xumingming force-pushed the like_opt_more_flexible_patterns_2 branch from dddd77c to 4807393 Compare January 25, 2024 12:14
@xumingming
Copy link
Contributor Author

Updated the code to resolve all the review comments and rebased main.

@mbasmanova
Copy link
Contributor

@xumingming format-check is failing. Would you take a look?

Curious, would you like to write a blog post about all these optimizations for LIKE?

In this PR we optimize LIKE operations for patterns which I call them
kRelaxed[Prefix|Suffix] patterns, e.g.

- kRelaxedPrefix: _a_bc%%
- kRelaxedSuffix: %%_a_bc

'Relaxed' here means there is less restrictions than their counterparts.
The algorithm of recognizing these relaxed patterns can be explained by
an example, say we have a pattern '___hello___%%', it is split into 4
sub patterns:

- [0] kSingleCharWildcard: ___
- [1] kLiteralString: hello
- [2] kSingleCharWildcard: ___
- [3] kAnyCharsWildcard: %%

Since the 'kAnyCharsWildcard' only occurs at the end of the pattern, we can
determine it is a kRelaxedPrefix pattern, and then use the first 3 fixed
sub-patterns to do the matching.

The benchmark result:

Before(kGeneric):

```
============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   1.34s   747.38m
----------------------------------------------------------------------------
----------------------------------------------------------------------------
like_prefix##like_prefix                                  340.30ms      2.94
like_prefix##like_relaxed_prefix_1                        334.77ms      2.99
like_prefix##like_relaxed_prefix_2                        350.70ms      2.85
like_prefix##starts_with                                    5.35ms    187.05
like_substring##like_substring                               1.26s   790.87m
like_substring##strpos                                     20.55ms     48.67
like_suffix##like_suffix                                  957.06ms      1.04
like_suffix##like_relaxed_suffix_1                        935.90ms      1.07
like_suffix##like_relaxed_suffix_2                           1.08s   926.79m
like_suffix##ends_with                                      5.35ms    187.07
```

After(kRelaxedPrefix, kRelaxedSuffix):

```
============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   1.48s   674.92m
----------------------------------------------------------------------------
----------------------------------------------------------------------------
like_prefix##like_prefix                                    7.05ms    141.80
like_prefix##like_relaxed_prefix_1                          9.06ms    110.36
like_prefix##like_relaxed_prefix_2                          8.55ms    116.94
like_prefix##starts_with                                    5.34ms    187.22
like_substring##like_substring                             22.47ms     44.50
like_substring##strpos                                     20.72ms     48.27
like_suffix##like_suffix                                    7.05ms    141.82
like_suffix##like_relaxed_suffix_1                          9.08ms    110.16
like_suffix##like_relaxed_suffix_2                          8.52ms    117.30
like_suffix##ends_with                                      5.35ms    187.07
```

The speedup for kRelaxedPrefix is about 40x, speedup for kRelaxedSuffix is about 100x.
@xumingming xumingming force-pushed the like_opt_more_flexible_patterns_2 branch from 4807393 to b8f5280 Compare January 25, 2024 13:13
@xumingming
Copy link
Contributor Author

@xumingming format-check is failing. Would you take a look?

Curious, would you like to write a blog post about all these optimizations for LIKE?

Fixed the format issue, and glad to write a blog post for LIKE :)

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova
Copy link
Contributor

glad to write a blog post for LIKE :)

Great. Here is an example of a PR that adds a blog post:

#6851

and here is a link to the blog:

https://velox-lib.io/blog/reduce-agg

Looking forward to a blog post about LIKE optimizations. Thanks.

@mbasmanova
Copy link
Contributor

Seeing errors:

fbcode/velox/functions/lib/Re2Functions.cpp:1516:24: runtime error: load of value 272, which is not a valid value for type 'facebook::velox::functions::SubPatternKind'
    #0 0x7f732f7e4ee7 in facebook::velox::functions::parsePattern[abi:cxx11](std::basic_string_view<char, std::char_traits<char>>, std::optional<char>, std::vector<facebook::velox::functions::SubPatternKind, std::allocator<facebook::velox::functions::SubPatternKind>>&, std::vector<std::pair<unsigned long, unsigned long>, std::allocator<std::pair<unsigned long, unsigned long>>>&) fbcode/velox/functions/lib/Re2Functions.cpp:1516

SUMMARY: UndefinedBehaviorSanitizer: invalid-enum-load fbcode/velox/functions/lib/Re2Functions.cpp:1516:24 in 

@xumingming
Copy link
Contributor Author

Seeing errors:

fbcode/velox/functions/lib/Re2Functions.cpp:1516:24: runtime error: load of value 272, which is not a valid value for type 'facebook::velox::functions::SubPatternKind'
    #0 0x7f732f7e4ee7 in facebook::velox::functions::parsePattern[abi:cxx11](std::basic_string_view<char, std::char_traits<char>>, std::optional<char>, std::vector<facebook::velox::functions::SubPatternKind, std::allocator<facebook::velox::functions::SubPatternKind>>&, std::vector<std::pair<unsigned long, unsigned long>, std::allocator<std::pair<unsigned long, unsigned long>>>&) fbcode/velox/functions/lib/Re2Functions.cpp:1516

SUMMARY: UndefinedBehaviorSanitizer: invalid-enum-load fbcode/velox/functions/lib/Re2Functions.cpp:1516:24 in 

@mbasmanova It is caused by the initialization logic of previousKind, fixed.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mbasmanova merged this pull request in ec3b082.

Copy link

Conbench analyzed the 1 benchmark run on commit ec3b082f.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.

@mbasmanova
Copy link
Contributor

@xumingming James, we are seeing correctness issues in production. Here is a simple repro:

TEST_F(Re2FunctionsTest, xxx) {
  auto x = evaluateOnce<bool, std::string>(
      "c0 like 'fblearner_'", "fblearner_global");
  EXPECT_FALSE(x);

  x = evaluateOnce<bool, std::string>("c0 like 'fblearner_'", "fblearner");
  EXPECT_FALSE(x);

  x = evaluateOnce<bool, std::string>("c0 like 'fblearner_'", "fblearner_");
  EXPECT_TRUE(x);
}

We are going to revert this change.

mbasmanova added a commit to mbasmanova/velox-1 that referenced this pull request Jan 29, 2024
Summary:
See facebookincubator#8050

Original commit changeset: c307f3f93949

Original Phabricator Diff: D52312645

Differential Revision: D53190666
mbasmanova added a commit to mbasmanova/velox-1 that referenced this pull request Jan 29, 2024
…#8585)

Summary:

See facebookincubator#8050

Original commit changeset: c307f3f93949

Original Phabricator Diff: D52312645

Differential Revision: D53190666
mbasmanova added a commit to mbasmanova/velox-1 that referenced this pull request Jan 29, 2024
…#8585)

Summary:

See facebookincubator#8050

Original commit changeset: c307f3f93949

Original Phabricator Diff: D52312645

Reviewed By: xiaoxmeng, bikramSingh91

Differential Revision: D53190666
facebook-github-bot pushed a commit that referenced this pull request Jan 29, 2024
Summary:
Pull Request resolved: #8585

See #8050

Original commit changeset: c307f3f93949

Original Phabricator Diff: D52312645

Reviewed By: xiaoxmeng, bikramSingh91

Differential Revision: D53190666

fbshipit-source-id: 3b6b741b812cc3aa0f08584fde7d3106d4e2ba28
@facebook-github-bot
Copy link
Contributor

This pull request has been reverted by c196497.

@xumingming
Copy link
Contributor Author

@mbasmanova Sorry for the inconvenience , the failing patterns are kRelaxedFixed, the root cause is that currently we only compared the literal sub pattern, the missing check is:

  • For pure ASCII input: the length of input equals to the length of pattern
  • For unicode input: when all sub-patterns matches, the cursor should reach the end of the input

Should I re-submit a PR with the bug fixed?

@mbasmanova
Copy link
Contributor

@xumingming James, thank you for taking a look. Feel free to re-submit the PR with the bug fixed and tests extended to catch it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged Reverted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants