Optimize LIKE for more relaxed patterns #8050

xumingming · 2023-12-14T15:25:43Z

In this PR we optimize LIKE operations for patterns that I call
kRelaxed[Prefix|Suffix] patterns, e.g.

kRelaxedPrefix: _a_bc%%
kRelaxedSuffix: %%_a_bc

'Relaxed' here means there are fewer restrictions than in their counterparts.
The algorithm of recognizing these relaxed patterns can be explained by
an example, say we have a pattern ___hello___%%, it is split into 4
sub-patterns:

[0] kSingleCharWildcard: ___
[1] kLiteralString: hello
[2] kSingleCharWildcard: ___
[3] kAnyCharsWildcard: %%

Since the 'kAnyCharsWildcard' only occurs at the end of the pattern, we can
determine it is a kRelaxedPrefix pattern, and then use the first 3 fixed
sub-patterns to do the matching.
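
A minimal sketch of the recognition step described above, not the PR's actual code. The names `SubPattern`, `splitPattern`, `classify`, and this reduced `PatternKind` enum are hypothetical, escape characters are ignored, and the strict kinds (kPrefix, kSuffix) are not distinguished from the relaxed ones.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical types for illustration only.
enum class SubPatternKind { kSingleCharWildcard, kAnyCharsWildcard, kLiteralString };
enum class PatternKind { kRelaxedFixed, kRelaxedPrefix, kRelaxedSuffix, kGeneric };

struct SubPattern {
  SubPatternKind kind;
  std::string text;
};

SubPatternKind kindOf(char c) {
  if (c == '_') {
    return SubPatternKind::kSingleCharWildcard;
  }
  if (c == '%') {
    return SubPatternKind::kAnyCharsWildcard;
  }
  return SubPatternKind::kLiteralString;
}

// Groups consecutive characters of the same kind into one sub-pattern.
std::vector<SubPattern> splitPattern(std::string_view pattern) {
  std::vector<SubPattern> result;
  for (char c : pattern) {
    const SubPatternKind kind = kindOf(c);
    if (result.empty() || result.back().kind != kind) {
      result.push_back({kind, std::string()});
    }
    result.back().text.push_back(c);
  }
  return result;
}

// A single run of '%' at the end means prefix, at the start means suffix,
// no '%' at all means fixed; everything else falls back to generic.
PatternKind classify(const std::vector<SubPattern>& subPatterns) {
  std::size_t anyCharsCount = 0;
  for (const auto& subPattern : subPatterns) {
    if (subPattern.kind == SubPatternKind::kAnyCharsWildcard) {
      ++anyCharsCount;
    }
  }
  if (anyCharsCount == 0) {
    return PatternKind::kRelaxedFixed;
  }
  if (anyCharsCount == 1 && subPatterns.size() > 1) {
    if (subPatterns.back().kind == SubPatternKind::kAnyCharsWildcard) {
      return PatternKind::kRelaxedPrefix;
    }
    if (subPatterns.front().kind == SubPatternKind::kAnyCharsWildcard) {
      return PatternKind::kRelaxedSuffix;
    }
  }
  return PatternKind::kGeneric;
}
```

With this sketch, `splitPattern("___hello___%%")` yields the four sub-patterns listed above and `classify` reports kRelaxedPrefix.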

The benchmark result:

Before(kGeneric):

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   1.34s   747.38m
----------------------------------------------------------------------------
----------------------------------------------------------------------------
like_prefix##like_prefix                                  340.30ms      2.94
like_prefix##like_relaxed_prefix_1                        334.77ms      2.99
like_prefix##like_relaxed_prefix_2                        350.70ms      2.85
like_prefix##starts_with                                    5.35ms    187.05
like_substring##like_substring                               1.26s   790.87m
like_substring##strpos                                     20.55ms     48.67
like_suffix##like_suffix                                  957.06ms      1.04
like_suffix##like_relaxed_suffix_1                        935.90ms      1.07
like_suffix##like_relaxed_suffix_2                           1.08s   926.79m
like_suffix##ends_with                                      5.35ms    187.07

After(kRelaxedPrefix, kRelaxedSuffix):

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   1.48s   674.92m
----------------------------------------------------------------------------
----------------------------------------------------------------------------
like_prefix##like_prefix                                    7.05ms    141.80
like_prefix##like_relaxed_prefix_1                          9.06ms    110.36
like_prefix##like_relaxed_prefix_2                          8.55ms    116.94
like_prefix##starts_with                                    5.34ms    187.22
like_substring##like_substring                             22.47ms     44.50
like_substring##strpos                                     20.72ms     48.27
like_suffix##like_suffix                                    7.05ms    141.82
like_suffix##like_relaxed_suffix_1                          9.08ms    110.16
like_suffix##like_relaxed_suffix_2                          8.52ms    117.30
like_suffix##ends_with                                      5.35ms    187.07

The speedup for kRelaxedPrefix is about 40x, speedup for kRelaxedSuffix is about 100x.

netlify · 2023-12-14T15:25:49Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`4fa5670`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/65b3707501c83b0008294b59

mbasmanova · 2023-12-14T17:43:11Z

@xumingming Thank you for working on this.

kRelaxedPrefix: _a_bc%%
kRelaxedSuffix: %%_a_bc
kRelaxedSubstring: %%_a_bc%%

I haven't read the code yet. I understand how to handle the first 2 patterns, but I'm not sure what the approach is for efficiently matching the last pattern. Would you clarify?

mbasmanova · 2023-12-14T17:59:26Z

velox/functions/lib/Re2Functions.cpp

+  while (indexInString <= end) {
+    // Search the firstFixedString to find out where to start match the whole
+    // pattern.
+    auto it = strstr(input.begin() + indexInString, firstFixedString);


Is it safe to apply strstr to non-zero-terminated strings? I feel this might not be safe. I'm also concerned about the performance of some edge cases where we need to keep running this loop many times, shifting one char at a time. Would it be OK to not support this pattern, at least not in this PR, and focus on simple prefix and suffix patterns?
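
For context on the safety concern: `strstr` requires both arguments to be null-terminated, so it can read past the end of a buffer that only carries an explicit length. A length-bounded search avoids that. The sketch below assumes the input can be viewed as a `std::string_view`; `findBounded` is a hypothetical helper, not something from the PR.

```cpp
#include <cstddef>
#include <string_view>

// Returns the offset of the first occurrence of `needle` in `haystack` at or
// after `from`, or std::string_view::npos. The search never reads beyond
// haystack.size(), so no terminating '\0' is needed.
std::size_t findBounded(
    std::string_view haystack,
    std::string_view needle,
    std::size_t from) {
  return haystack.find(needle, from);
}
```

For example, a view over the first three bytes of the buffer `{'a', 'b', 'c', 'd'}` finds "bc" at offset 1 and does not find "cd", because the fourth byte lies outside the view.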

mbasmanova · 2023-12-14T18:00:51Z

velox/functions/lib/Re2Functions.cpp

+            startPatternIndex)) {
+      return true;
+    } else {
+      // Not match the whole pattern, advance the cursor.


This can be very slow for some cases like...

'aaaaaaa....aaaaa' LIKE '%a_b%'

Um... you are right, it is indeed slow for these kinds of cases. I will move the kRelaxedSubstring optimization into a separate PR.

xumingming · 2023-12-15T05:46:23Z

@mbasmanova Removed kRelaxedSubstring optimization from this PR and updated description.

mbasmanova

@xumingming Thank you for iterating. Have a question.

velox/functions/lib/Re2Functions.h

mbasmanova · 2023-12-15T08:11:55Z

velox/functions/lib/Re2Functions.h

+  /// k[Relaxed]Prefix, k[Relaxed]Suffix and k[Relaxed]Substring.
+  std::string fixedPattern_;
+  /// Contains the sub-patterns for k[Relaxed]Xxx patterns.
+  std::vector<SubPatternMetadata> subPatterns_;


This works, but I wonder if a simpler way to model this would be to have a single string (h_e_ll_o) and a vector of pairs of start + length to indicate a list of substrings to match:

h_e_ll_o
[[0, 1], [2, 1], [4, 2], [7, 1]]

This way we do not need to create lots of small strings.
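
A rough sketch of how such (start, length) pairs could be derived from the single pattern string. `literalRanges` is a hypothetical helper; it uses zero-based offsets and assumes the pattern contains only literals and '_' (no '%' and no escape character).

```cpp
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Returns the (start, length) of every maximal run of literal characters.
// Everything between two runs is a stretch of '_' wildcards to skip.
std::vector<std::pair<std::size_t, std::size_t>> literalRanges(
    std::string_view pattern) {
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  std::size_t i = 0;
  while (i < pattern.size()) {
    if (pattern[i] == '_') {
      ++i;
      continue;
    }
    const std::size_t start = i;
    while (i < pattern.size() && pattern[i] != '_') {
      ++i;
    }
    ranges.emplace_back(start, i - start);
  }
  return ranges;
}
```

With zero-based offsets, `literalRanges("h_e_ll_o")` gives (0, 1), (2, 1), (4, 2), (7, 1), and the pattern string itself is the only string that has to be stored.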

I noticed that the benchmark shows that relaxed patterns are a bit slower than fixed ones. Do you happen to know where the slowness comes from?

@mbasmanova Simplified the modeling as you suggested.

For the performance diff, I think it's due to the complexity of relaxed patterns. Take kPrefix and kRelaxedPrefix as an example:

For kPrefix, for every input row, we only need to compare the prefix string once.

For kRelaxedPrefix, we need to access PatternMetadata::subPatternKinds_, PatternMetadata::subPatternRanges_ and do more string matching.

Might need to generate a flamegraph to confirm.

mbasmanova · 2023-12-15T08:12:44Z

velox/functions/lib/Re2Functions.cpp

+bool matchRelaxedFixedPattern(
+    StringView input,
+    const std::vector<SubPatternMetadata>& subPatterns,
+    size_t length,


nit: perhaps, change the order of arguments to start, length (seems more conventional)

mbasmanova · 2023-12-15T08:13:31Z

velox/functions/lib/Re2Functions.cpp

+    const std::vector<SubPatternMetadata>& subPatterns,
+    size_t length,
+    size_t start,
+    size_t startPatternIndex = 0) {


Is this parameter used?

mbasmanova · 2023-12-15T08:14:07Z

velox/functions/lib/Re2Functions.cpp

+
+  auto indexInString = start;
+  for (auto i = startPatternIndex; i < subPatterns.size(); i++) {
+    auto& subPattern = subPatterns[i];


mbasmanova · 2023-12-15T08:15:14Z

velox/functions/lib/Re2Functions.cpp

+    StringView input,
+    const std::vector<SubPatternMetadata>& subPatterns,
+    size_t length) {
+  if (input.size() < length) {


we can remove this check since it is already included in matchRelaxedFixedPattern

we can remove this function altogether I think.

mbasmanova · 2023-12-15T08:15:21Z

velox/functions/lib/Re2Functions.cpp

+    StringView input,
+    const std::vector<SubPatternMetadata>& subPatterns,
+    size_t length) {
+  if (input.size() < length) {


mbasmanova · 2023-12-15T18:07:41Z

velox/functions/lib/Re2Functions.h

+  std::string fixedPattern_;
+
+  /// Contains the sub-pattern kinds for k[Relaxed]Xxx patterns.
+  std::vector<SubPatternKind> subPatternKinds_;


Do we need these? We can only have single wildcards and fixed strings. It seems sufficient to track ranges of fixed strings, no?

We need to know which ranges require an exact match (e.g. 'abc') and which ranges only need to skip a certain number of characters (e.g. '___', wildcards). Is storing only the range info enough?

I'm thinking that there are only 2 possibilities: either skip or match. Hence, we can list only the ranges to match and assume the rest is to skip.

[[1, 3], [5, 4], [10, 7]] + pattern length 20 means

make sure there are at least 20 chars, then
skip to 1
match 3 chars
skip to 5
match 4 chars
skip to 10
match 7 chars

Would that work?
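
The skip/match steps above could look roughly like this for the prefix case. This is a sketch under assumptions, not the PR's implementation: `matchLiteralRanges` is a hypothetical name, the input is ASCII (one byte per character), and every range lies within the pattern.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>
#include <utility>
#include <vector>

// `pattern` is the fixed part of the LIKE pattern (wildcards '_' included),
// `ranges` lists the (start, length) of its literal runs. Bytes outside the
// ranges are skipped without being compared.
bool matchLiteralRanges(
    std::string_view input,
    std::string_view pattern,
    const std::vector<std::pair<std::size_t, std::size_t>>& ranges) {
  // Make sure there are at least pattern.size() chars.
  if (input.size() < pattern.size()) {
    return false;
  }
  for (const auto& range : ranges) {
    const std::size_t start = range.first;
    const std::size_t length = range.second;
    if (std::memcmp(input.data() + start, pattern.data() + start, length) != 0) {
      return false;
    }
  }
  return true;
}
```

For the pattern `_a_bc` with ranges (1, 1) and (3, 2), the input `xaybcdef` matches as a prefix, while `xaybd` and `xayb` do not.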

mbasmanova · 2023-12-15T18:08:39Z

velox/functions/lib/Re2Functions.cpp

+  }
+
+  auto indexInString = start;
+  for (auto i = 0; i < patternMetadata.numSubPatterns(); i++) {


This loop can be simplified if we just keep a list of ranges for fixed strings.

xumingming · 2023-12-19T08:00:58Z

@mbasmanova All comments addressed.

facebook-github-bot · 2023-12-20T00:26:54Z

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

mbasmanova · 2023-12-20T00:28:26Z

The speedup for kRelaxedPrefix is about 40x, speedup for kRelaxedSuffix is about 100x.

@xumingming The original optimization showed a speedup of ~20x, but this one is showing a much higher speedup. Is there some explanation?

mbasmanova

@xumingming Overall looks good. Would you rebase?

mbasmanova · 2023-12-20T00:30:24Z

velox/functions/lib/Re2Functions.cpp

+  }
+
+  // Compare each literal range.
+  for (auto i = 0; i < patternMetadata.numLiteralRanges(); i++) {


nit: Wondering if the code can be simplified by iterating over the vector of ranges.

for (const auto& range : patternMetadata.literalRanges()) {
  const auto start = range.first;
  const auto length = range.second;
  ....memcmp...
}

mbasmanova · 2023-12-20T00:31:47Z

velox/functions/lib/Re2Functions.cpp

+    count++;
+    if (firstIndex == -1) {
+      firstIndex = index;
+      lastIndex = index;


this code is the same in both if and else branches; consider moving it outside of the if-else block

velox/functions/lib/Re2Functions.cpp

mbasmanova · 2023-12-20T00:46:52Z

Seeing a lot of linter warnings. Some examples:

mbasmanova · 2023-12-20T00:47:47Z

FYI, benchmark results I'm seeing

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   1.10s   906.81m
like_prefix##like_prefix                                   12.77ms     78.32
like_prefix##like_relaxed_prefix_1                         16.32ms     61.26
like_prefix##like_relaxed_prefix_2                         13.57ms     73.69
like_prefix##starts_with                                   13.21ms     75.69
like_substring##like_substring                             28.42ms     35.19
like_substring##strpos                                     22.80ms     43.86
like_suffix##like_suffix                                   13.63ms     73.38
like_suffix##like_relaxed_suffix_1                         17.26ms     57.95
like_suffix##like_relaxed_suffix_2                         14.42ms     69.35
like_suffix##ends_with                                     13.31ms     75.13

mbasmanova · 2023-12-20T01:11:05Z

velox/functions/lib/tests/Re2FunctionsTest.cpp

@@ -110,8 +110,20 @@ class Re2FunctionsTest : public test::FunctionBaseTest {
            std::optional<std::string>{input});
      };

+      // Print the failing pattern to make it easier to locate the failed case.


May use SCOPED_TRACE: https://github.com/google/googletest/blob/main/docs/advanced.md#adding-traces-to-assertions

facebook-github-bot · 2024-01-23T16:45:52Z

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

xumingming · 2024-01-24T14:35:49Z

@xumingming Thank you for working on this. Looks fine to me % some nits. I feel we'll have to refactor the code for readability if we decide to add more logic.

@mbasmanova Sure, I plan to implement kRelaxedSubstring, I can do some refactoring before that.

mbasmanova · 2024-01-25T06:37:25Z

I plan to implement kRelaxedSubstring, I can do some refactoring before that.

@xumingming Sounds good. Thanks.

xumingming · 2024-01-25T12:18:57Z

Updated the code to resolve all the review comments and rebased main.

mbasmanova · 2024-01-25T13:07:59Z

@xumingming format-check is failing. Would you take a look?

Curious, would you like to write a blog post about all these optimizations for LIKE?

xumingming · 2024-01-25T13:15:47Z

@xumingming format-check is failing. Would you take a look?

Curious, would you like to write a blog post about all these optimizations for LIKE?

Fixed the format issue, and glad to write a blog post for LIKE :)

facebook-github-bot · 2024-01-25T13:19:21Z

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

mbasmanova · 2024-01-25T13:20:58Z

glad to write a blog post for LIKE :)

Great. Here is an example of a PR that adds a blog post:

#6851

and here is a link to the blog:

https://velox-lib.io/blog/reduce-agg

Looking forward to a blog post about LIKE optimizations. Thanks.

mbasmanova · 2024-01-25T16:15:07Z

Seeing errors:

fbcode/velox/functions/lib/Re2Functions.cpp:1516:24: runtime error: load of value 272, which is not a valid value for type 'facebook::velox::functions::SubPatternKind'
    #0 0x7f732f7e4ee7 in facebook::velox::functions::parsePattern[abi:cxx11](std::basic_string_view<char, std::char_traits<char>>, std::optional<char>, std::vector<facebook::velox::functions::SubPatternKind, std::allocator<facebook::velox::functions::SubPatternKind>>&, std::vector<std::pair<unsigned long, unsigned long>, std::allocator<std::pair<unsigned long, unsigned long>>>&) fbcode/velox/functions/lib/Re2Functions.cpp:1516

SUMMARY: UndefinedBehaviorSanitizer: invalid-enum-load fbcode/velox/functions/lib/Re2Functions.cpp:1516:24 in

xumingming · 2024-01-26T08:41:59Z

Seeing errors:

fbcode/velox/functions/lib/Re2Functions.cpp:1516:24: runtime error: load of value 272, which is not a valid value for type 'facebook::velox::functions::SubPatternKind'
    #0 0x7f732f7e4ee7 in facebook::velox::functions::parsePattern[abi:cxx11](std::basic_string_view<char, std::char_traits<char>>, std::optional<char>, std::vector<facebook::velox::functions::SubPatternKind, std::allocator<facebook::velox::functions::SubPatternKind>>&, std::vector<std::pair<unsigned long, unsigned long>, std::allocator<std::pair<unsigned long, unsigned long>>>&) fbcode/velox/functions/lib/Re2Functions.cpp:1516

SUMMARY: UndefinedBehaviorSanitizer: invalid-enum-load fbcode/velox/functions/lib/Re2Functions.cpp:1516:24 in

@mbasmanova It is caused by the initialization logic of previousKind, fixed.
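
The thread does not show the actual fix. As an illustration only, one way to avoid reading an indeterminate enum value in a parse loop like this is to keep the previous kind in a `std::optional` that starts empty. `subPatternKinds` and the enum below are hypothetical and ignore escape characters.

```cpp
#include <optional>
#include <string_view>
#include <vector>

enum class SubPatternKind { kSingleCharWildcard, kAnyCharsWildcard, kLiteralString };

// Returns the kind of each run of same-kind characters in the pattern.
std::vector<SubPatternKind> subPatternKinds(std::string_view pattern) {
  std::vector<SubPatternKind> kinds;
  // Empty rather than uninitialized: the first comparison below is
  // well-defined and always reports "different".
  std::optional<SubPatternKind> previousKind;
  for (char c : pattern) {
    SubPatternKind kind = SubPatternKind::kLiteralString;
    if (c == '_') {
      kind = SubPatternKind::kSingleCharWildcard;
    } else if (c == '%') {
      kind = SubPatternKind::kAnyCharsWildcard;
    }
    if (previousKind != kind) {
      kinds.push_back(kind);
    }
    previousKind = kind;
  }
  return kinds;
}
```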

facebook-github-bot · 2024-01-26T13:50:43Z

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-01-26T16:52:14Z

@mbasmanova merged this pull request in ec3b082.

conbench-facebook · 2024-01-26T17:16:13Z

Conbench analyzed the 1 benchmark run on commit ec3b082f.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.

mbasmanova · 2024-01-29T19:02:09Z

@xumingming James, we are seeing correctness issues in production. Here is a simple repro:

TEST_F(Re2FunctionsTest, xxx) {
  auto x = evaluateOnce<bool, std::string>(
      "c0 like 'fblearner_'", "fblearner_global");
  EXPECT_FALSE(x);

  x = evaluateOnce<bool, std::string>("c0 like 'fblearner_'", "fblearner");
  EXPECT_FALSE(x);

  x = evaluateOnce<bool, std::string>("c0 like 'fblearner_'", "fblearner_");
  EXPECT_TRUE(x);
}

We are going to revert this change.

Summary: Pull Request resolved: #8585 See #8050 Original commit changeset: c307f3f93949 Original Phabricator Diff: D52312645 Reviewed By: xiaoxmeng, bikramSingh91 Differential Revision: D53190666 fbshipit-source-id: 3b6b741b812cc3aa0f08584fde7d3106d4e2ba28

facebook-github-bot · 2024-01-29T23:51:01Z

This pull request has been reverted by c196497.

xumingming · 2024-01-30T00:16:34Z

@mbasmanova Sorry for the inconvenience. The failing patterns are kRelaxedFixed. The root cause is that we currently only compare the literal sub-patterns; the missing check is:

For pure ASCII input: the length of the input must equal the length of the pattern
For unicode input: when all sub-patterns match, the cursor should have reached the end of the input

Should I re-submit a PR with the bug fixed?
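
To make the ASCII half of that check concrete, here is a sketch using the repro above. `matchRelaxedFixedAscii` is a hypothetical name, the ranges are zero-based (start, length) pairs of literal runs, and the Unicode case (tracking a cursor over code points) is not covered.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>
#include <utility>
#include <vector>

bool matchRelaxedFixedAscii(
    std::string_view input,
    std::string_view pattern,
    const std::vector<std::pair<std::size_t, std::size_t>>& ranges) {
  // A pattern without '%' must consume the whole input, so for ASCII the
  // lengths have to be equal. Comparing only the literal ranges would let
  // longer inputs such as 'fblearner_global' slip through.
  if (input.size() != pattern.size()) {
    return false;
  }
  for (const auto& range : ranges) {
    if (std::memcmp(
            input.data() + range.first,
            pattern.data() + range.first,
            range.second) != 0) {
      return false;
    }
  }
  return true;
}
```

For the pattern `fblearner_` with the single literal range (0, 9), this accepts `fblearner_` and rejects both `fblearner` and `fblearner_global`, which is the behavior the repro expects.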

mbasmanova · 2024-01-30T00:21:28Z

@xumingming James, thank you for taking a look. Feel free to re-submit the PR with the bug fixed and tests extended to catch it.

facebook-github-bot added the CLA Signed label Dec 14, 2023

xumingming force-pushed the like_opt_more_flexible_patterns_2 branch from 8292717 to 953f09b Compare December 14, 2023 15:34

mbasmanova requested review from spershin, Yuhta, laithsakka and aditi-pandit December 14, 2023 17:41

mbasmanova reviewed Dec 14, 2023

xumingming force-pushed the like_opt_more_flexible_patterns_2 branch 3 times, most recently from 234d5bd to c278222 Compare December 15, 2023 05:45

mbasmanova reviewed Dec 15, 2023

xumingming force-pushed the like_opt_more_flexible_patterns_2 branch from c278222 to 39199a3 Compare December 15, 2023 17:48

mbasmanova reviewed Dec 15, 2023

xumingming force-pushed the like_opt_more_flexible_patterns_2 branch 4 times, most recently from 0e29466 to 0eebfa4 Compare December 18, 2023 08:46

mbasmanova reviewed Dec 20, 2023

mbasmanova approved these changes Dec 20, 2023

mbasmanova reviewed Dec 20, 2023

xumingming force-pushed the like_opt_more_flexible_patterns_2 branch from dddd77c to 4807393 Compare January 25, 2024 12:14

xumingming force-pushed the like_opt_more_flexible_patterns_2 branch from 4807393 to b8f5280 Compare January 25, 2024 13:13

Fix previousKind initialization

4fa5670

facebook-github-bot closed this in ec3b082 Jan 26, 2024

facebook-github-bot added the Merged label Jan 26, 2024

mbasmanova mentioned this pull request Jan 29, 2024

Back out "Optimize LIKE for more relaxed patterns" #8585

Closed

facebook-github-bot added the Reverted label Jan 29, 2024
