Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance split Spark function to support regex #6155

Closed
wants to merge 2 commits into from

Conversation

unigof
Copy link

@unigof unigof commented Aug 18, 2023

Description

Split function just support split string by a single character for now.
This pr to support split by regex string.

split(str, regex)
    Splits `str` around occurrences that match `regex` and returns an array as many times as possible
    Arguments:
      * str - a string expression to split.
      * regex - a string representing a regular expression. The regex string should be a regular expression. It supported by RE2.

Examples:

 SELECT split('oneAtwoBthreeC', '[ABC]'); -- ["one","two","three",""]
 SELECT split('one', ''); -- ["o", "n", "e", ""]
 SELECT split('one', '1'); -- ["one"]

Notice:

There are some semantic diff between Java regex and re2 regex. And the lookahead/lookbehind patterns are not supported by RE2. If used, there will be a runtime exception (compile pattern error). e.g. :

  • Positive Lookahead: (?=regex)
  • Negative Lookahead: (?!regex)
  • Positive Lookbehind: (?<=regex)
  • Negative Lookbehind:(?<!regex)

#6017

@netlify
Copy link

netlify bot commented Aug 18, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 9d97c73
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/665e75cb46ae0100083afe7c

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 18, 2023
@unigof
Copy link
Author

unigof commented Aug 25, 2023

Hi @mbasmanova @beroyfb ,could you help me review this pr? Thank you very much

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unigof Please, review contributing guide and coding style docs and update the PR to comply.

@unigof unigof changed the title velox split function support pattern of string type Enhance split Spark function to support regex Sep 14, 2023
@unigof
Copy link
Author

unigof commented Sep 14, 2023

Hi @mbasmanova @PHILO-HE @jackylee-ch , can you review this pr? Thank you

@@ -131,21 +131,20 @@ Unless specified otherwise, all functions return NULL if at least one of the arg

.. spark:function:: split(string, delimiter) -> array(string)

Splits ``string`` on ``delimiter`` and returns an array. ::
Splits ``string`` on ``delimiter`` as much as possible and returns an array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the original implementation wrong? It treated 'delimiter' as a strings, but it should have treated it as regular expression.

Does Spark use Joni for regex or something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mbasmanova, let me help clarify. The original version only implements the basic split version in which the pattern is a single character. So this PR will address the limitations.

I just found spark uses JDK's regex.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In velox, split function treated 'delimiter' as a char, but in spark it treated 'delimiter' as a strings and support regular expression.

velox/docs/functions/spark/string.rst Outdated Show resolved Hide resolved
velox/docs/functions/spark/string.rst Outdated Show resolved Hide resolved
velox/docs/functions/spark/string.rst Show resolved Hide resolved
@@ -131,21 +131,20 @@ Unless specified otherwise, all functions return NULL if at least one of the arg

.. spark:function:: split(string, delimiter) -> array(string)

Splits ``string`` on ``delimiter`` and returns an array. ::
Splits ``string`` on ``delimiter`` as much as possible and returns an array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mbasmanova, let me help clarify. The original version only implements the basic split version in which the pattern is a single character. So this PR will address the limitations.

I just found spark uses JDK's regex.

velox/functions/lib/Re2Functions.cpp Outdated Show resolved Hide resolved
@mbasmanova
Copy link
Contributor

Hi @mbasmanova, let me help clarify. The original version only implements the basic split version in which the pattern is a single character. So this PR will address the limitations.

Let's update PR description to clarify this.

I just found spark uses JDK's regex.

I see. Do you know whether it uses Joni, RE2 or some other regex engine?

@PHILO-HE
Copy link
Contributor

I just found spark uses JDK's regex.

I see. Do you know whether it uses Joni, RE2 or some other regex engine?

Neither Joni nor RE2. It looks JDK's own regex engine. See link.

@mbasmanova
Copy link
Contributor

Neither Joni nor RE2. It looks JDK's own regex engine.

Interesting. Isn't it very slow (especially compared with RE2)?

@PHILO-HE
Copy link
Contributor

Neither Joni nor RE2. It looks JDK's own regex engine.

Interesting. Isn't it very slow (especially compared with RE2)?

Yes, I also guess it is very slow. And RE2 claimed here.

@unigof
Copy link
Author

unigof commented Sep 20, 2023

hi @mbasmanova , could you merge this pr, if no other problems?

@mbasmanova
Copy link
Contributor

@rui-mo Rui, can you help review this PR?

@@ -131,14 +131,24 @@ Unless specified otherwise, all functions return NULL if at least one of the arg

.. spark:function:: split(string, delimiter) -> array(string)

Splits ``string`` on ``delimiter`` and returns an array. ::
Splits ``string`` on ``delimiter`` and returns an array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to use regex instead of delimiter.

Suggested description:
Returns an array by splitting string as many times as possible. The delimiter is any string matching regex.


SELECT split('oneAtwoBthreeC', '[ABC]'); -- ["one","two","three",""]
SELECT split('one', ''); -- ["o", "n", "e", ""]
SELECT split('one', '1'); -- ["one"]

.. spark:function:: split(string, delimiter, limit) -> array(string)
:noindex:
.. spark:function:: split(string, delimiter, limit) -> array(string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No extra space.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use regex instead.

.. spark:function:: split(string, delimiter, limit) -> array(string)
:noindex:
.. spark:function:: split(string, delimiter, limit) -> array(string)
:noindex:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

.. spark:function:: split(string, delimiter, limit) -> array(string)
:noindex:
.. spark:function:: split(string, delimiter, limit) -> array(string)
:noindex:

Splits ``string`` on ``delimiter`` and returns an array of size at most ``limit``. ::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description should also be refined like the 1st function version. And it is better to tell user the first version is equivalent to -1 is used for limit.

EvalCtx& context,
VectorPtr& resultRef) const final {
try {
checkForBadPattern(re_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to move this check to makeRe2SplitAll to get bad pattern checked one time.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I understand your means, but I think may it is better to keep it up now, because there is context in apply method, it can record exception for bad pattern.
If move checkForBadPattern to makeRe2SplitAll, maybe unable to find problems conveniently

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks not a strong reason. As the pattern is always constant, I think it may be not necessary to do this check for each incoming batch. cc @mbasmanova

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If pattern is constant, then let's make sure to check it once during function creation and use AlwaysFailingVectorFunction.

``string`` - a string expression to split, constant
``delimiter`` - a string representing a regular expression, supported by re2.
Notice:
There are some usage not supported by re2 regex. If used, there will be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this clarification. RE2 indeed has a few incompatible patterns with other libs. It may be better to just clarify them at some common place, instead of for only a certain regex function.

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

``string`` - a string expression to split, constant
``delimiter`` - a string representing a regular expression, supported by re2.
Notice:
There are some usage not supported by re2 regex. If used, there will be
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these unsupported patterns needed by Spark? We may add unit tests for them if exception is expected.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark not support these patterns, just split string limit times

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any pattern supported by JDK's regex in Spark, but not supported by RE2?

.. spark:function:: split(string, delimiter, limit) -> array(string)
:noindex:
.. spark:function:: split(string, delimiter, limit) -> array(string)
:noindex:

Splits ``string`` on ``delimiter`` and returns an array of size at most ``limit``. ::
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also document the behavior when limit is negative? We may also need to add corresponding tests.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@unigof
Copy link
Author

unigof commented Sep 26, 2023

hi @rui-mo , can you review this pr again? thanks~


Splits ``string`` on ``delimiter`` and returns an array. ::
Returns an array by splitting ``string`` as many times as possible.
The ``regex`` is any string matching regex, supported by re2.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested statement:
The delimiter is any string matching regex

EvalCtx& context,
VectorPtr& resultRef) const final {
try {
checkForBadPattern(re_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks not a strong reason. As the pattern is always constant, I think it may be not necessary to do this check for each incoming batch. cc @mbasmanova

@@ -135,6 +135,21 @@ std::shared_ptr<exec::VectorFunction> makeRe2ExtractAll(

std::vector<std::shared_ptr<exec::FunctionSignature>> re2ExtractAllSignatures();

/// re2SplitAll(string, pattern) → array<string>
///
/// If string has a substring that matches the given pattern, then will be split
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be better to re-use some statements in string.rst to document it.

arrayWriter.add_item().setNoCopy(
StringView(remaining.data(), remaining.size()));
} else if (pos == input.size()) {
arrayWriter.add_item().setNoCopy(StringView(nullptr, 0));

This comment was marked as outdated.

const std::vector<VectorFunctionArg>& inputArgs,
const core::QueryConfig& /*config*/) {
auto numArgs = inputArgs.size();
VELOX_USER_CHECK(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use VELOX_USER_CHECK_EQ.

},
[&inputs](vector_size_t row) { return !inputs[row].has_value(); });

// Constant pattern
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add . at the end of all comments.


TEST_F(Re2FunctionsTest, regexSplitAllNonAscii) {
testRe2SplitAll(
// split('苹果香蕉velox橘子', '香蕉')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these commented cases needed?

@@ -730,8 +730,6 @@ void re2SplitAll(
const re2::StringPiece remaining = input.substr(pos);
arrayWriter.add_item().setNoCopy(
StringView(remaining.data(), remaining.size()));
} else if (pos == input.size()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks relevant to test failure. Looks this if branch is necessary. I thought wrongly.

@FelixYBW
Copy link

Is the function still tracked?

Copy link

stale bot commented Mar 22, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added stale and removed stale labels Mar 22, 2024
@jackylee-ch
Copy link
Contributor

cc @PHILO-HE @rui-mo

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically looks good! Some comments. Please check whether they make sense. Thanks!

BTW, please squash historical commits into one or fewer ones (the community prefers that).

const re2::StringPiece remaining = input.substr(pos);
arrayWriter.add_item().setNoCopy(
StringView(remaining.data(), remaining.size()));
} else if (pos == input.size()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else if -> else?

:noindex:

Splits ``string`` on ``delimiter`` and returns an array of size at most ``limit``. ::
Splits ``string`` on ``regex`` and returns an array of size at most ``limit``.
If limit is negative, ``string`` will be split as many times as possible. ::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: add "``" to surround word "limit".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If limit is zero, the behavior is like negative. So let's clarify zero case.
See spark doc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems split with limit has not been supported and it will be covered by another PR: #8825. Maybe, we can remove the above relevant changes, and let that PR fix (if you agree, please ignore the above relevant comments).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems split with limit has not been supported and it will be covered by another PR: #8825. Maybe, we can remove the above relevant changes, and let that PR fix (if you agree, please ignore the above relevant comments).

It's ok to me

:noindex:

Splits ``string`` on ``delimiter`` and returns an array of size at most ``limit``. ::
Splits ``string`` on ``regex`` and returns an array of size at most ``limit``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested statement:
** returns an array whose size is ``limit`` at most.

try {
return std::make_shared<Re2SplitAllConstantPattern>(pattern);
} catch (...) {
return std::make_shared<exec::AlwaysFailingVectorFunction>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks the exception only comes from checkForBadPattern. I remember makeRe2SplitAll can be called from Gluten's plan validation path. So it's better to remove catch to let exception thrown, then to make fallback happen on Gluten side.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have regex check in Gluten, thus we can use AlwaysFailingVectorFunction here.

/// The pattern is any string matching regex.
///
/// If the pattern is invalid or not constant, throws an exception.
/// If the pattern does not match, returns original string as array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document these cases in string.rst.

@jackylee-ch jackylee-ch force-pushed the dev_split branch 3 times, most recently from fbcec03 to 926b2e9 Compare April 14, 2024 16:33
@jackylee-ch
Copy link
Contributor

cc @PHILO-HE @rui-mo

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackylee-ch, I meant we can only update the doc for this function without limit arg, as I note limit arg is not considered in this pr. Right?
Except this, the pr looks good! Thanks!

@jackylee-ch
Copy link
Contributor

@jackylee-ch, I meant we can only update the doc for this function without limit arg, as I note limit arg is not considered in this pr. Right? Except this, the pr looks good! Thanks!

The limit in split is not supported in current main branch, thus we need remove the split doc that contains limit.

@FelixYBW
Copy link

@PHILO-HE is the PR related to Cody's re2 support?

@jackylee-ch jackylee-ch force-pushed the dev_split branch 2 times, most recently from d513611 to 5b627d6 Compare April 29, 2024 02:33
auto pattern = constantPattern->as<ConstantVector<StringView>>()->valueAt(0);

try {
return std::make_shared<Re2SplitAllConstantPattern>(pattern);
Copy link
Contributor

@gaoyangxiaozhu gaoyangxiaozhu Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so looks current implement only support constant pattern, how about if no constant ?

@unigof , I have a PR #10248 to cover any (constant or non-constant) regex pattern and also limit parameter , but since you already started this part work, does it ok you apply the PR change into your PR ?

Copy link
Collaborator

@rui-mo rui-mo Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaoyangxiaozhu Just notice non-constant vector is also supported in your PR. Please feel free to continue its support in it if needed. The handling of some corner cases could be referred from mine https://github.com/facebookincubator/velox/pull/8825/files#diff-ebc054da2ecc11ac3cf20d47cbc112f9c3b9cf3d5f1784b2149fb0f4e997a2d8R127-R132. Thanks.

Copy link

stale bot commented Sep 18, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label Sep 18, 2024
@stale stale bot closed this Oct 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants