-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance split Spark function to support regex #6155
Conversation
✅ Deploy Preview for meta-velox canceled.
|
Hi @mbasmanova @beroyfb ,could you help me review this pr? Thank you very much |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@unigof Please, review contributing guide and coding style docs and update the PR to comply.
Hi @mbasmanova @PHILO-HE @jackylee-ch , can you review this pr? Thank you |
@@ -131,21 +131,20 @@ Unless specified otherwise, all functions return NULL if at least one of the arg | |||
|
|||
.. spark:function:: split(string, delimiter) -> array(string) | |||
|
|||
Splits ``string`` on ``delimiter`` and returns an array. :: | |||
Splits ``string`` on ``delimiter`` as much as possible and returns an array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was the original implementation wrong? It treated 'delimiter' as a strings, but it should have treated it as regular expression.
Does Spark use Joni for regex or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mbasmanova, let me help clarify. The original version only implements the basic split version in which the pattern is a single character. So this PR will address the limitations.
I just found spark uses JDK's regex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In velox, split function treated 'delimiter' as a char, but in spark it treated 'delimiter' as a strings and support regular expression.
@@ -131,21 +131,20 @@ Unless specified otherwise, all functions return NULL if at least one of the arg | |||
|
|||
.. spark:function:: split(string, delimiter) -> array(string) | |||
|
|||
Splits ``string`` on ``delimiter`` and returns an array. :: | |||
Splits ``string`` on ``delimiter`` as much as possible and returns an array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mbasmanova, let me help clarify. The original version only implements the basic split version in which the pattern is a single character. So this PR will address the limitations.
I just found spark uses JDK's regex.
Let's update PR description to clarify this.
I see. Do you know whether it uses Joni, RE2 or some other regex engine? |
Neither Joni nor RE2. It looks JDK's own regex engine. See link. |
Interesting. Isn't it very slow (especially compared with RE2)? |
Yes, I also guess it is very slow. And RE2 claimed here. |
hi @mbasmanova , could you merge this pr, if no other problems? |
@rui-mo Rui, can you help review this PR? |
@@ -131,14 +131,24 @@ Unless specified otherwise, all functions return NULL if at least one of the arg | |||
|
|||
.. spark:function:: split(string, delimiter) -> array(string) | |||
|
|||
Splits ``string`` on ``delimiter`` and returns an array. :: | |||
Splits ``string`` on ``delimiter`` and returns an array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to use regex
instead of delimiter
.
Suggested description:
Returns an array by splitting string
as many times as possible. The delimiter is any string matching regex
.
|
||
SELECT split('oneAtwoBthreeC', '[ABC]'); -- ["one","two","three",""] | ||
SELECT split('one', ''); -- ["o", "n", "e", ""] | ||
SELECT split('one', '1'); -- ["one"] | ||
|
||
.. spark:function:: split(string, delimiter, limit) -> array(string) | ||
:noindex: | ||
.. spark:function:: split(string, delimiter, limit) -> array(string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No extra space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use regex instead.
.. spark:function:: split(string, delimiter, limit) -> array(string) | ||
:noindex: | ||
.. spark:function:: split(string, delimiter, limit) -> array(string) | ||
:noindex: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
.. spark:function:: split(string, delimiter, limit) -> array(string) | ||
:noindex: | ||
.. spark:function:: split(string, delimiter, limit) -> array(string) | ||
:noindex: | ||
|
||
Splits ``string`` on ``delimiter`` and returns an array of size at most ``limit``. :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This description should also be refined like the 1st function version. And it is better to tell user the first version is equivalent to -1 is used for limit
.
velox/functions/lib/Re2Functions.cpp
Outdated
EvalCtx& context, | ||
VectorPtr& resultRef) const final { | ||
try { | ||
checkForBadPattern(re_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to move this check to makeRe2SplitAll
to get bad pattern checked one time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I understand your means, but I think may it is better to keep it up now, because there is context
in apply
method, it can record exception for bad pattern.
If move checkForBadPattern
to makeRe2SplitAll
, maybe unable to find problems conveniently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks not a strong reason. As the pattern is always constant, I think it may be not necessary to do this check for each incoming batch. cc @mbasmanova
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If pattern is constant, then let's make sure to check it once during function creation and use AlwaysFailingVectorFunction.
``string`` - a string expression to split, constant | ||
``delimiter`` - a string representing a regular expression, supported by re2. | ||
Notice: | ||
There are some usage not supported by re2 regex. If used, there will be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove this clarification. RE2 indeed has a few incompatible patterns with other libs. It may be better to just clarify them at some common place, instead of for only a certain regex function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add Spark's implementation to the PR description? Thanks.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L526
``string`` - a string expression to split, constant | ||
``delimiter`` - a string representing a regular expression, supported by re2. | ||
Notice: | ||
There are some usage not supported by re2 regex. If used, there will be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these unsupported patterns needed by Spark? We may add unit tests for them if exception is expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark not support these patterns, just split string
limit
times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any pattern supported by JDK's regex in Spark, but not supported by RE2?
.. spark:function:: split(string, delimiter, limit) -> array(string) | ||
:noindex: | ||
.. spark:function:: split(string, delimiter, limit) -> array(string) | ||
:noindex: | ||
|
||
Splits ``string`` on ``delimiter`` and returns an array of size at most ``limit``. :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also document the behavior when limit is negative? We may also need to add corresponding tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
hi @rui-mo , can you review this pr again? thanks~ |
|
||
Splits ``string`` on ``delimiter`` and returns an array. :: | ||
Returns an array by splitting ``string`` as many times as possible. | ||
The ``regex`` is any string matching regex, supported by re2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested statement:
The delimiter is any string matching regex
velox/functions/lib/Re2Functions.cpp
Outdated
EvalCtx& context, | ||
VectorPtr& resultRef) const final { | ||
try { | ||
checkForBadPattern(re_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks not a strong reason. As the pattern is always constant, I think it may be not necessary to do this check for each incoming batch. cc @mbasmanova
velox/functions/lib/Re2Functions.h
Outdated
@@ -135,6 +135,21 @@ std::shared_ptr<exec::VectorFunction> makeRe2ExtractAll( | |||
|
|||
std::vector<std::shared_ptr<exec::FunctionSignature>> re2ExtractAllSignatures(); | |||
|
|||
/// re2SplitAll(string, pattern) → array<string> | |||
/// | |||
/// If string has a substring that matches the given pattern, then will be split |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be better to re-use some statements in string.rst
to document it.
arrayWriter.add_item().setNoCopy( | ||
StringView(remaining.data(), remaining.size())); | ||
} else if (pos == input.size()) { | ||
arrayWriter.add_item().setNoCopy(StringView(nullptr, 0)); |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
velox/functions/lib/Re2Functions.cpp
Outdated
const std::vector<VectorFunctionArg>& inputArgs, | ||
const core::QueryConfig& /*config*/) { | ||
auto numArgs = inputArgs.size(); | ||
VELOX_USER_CHECK( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use VELOX_USER_CHECK_EQ
.
}, | ||
[&inputs](vector_size_t row) { return !inputs[row].has_value(); }); | ||
|
||
// Constant pattern |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add .
at the end of all comments.
|
||
TEST_F(Re2FunctionsTest, regexSplitAllNonAscii) { | ||
testRe2SplitAll( | ||
// split('苹果香蕉velox橘子', '香蕉') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these commented cases needed?
velox/functions/lib/Re2Functions.cpp
Outdated
@@ -730,8 +730,6 @@ void re2SplitAll( | |||
const re2::StringPiece remaining = input.substr(pos); | |||
arrayWriter.add_item().setNoCopy( | |||
StringView(remaining.data(), remaining.size())); | |||
} else if (pos == input.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks relevant to test failure. Looks this if
branch is necessary. I thought wrongly.
Is the function still tracked? |
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically looks good! Some comments. Please check whether they make sense. Thanks!
BTW, please squash historical commits into one or fewer ones (the community prefers that).
velox/functions/lib/Re2Functions.cpp
Outdated
const re2::StringPiece remaining = input.substr(pos); | ||
arrayWriter.add_item().setNoCopy( | ||
StringView(remaining.data(), remaining.size())); | ||
} else if (pos == input.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else if -> else?
:noindex: | ||
|
||
Splits ``string`` on ``delimiter`` and returns an array of size at most ``limit``. :: | ||
Splits ``string`` on ``regex`` and returns an array of size at most ``limit``. | ||
If limit is negative, ``string`` will be split as many times as possible. :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: add "``" to surround word "limit".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If limit is zero, the behavior is like negative. So let's clarify zero case.
See spark doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems split
with limit has not been supported and it will be covered by another PR: #8825. Maybe, we can remove the above relevant changes, and let that PR fix (if you agree, please ignore the above relevant comments).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems
split
with limit has not been supported and it will be covered by another PR: #8825. Maybe, we can remove the above relevant changes, and let that PR fix (if you agree, please ignore the above relevant comments).
It's ok to me
:noindex: | ||
|
||
Splits ``string`` on ``delimiter`` and returns an array of size at most ``limit``. :: | ||
Splits ``string`` on ``regex`` and returns an array of size at most ``limit``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested statement:
** returns an array whose size is ``limit`` at most.
try { | ||
return std::make_shared<Re2SplitAllConstantPattern>(pattern); | ||
} catch (...) { | ||
return std::make_shared<exec::AlwaysFailingVectorFunction>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks the exception only comes from checkForBadPattern
. I remember makeRe2SplitAll
can be called from Gluten's plan validation path. So it's better to remove catch
to let exception thrown, then to make fallback happen on Gluten side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have regex check in Gluten, thus we can use AlwaysFailingVectorFunction
here.
/// The pattern is any string matching regex. | ||
/// | ||
/// If the pattern is invalid or not constant, throws an exception. | ||
/// If the pattern does not match, returns original string as array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document these cases in string.rst.
fbcec03
to
926b2e9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jackylee-ch, I meant we can only update the doc for this function without limit
arg, as I note limit
arg is not considered in this pr. Right?
Except this, the pr looks good! Thanks!
The limit in split is not supported in current |
@PHILO-HE is the PR related to Cody's re2 support? |
d513611
to
5b627d6
Compare
auto pattern = constantPattern->as<ConstantVector<StringView>>()->valueAt(0); | ||
|
||
try { | ||
return std::make_shared<Re2SplitAllConstantPattern>(pattern); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gaoyangxiaozhu Just notice non-constant vector is also supported in your PR. Please feel free to continue its support in it if needed. The handling of some corner cases could be referred from mine https://github.com/facebookincubator/velox/pull/8825/files#diff-ebc054da2ecc11ac3cf20d47cbc112f9c3b9cf3d5f1784b2149fb0f4e997a2d8R127-R132. Thanks.
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions! |
Description
Split function just support split
string
by a single character for now.This pr to support split by
regex
string.Examples:
Notice:
There are some semantic diff between Java regex and re2 regex. And the lookahead/lookbehind patterns are not supported by RE2. If used, there will be a runtime exception (compile pattern error). e.g. :
#6017