Support regex delimiter and limit argument for Spark split function #10248

gaoyangxiaozhu · 2024-06-18T08:04:28Z

Fixes Spark split function.

Supports splitting with a general regex pattern.
Supports limit parameter.
Rewrites as simple function.

Issue: #4066
Spark documentation: https://spark.apache.org/docs/latest/api/sql/#split
Spark implementation: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L523-L552
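
For readers unfamiliar with the semantics being ported, here is a minimal Python sketch of how a regex delimiter and a `limit` argument interact, following the Spark documentation linked above. It is an illustration only, not the Velox implementation: the name `spark_like_split` is hypothetical, Python's `re` syntax differs from the Java and RE2 dialects, and patterns that can match the empty string are deliberately rejected because that case follows separate rules.

```python
import re
from typing import List


def spark_like_split(s: str, pattern: str, limit: int = -1) -> List[str]:
    """Approximate split(str, regex, limit) for patterns that never match "".

    limit > 0: at most `limit` parts; the last part keeps the unsplit remainder.
    limit <= 0: split on every match; trailing empty strings are kept.
    """
    parts: List[str] = []
    start = 0
    for m in re.finditer(pattern, s):
        # Stop splitting once only one slot is left for the remainder.
        if limit > 0 and len(parts) == limit - 1:
            break
        if m.start() == m.end():
            # Assumption of this sketch: empty matches are out of scope.
            raise ValueError("empty matches are not handled in this sketch")
        parts.append(s[start:m.start()])
        start = m.end()
    parts.append(s[start:])
    return parts


print(spark_like_split("oneAtwoBthreeC", "[ABC]"))     # ['one', 'two', 'three', '']
print(spark_like_split("oneAtwoBthreeC", "[ABC]", 2))  # ['one', 'twoBthreeC']
```

The two printed results correspond to the examples given in the Spark SQL documentation for `split`.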

gaoyangxiaozhu · 2024-06-18T08:04:44Z

@rui-mo / @PHILO-HE, could you please help review?

rui-mo · 2024-06-19T02:06:00Z

@gaoyangxiaozhu Thanks for your work. There is an existing PR supporting regex in split function #6155, and maybe we need to confirm its status first. cc: @PHILO-HE

gaoyangxiaozhu · 2024-06-19T08:38:05Z

Oh, I see. I didn't go through the open PRs to check whether someone had already started on this work. Sorry.

gaoyangxiaozhu · 2024-06-19T09:00:38Z

By the way, @rui-mo / @PHILO-HE, it looks like #6155 only supports a constant pattern; it does not cover non-constant patterns or the limit parameter. I will confirm in that PR's comments.

gaoyangxiaozhu · 2024-06-25T06:43:35Z

@rui-mo / @PHILO-HE / @FelixYBW, since PR #6155 only considers constant-pattern support and has not been updated for a long time, could you start reviewing my PR for the split improvement so we can finalize this part as soon as possible?

PHILO-HE · 2024-06-26T01:32:19Z

@jackylee-ch, what do you think? I guess you took over the work on #6155.

jackylee-ch · 2024-06-26T01:49:12Z

Sorry for the late response. It's OK to move forward with this PR since it covers #6155. My only concern is the performance impact of this PR. @gaoyangxiaozhu, could you create a benchmark in Gluten and make sure it doesn't cause a performance regression?

gaoyangxiaozhu · 2024-06-26T07:58:27Z

Sure, will do. @rui-mo / @PHILO-HE, please help review.

rui-mo

@gaoyangxiaozhu Could you check the corner cases and corresponding tests from my PR first? Thanks.
https://github.com/facebookincubator/velox/pull/8825/files#diff-ebc054da2ecc11ac3cf20d47cbc112f9c3b9cf3d5f1784b2149fb0f4e997a2d8R127-R132

velox/docs/functions/spark/string.rst

velox/functions/sparksql/tests/SplitFunctionsTest.cpp

gaoyangxiaozhu · 2024-06-27T08:02:22Z

Updated, and added a unit test to cover this: https://github.com/gayangya/velox/blob/eddfb43831ac083a53ad17fcdca28bbd0c4b1e37/velox/functions/sparksql/tests/SplitFunctionsTest.cpp#L250

velox/functions/sparksql/SplitFunctions.cpp

rui-mo

Added two nits. Thanks.

velox/functions/sparksql/Split.h

velox/functions/sparksql/tests/SplitTest.cpp

gaoyangxiaozhu · 2024-07-31T08:53:57Z

ping @mbasmanova

mbasmanova

@gaoyangxiaozhu Looks good modulo a few remaining questions / comments. Please address @rui-mo's comments.

mbasmanova · 2024-08-01T07:47:41Z

velox/docs/functions/spark/string.rst

+    is smaller than the size of ``string``, the resulting array only contains ``limit`` number of single characters
+    splitting from ``string``, if ``limit`` is not provided or is larger than the size of ``string``, the resulting 
+    array contains all the single characters of ``string`` and does not include an empty tail character.
+    The split function align with vanilla spark 3.4+ split function. ::
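
The empty-delimiter rule in the doc text above can be sketched in a few lines of Python. This is a hedged illustration rather than the Velox code: the name `split_empty_delimiter` is hypothetical, "size of ``string``" is interpreted as the number of characters (code points), a non-positive `limit` is treated the same as an omitted one, and the empty-input case, which the quoted text does not describe, is not modelled.

```python
from typing import List


def split_empty_delimiter(s: str, limit: int = -1) -> List[str]:
    """Split `s` into single characters, as described for an empty delimiter.

    Assumptions of this sketch: limit <= 0 means "no limit", and the size of
    the string is counted in characters, not bytes.
    """
    chars = list(s)  # one element per character, no empty tail element
    if 0 < limit < len(chars):
        # Only the first `limit` single characters are returned.
        return chars[:limit]
    return chars


print(split_empty_delimiter("abc"))     # ['a', 'b', 'c']
print(split_empty_delimiter("abc", 2))  # ['a', 'b']
print(split_empty_delimiter("abc", 5))  # ['a', 'b', 'c']
```

Note that with a limit smaller than the string length the remainder is dropped rather than kept in the last element, which is how the quoted documentation describes the behavior.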


> The split function align with vanilla spark 3.4+ split function.

Why do we need this comment? Isn't it true of all functions that they are expected to match Spark 3.4+ behavior?

gaoyangxiaozhu

It is there just in case people see different behavior on Spark versions below 3.4, such as 3.2 or 3.3.

mbasmanova

What about other functions? Do they implement Spark 3.2/3.3 semantics? CC: @rui-mo

mbasmanova

@gaoyangxiaozhu @rui-mo gentle ping

rui-mo

@mbasmanova For the split function, the behavior for an empty delimiter changed in Spark 3.4 with commit apache/spark@247306c.
When implementing functions in Velox, if there are semantic differences among Spark versions, I assume we need to follow the latest one (Spark 3.5). Does that make sense? Thanks.

mbasmanova

@rui-mo Rui, thank you for clarifying. I assume this is a general policy that applies to all functions. If so, it would be helpful to document it in https://facebookincubator.github.io/velox/spark_functions.html and clarify which Spark version we match and whether we match ANSI mode on or off.

Furthermore, assuming Spark doesn't guarantee backwards compatibility across versions, we would have to pick a specific version to match and cannot say "latest" or 3.4+. If Spark changes behavior in the "latest" version and we change to match, existing users will start seeing different behavior. Is this the case?

rui-mo

@mbasmanova Thanks for the pointer. I'd like to document it.

> If Spark changes behavior in "latest" version and we change to match, existing users will start seeing different behavior. Is this the case?

I assume it is the case, and it makes sense to me to pick a specific version.

rui-mo

Opened #10677 for the documentation.

@gaoyangxiaozhu I assume the statement below, and "Since Spark 3.4" for the splitEmptyDelimiter function, could be removed.

> The split function align with vanilla spark 3.4+ split function.

velox/functions/prestosql/tests/Utf8Test.cpp

velox/functions/sparksql/Split.h

gaoyangxiaozhu · 2024-08-07T02:09:09Z

kindly ping @mbasmanova

mbasmanova

@gaoyangxiaozhu Overall looks good. A few remaining nits.

velox/functions/sparksql/Split.h

velox/functions/sparksql/tests/SplitTest.cpp

mbasmanova

Thanks!

FelixYBW · 2024-08-12T08:12:34Z

@xiaoxmeng Can you merge the PR?

facebook-github-bot · 2024-08-12T16:07:21Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-08-12T22:51:21Z

@xiaoxmeng merged this pull request in b35bd61.

conbench-facebook · 2024-08-12T23:19:39Z

Conbench analyzed the 1 benchmark run on commit b35bd616.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

facebook-github-bot added the CLA Signed label Jun 18, 2024

gaoyangxiaozhu mentioned this pull request Jun 18, 2024
[VL] When the regexExpr of the split function is an empty character or a non-ASCII character, fall back to Valina Spark apache/incubator-gluten#6127 (Open)

gaoyangxiaozhu force-pushed the gayangya/split_refactor branch 2 times, most recently from c38458a to 6842de6 June 19, 2024 08:36

gaoyangxiaozhu mentioned this pull request Jun 19, 2024
Enhance split Spark function to support regex #6155 (Closed)

gaoyangxiaozhu changed the title from "Refactor spark split function" to "[WIP] Refactor spark split function" Jun 20, 2024

Commit 4e0f895: refactor spark split function

gaoyangxiaozhu force-pushed the gayangya/split_refactor branch from 6842de6 to 4e0f895 June 25, 2024 06:39

gaoyangxiaozhu changed the title from "[WIP] Refactor spark split function" to "Refactor spark split function" Jun 25, 2024

Commit eada804: Merge branch 'facebookincubator:main' into gayangya/split_refactor

rui-mo reviewed Jun 26, 2024
velox/docs/functions/spark/string.rst (outdated)

rui-mo reviewed Jun 26, 2024
velox/functions/sparksql/tests/SplitFunctionsTest.cpp (outdated)

gaoyangxiaozhu added 2 commits June 27, 2024 15:58
Commit a9ec3de: address comments
Commit eddfb43: Merge branch 'gayangya/split_refactor' of https://github.com/gayangya/velox into gayangya/split_refactor

gaoyangxiaozhu requested a review from rui-mo June 27, 2024 08:09

rui-mo reviewed Jun 27, 2024
velox/functions/sparksql/SplitFunctions.cpp (outdated)

Commit f7f0967: address comment

gaoyangxiaozhu requested a review from rui-mo June 28, 2024 02:19

Commit 0348181: Merge branch 'facebookincubator:main' into gayangya/split_refactor

gaoyangxiaozhu added 4 commits July 29, 2024 20:20
Commit ded360a: fix format issue
Commit fdc8c87: address non-well UTF-8 string
Commit dbb9f8a: Merge branch 'main' of https://github.com/gayangya/velox into gayangya/split_refactor
Commit 0ad32d0: fix build

gaoyangxiaozhu requested a review from mbasmanova July 30, 2024 05:03

rui-mo reviewed Jul 30, 2024
velox/functions/sparksql/Split.h (outdated)
velox/functions/sparksql/Split.h (outdated)

Commit 5197087: address comment

rui-mo reviewed Jul 31, 2024
velox/functions/sparksql/tests/SplitTest.cpp (outdated)

gaoyangxiaozhu added 2 commits July 31, 2024 14:32
Commit 0db194e: Merge branch 'facebookincubator:main' into gayangya/split_refactor
Commit a2fc545: Merge branch 'facebookincubator:main' into gayangya/split_refactor

Commit 26e060f: Merge branch 'facebookincubator:main' into gayangya/split_refactor

mbasmanova reviewed Aug 1, 2024

gaoyangxiaozhu added 3 commits August 1, 2024 17:53
Commit e43c558: address comments
Commit 7682d2d: Merge branch 'gayangya/split_refactor' of https://github.com/gayangya/velox into gayangya/split_refactor
Commit 4ad8f15: add ut to cover wide-character as delimiter

gaoyangxiaozhu requested a review from mbasmanova August 2, 2024 07:20

mbasmanova reviewed Aug 7, 2024

Commit 2a399d7: refactor to address comment

mbasmanova approved these changes Aug 7, 2024

mbasmanova added the ready-to-merge label Aug 7, 2024

PHILO-HE mentioned this pull request Aug 12, 2024
[VL] Remove limitation for StringSplit function apache/incubator-gluten#6786 (Open)

facebook-github-bot closed this in b35bd61 Aug 12, 2024

facebook-github-bot added the Merged label Aug 12, 2024
