Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support regex delimiter and limit argument for Spark split function #10248

Closed

Conversation

gaoyangxiaozhu
Copy link
Contributor

@gaoyangxiaozhu gaoyangxiaozhu commented Jun 18, 2024

Fixes Spark split function.

  1. Supports splitting with a general regex pattern.
  2. Supports limit parameter.
  3. Rewrites as simple function.

Issue: #4066
Spark document: https://spark.apache.org/docs/latest/api/sql/#split
Spark implementation: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L523-L552

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 18, 2024
@gaoyangxiaozhu
Copy link
Contributor Author

@rui-mo / @PHILO-HE for help review.

Copy link

netlify bot commented Jun 18, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 2a399d7
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/66b32e174ee1b7000855e17a

@rui-mo
Copy link
Collaborator

rui-mo commented Jun 19, 2024

@gaoyangxiaozhu Thanks for your work. There is an existing PR supporting regex in split function #6155, and maybe we need to confirm its status first. cc: @PHILO-HE

@gaoyangxiaozhu gaoyangxiaozhu force-pushed the gayangya/split_refactor branch 2 times, most recently from c38458a to 6842de6 Compare June 19, 2024 08:36
@gaoyangxiaozhu
Copy link
Contributor Author

@gaoyangxiaozhu Thanks for your work. There is an existing PR supporting regex in split function #6155, and maybe we need to confirm its status first. cc: @PHILO-HE

oh, i see, i didn't go through the PRs to check if already people have touch this work. sorry.

@gaoyangxiaozhu
Copy link
Contributor Author

@gaoyangxiaozhu Thanks for your work. There is an existing PR supporting regex in split function #6155, and maybe we need to confirm its status first. cc: @PHILO-HE

btw, @rui-mo / @PHILO-HE . looks #6155 only make support for constant pattern, not cover for non-constant pattern or limit parameter. Will confirm in the PR comments

@gaoyangxiaozhu gaoyangxiaozhu changed the title Refactor spark split function [WIP] Refactor spark split function Jun 20, 2024
@gaoyangxiaozhu gaoyangxiaozhu force-pushed the gayangya/split_refactor branch from 6842de6 to 4e0f895 Compare June 25, 2024 06:39
@gaoyangxiaozhu gaoyangxiaozhu changed the title [WIP] Refactor spark split function Refactor spark split function Jun 25, 2024
@gaoyangxiaozhu
Copy link
Contributor Author

@rui-mo / @PHILO-HE / @FelixYBW since the PR #6155 only consider constant pattern support and has no update long time, could you start review my Pr for split improvement to let's help make this part finalize asap ?

@PHILO-HE
Copy link
Contributor

@rui-mo / @PHILO-HE / @FelixYBW since the PR #6155 only consider constant pattern support and has no update long time, could you start review my Pr for split improvement to let's help make this part finalize asap ?

@jackylee-ch, what do you think? I guess you took over the work of 6155.

@jackylee-ch
Copy link
Contributor

@rui-mo / @PHILO-HE / @FelixYBW since the PR #6155 only consider constant pattern support and has no update long time, could you start review my Pr for split improvement to let's help make this part finalize asap ?

Sorry for late response. It's ok to move on this PR since it covers #6155. Just concern about the performance impact of this PR. @gaoyangxiaozhu, maybe create a benchmark in gluten and make sure it won't cause negative performance gains?

@gaoyangxiaozhu
Copy link
Contributor Author

@rui-mo / @PHILO-HE / @FelixYBW since the PR #6155 only consider constant pattern support and has no update long time, could you start review my Pr for split improvement to let's help make this part finalize asap ?

Sorry for late response. It's ok to move on this PR since it covers #6155. Just concern about the performance impact of this PR. @gaoyangxiaozhu, maybe create a benchmark in gluten and make sure it won't cause negative performance gains?

sure, will do. @rui-mo / @PHILO-HE so please help review.

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaoyangxiaozhu gaoyangxiaozhu requested a review from rui-mo June 27, 2024 08:09
@gaoyangxiaozhu gaoyangxiaozhu requested a review from rui-mo June 28, 2024 02:19
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added two nits. Thanks.

velox/functions/sparksql/Split.h Outdated Show resolved Hide resolved
velox/functions/sparksql/Split.h Outdated Show resolved Hide resolved
@gaoyangxiaozhu
Copy link
Contributor Author

ping @mbasmanova

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaoyangxiaozhu Looks good % a few remaining questions / comments. Please, address @rui-mo 's comments.

is smaller than the size of ``string``, the resulting array only contains ``limit`` number of single characters
splitting from ``string``, if ``limit`` is not provided or is larger than the size of ``string``, the resulting
array contains all the single characters of ``string`` and does not include an empty tail character.
The split function align with vanilla spark 3.4+ split function. ::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The split function align with vanilla spark 3.4+ split function.

Wondering why do we need this comment? Isn't it true for all functions that they are expected to match Spark 3.4+ behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just in case people see different behavior happen for spark version below 3.4 as 3.2/3.3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other functions? Do they implement Spark 3.2/3.3 semantics? CC: @rui-mo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaoyangxiaozhu @rui-mo gentle ping

Copy link
Collaborator

@rui-mo rui-mo Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova For the split function, the behavior for empty delimiter was changed since Spark 3.4 with commit apache/spark@247306c.
When implementing functions in Velox, if there is semantic difference among Spark versions, I assume we need to follow the latest one (Spark 3.5). Does it makes sense? Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Rui, thank you for clarifying. I assume this is a general policy that applies to all functions. If so, it would be helpful to document it in https://facebookincubator.github.io/velox/spark_functions.html and clarify which Spark version we match and whether we match ANSI mode on or off.

Furthermore, assuming Spark doesn't guarantee backwards compatibility across versions, we would have to pick a specific version we match and cannot say "latest" or 3.4+. If Spark changes behavior in "latest" version and we change to match, existing users will start seeing different behavior. Is this the case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova Thanks for the pointer. I'd like to document it.

If Spark changes behavior in "latest" version and we change to match, existing users will start seeing different behavior. Is this the case?

I assume it is the case, and it makes sense to me to pick a specific version.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #10677 for the documentation.

@gaoyangxiaozhu I assume below statement and Since Spark 3.4 for splitEmptyDelimiter function could be removed.

The split function align with vanilla spark 3.4+ split function.

velox/functions/prestosql/tests/Utf8Test.cpp Show resolved Hide resolved
velox/functions/sparksql/Split.h Outdated Show resolved Hide resolved
@gaoyangxiaozhu
Copy link
Contributor Author

kindly ping @mbasmanova

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaoyangxiaozhu Overall looks good. A few remaining nits.

velox/functions/sparksql/Split.h Outdated Show resolved Hide resolved
velox/functions/sparksql/tests/SplitTest.cpp Outdated Show resolved Hide resolved
velox/functions/sparksql/tests/SplitTest.cpp Outdated Show resolved Hide resolved
velox/functions/sparksql/tests/SplitTest.cpp Outdated Show resolved Hide resolved
velox/functions/sparksql/tests/SplitTest.cpp Show resolved Hide resolved
velox/functions/sparksql/tests/SplitTest.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Aug 7, 2024
@FelixYBW
Copy link

@xiaoxmeng Can you merge the PR?

@facebook-github-bot
Copy link
Contributor

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@xiaoxmeng merged this pull request in b35bd61.

Copy link

Conbench analyzed the 1 benchmark run on commit b35bd616.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants