Support regex delimiter and limit argument for Spark split function #10248
Conversation
@gaoyangxiaozhu Thanks for your work. There is an existing PR, #6155, that adds regex support to the split function; maybe we need to confirm its status first. cc: @PHILO-HE
Force-pushed from c38458a to 6842de6.
Oh, I see. I didn't go through the existing PRs to check whether someone had already touched this work. Sorry.
BTW, @rui-mo / @PHILO-HE, it looks like #6155 only adds support for
Force-pushed from 6842de6 to 4e0f895.
@jackylee-ch, what do you think? I guess you took over the work of #6155.
Sorry for the late response. It's OK to move forward with this PR since it covers #6155. I'm just concerned about the performance impact of this PR. @gaoyangxiaozhu, maybe create a benchmark in Gluten and make sure it doesn't cause a performance regression?
@gaoyangxiaozhu Could you check the corner cases and corresponding tests from my PR first? Thanks.
https://github.com/facebookincubator/velox/pull/8825/files#diff-ebc054da2ecc11ac3cf20d47cbc112f9c3b9cf3d5f1784b2149fb0f4e997a2d8R127-R132
…/velox into gayangya/split_refactor
Updated and added a UT to cover https://github.com/gayangya/velox/blob/eddfb43831ac083a53ad17fcdca28bbd0c4b1e37/velox/functions/sparksql/tests/SplitFunctionsTest.cpp#L250
Added two nits. Thanks.
ping @mbasmanova
@gaoyangxiaozhu Looks good % a few remaining questions / comments. Please, address @rui-mo 's comments.
is smaller than the size of ``string``, the resulting array only contains ``limit`` number of single characters
splitting from ``string``, if ``limit`` is not provided or is larger than the size of ``string``, the resulting
array contains all the single characters of ``string`` and does not include an empty tail character.
The split function align with vanilla spark 3.4+ split function. ::
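The empty-delimiter behavior quoted above can be sketched as follows. This is only an illustrative Python model of the documented Spark 3.4+ semantics, not the Velox implementation, and the function name is hypothetical:

```python
def split_empty_delimiter(s: str, limit: int = -1) -> list[str]:
    """Sketch of Spark 3.4+ split(string, '') semantics (hypothetical helper).

    Returns the single characters of `s`, truncated to `limit` entries when
    0 < limit < len(s); there is never an empty tail element.
    """
    chars = list(s)
    if 0 < limit < len(chars):
        return chars[:limit]
    return chars
```

Under these assumed rules, `split_empty_delimiter("apache", 3)` yields the first three characters, while a missing or oversized `limit` yields all of them.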
The split function align with vanilla spark 3.4+ split function.
Wondering why we need this comment. Isn't it true for all functions that they are expected to match Spark 3.4+ behavior?
Just in case people see different behavior on Spark versions below 3.4, such as 3.2/3.3.
What about other functions? Do they implement Spark 3.2/3.3 semantics? CC: @rui-mo
@gaoyangxiaozhu @rui-mo gentle ping
@mbasmanova For the split function, the behavior for empty delimiter was changed since Spark 3.4 with commit apache/spark@247306c.
When implementing functions in Velox, if there is a semantic difference among Spark versions, I assume we need to follow the latest one (Spark 3.5). Does it make sense? Thanks.
@rui-mo Rui, thank you for clarifying. I assume this is a general policy that applies to all functions. If so, it would be helpful to document it in https://facebookincubator.github.io/velox/spark_functions.html and clarify which Spark version we match and whether we match ANSI mode on or off.
Furthermore, assuming Spark doesn't guarantee backwards compatibility across versions, we would have to pick a specific version we match and cannot say "latest" or 3.4+. If Spark changes behavior in "latest" version and we change to match, existing users will start seeing different behavior. Is this the case?
@mbasmanova Thanks for the pointer. I'd like to document it.
If Spark changes behavior in "latest" version and we change to match, existing users will start seeing different behavior. Is this the case?
I assume it is the case, and it makes sense to me to pick a specific version.
Opened #10677 for the documentation.
@gaoyangxiaozhu I assume the statement below and the `Since Spark 3.4` note for the `splitEmptyDelimiter` function could be removed.

> The split function align with vanilla spark 3.4+ split function.
Kindly ping @mbasmanova
@gaoyangxiaozhu Overall looks good. A few remaining nits.
Thanks!
@xiaoxmeng Can you merge the PR?
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@xiaoxmeng merged this pull request in b35bd61.
Conbench analyzed the 1 benchmark run on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Fixes the Spark split function to support a regex delimiter and the `limit` parameter.

Issue: #4066
Spark document: https://spark.apache.org/docs/latest/api/sql/#split
Spark implementation: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L523-L552
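For reference, Spark's `split(str, regex, limit)` follows Java's `String.split` semantics: a positive `limit` caps the array length (the last entry keeps the remainder), `limit == 0` drops trailing empty strings, and a negative `limit` keeps them. A rough Python sketch of those rules, assuming a non-empty delimiter (the empty-delimiter case changed in Spark 3.4, as discussed above); `spark_split` is a hypothetical name, not a Velox or Spark API:

```python
import re

def spark_split(s: str, pattern: str, limit: int = -1) -> list[str]:
    """Model of Java String.split semantics for a non-empty regex delimiter."""
    if limit > 0:
        # Pattern is applied at most limit - 1 times, so the result has at
        # most `limit` entries; the last entry holds the unsplit remainder.
        return re.split(pattern, s, maxsplit=limit - 1)
    parts = re.split(pattern, s)
    if limit == 0:
        # Java's limit == 0 removes trailing empty strings from the result.
        while parts and parts[-1] == "":
            parts.pop()
    return parts
```

For example, under these assumed rules `spark_split("oneAtwoBthreeC", "[ABC]")` keeps the trailing empty string, while passing `limit=0` drops it and `limit=2` stops after the first match.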