[VL] When the regexExpr of the split function is an empty string or a non-ASCII character, fall back to Vanilla Spark #6127

Open
mcdull-zhang opened this issue Jun 18, 2024 · 5 comments
Labels
bug (Something isn't working), triage

Comments

@mcdull-zhang

Backend

VL (Velox)

Bug description

empty character

Expression:  split('abc','')
Spark: ["a","b","c",""]
Gluten: Reason: (0 vs. 1) split only supports only single-character pattern

non-ASCII character

Expression:  split('a,b,c',',')
Spark: ["a","b","c"]
Gluten: Reason: (3 vs. 1) split only supports only single-character pattern

Spark version

Spark-3.2.x

Spark configurations

No response

System information

No response

Relevant logs

java.lang.RuntimeException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (0 vs. 1) split only supports only single-character pattern
Retriable: False
Expression: patternString.size() == 1
Context: split(n0_0, :VARCHAR)
Top-Level Context: Same as context.
Function: apply
File: ep/build-velox/build/velox_ep/velox/functions/sparksql/SplitFunctions.cpp
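
The numbers in "(0 vs. 1)" and "(3 vs. 1)" are consistent with the failing check `patternString.size() == 1` comparing the byte length of the pattern against 1. A minimal Python sketch of that arithmetic follows. It assumes the pattern is UTF-8 encoded and that the non-ASCII delimiter is the full-width comma U+FF0C; that choice is an assumption, since any 3-byte character would produce the same message. `passes_single_byte_check` is a hypothetical helper for illustration, not a Velox or Gluten API.

```python
def pattern_byte_length(pattern: str) -> int:
    """Return the length of the pattern in bytes when encoded as UTF-8."""
    return len(pattern.encode("utf-8"))


def passes_single_byte_check(pattern: str) -> bool:
    """Hypothetical mirror of the check `patternString.size() == 1`."""
    return pattern_byte_length(pattern) == 1


# Assumed sample patterns: empty string, ASCII comma, full-width comma (U+FF0C).
for pattern in ["", ",", "\uff0c"]:
    print(repr(pattern), pattern_byte_length(pattern), passes_single_byte_check(pattern))
```

Under these assumptions only the ASCII comma passes, which matches the two failing cases reported above.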

mcdull-zhang added the bug and triage labels on Jun 18, 2024

@jackylee-ch
Contributor

This happens because the current split only supports splitting on a literal delimiter, while Spark supports regular expressions. If necessary, I can open a PR to fix this problem, since Velox already supports regexp_split.
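
To illustrate the difference between splitting on a literal delimiter and splitting on a regular expression, here is a small sketch using Python's `re` module. It is only an analogy: Spark's split takes a Java regular expression, and edge cases (such as empty patterns) can behave differently between the two regex engines.

```python
import re

text = "a.b.c"

# Literal delimiter: "." is treated as a plain character.
literal_parts = text.split(".")

# Regular expression: an unescaped "." matches any character, so every
# character acts as a separator and only empty strings remain.
regex_parts = re.split(r".", text)

# Escaping the dot makes the regex split agree with the literal split.
escaped_parts = re.split(re.escape("."), text)

print(literal_parts)
print(regex_parts)
print(escaped_parts)
```

A delimiter-only implementation and a regex implementation therefore agree only when the pattern contains no regex metacharacters or when those are escaped.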

@mcdull-zhang
Author

@jackylee-ch Please fix it.

@gaoyangxiaozhu
Contributor

Hey @mcdull-zhang, I am working on fixing the split issue. There is a fix PR in Velox, facebookincubator/velox#10248; once that PR is ready, I will update the Gluten part.

@gaoyangxiaozhu
Contributor

> This happens because the current split only supports splitting on a literal delimiter, while Spark supports regular expressions. If necessary, I can open a PR to fix this problem, since Velox already supports regexp_split.

Hey @jackylee-ch, regexp_split can't be used here since it does not support the limit parameter. Please check my PR facebookincubator/velox#10248.
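
For context on the limit parameter: in Spark's split, a positive limit caps the result at that many elements, with the last element holding the unsplit remainder, and a limit of zero or less splits as many times as possible. The sketch below approximates that rule with Python's `re.split`. `split_with_limit` is a hypothetical helper written for illustration, not the Velox or Gluten implementation, and Python's regex engine may differ from Java's on edge cases such as empty patterns.

```python
import re
from typing import List


def split_with_limit(text: str, pattern: str, limit: int = -1) -> List[str]:
    """Hypothetical helper approximating the limit rule of Spark's split."""
    if limit == 1:
        # A single element means no split is performed at all.
        return [text]
    if limit > 1:
        # At most `limit` elements means at most `limit - 1` splits.
        return re.split(pattern, text, maxsplit=limit - 1)
    # limit <= 0: split as many times as possible.
    return re.split(pattern, text)


print(split_with_limit("a,b,c", ",", 2))
print(split_with_limit("a,b,c", ",", -1))
```

The explicit `limit == 1` branch is needed because `maxsplit=0` in Python means "no cap" rather than "no splits".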

@jackylee-ch
Contributor

> Hey @jackylee-ch, regexp_split can't be used here since it does not support the limit parameter. Please check my PR facebookincubator/velox#10248.

Great job, looking forward to your fix.
