Add Spark mask function #10264
Would it make sense to provide a fast path for ASCII inputs?
Hmm, it looks like leveraging `callAscii` would have some benefit, but not much, since we would still need to handle the non-ASCII cases for the replacement-character args. Created an issue to do this in a separate PR if needed: #10546
@mbasmanova It seems only `callAscii` is provided, while a function like `callAsciiNullable` is needed here. Do you think we need to add that? Thanks.
Is this the same behavior as Spark? Perhaps document this in string.rst.
Not exactly the same. Spark's mask function actually has issues when handling invalid UTF-8 characters or wide (4-byte) characters. For example, the wide character "😀" is 4 bytes, encoded in UTF-16 as the surrogate pair `\uD83D\uDE00`; Spark wrongly treats it as 2 characters, `\uD83D` and `\uDE00`. For the invalid UTF-8 case, for example the literal `\xED`, Spark treats it as 4 characters: "\", "x", "E", "D". That's due to the limitation that Spark uses Java's `toString.map` to iterate over the chars of the input string, which only works well for characters of at most 2 bytes (16 bits). See the Spark code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala#L322C1-L349C4

To do the right thing, our implementation treats those wide-character and invalid-UTF-8 cases correctly as single characters, instead of aligning with Spark.
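A minimal Scala sketch of the behavior described above (plain JVM string semantics, not the actual Spark operator): mapping over a Java/Scala string visits 16-bit UTF-16 code units, so a 4-byte character is seen as its two surrogate halves.

```scala
object SurrogateSplitDemo extends App {
  // "😀" (U+1F600) is one Unicode code point, but two UTF-16 code units.
  val s = "😀"

  println(s.length)                      // 2 -- Char count (UTF-16 code units)
  println(s.codePointCount(0, s.length)) // 1 -- actual character count

  // Per-Char iteration, as String.map does, visits each surrogate half:
  s.foreach(c => println(f"\\u${c.toInt}%04X"))
  // \uD83D
  // \uDE00
}
```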
Does Spark contain tests for the invalid UTF-8 characters and wide (4-byte) characters you mentioned?
It doesn't. These are the tests from Spark I found: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala#L5772C1-L5804C4 and https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/mask-functions.sql, which we have also already covered in our tests.
Thanks for providing the above information. As you mentioned, the difference from Spark is in the handling of invalid UTF-8 characters and wide characters. I tried it in Spark and found its output was indeed inconsistent, e.g., for the below case this PR gives `yyyyyいいYyYYdddいいい` while Spark gives `yyyyyいいYyYYdddいいいい`, where one more `い` is added. Is it more like undefined behavior in Spark, given there is no test coverage for those cases? The result from this PR looks more correct from my view.

@mbasmanova Do you have any suggestions? Thanks.
@rui-mo, thanks, and yes, Spark actually can't handle invalid UTF-8 characters well in its implementation. And as I already mentioned in the comments above, Spark also can't handle wide characters correctly.

If you check the implementation, Spark's mask function leverages `toString.map(charObject)` iteration to replace each char. However, Java's `toString` uses UTF-16 by default if my understanding is right, and the Java Char object uses a 16-bit representation for a character. That means the Spark mask function can only handle BMP characters (those that fit in 16 bits) well; it can't handle either invalid UTF-8 characters or wide characters.

For `世界🙂`: `世` and `界` are Chinese characters that each fit in a single 16-bit UTF-16 code unit (2 bytes), so they are replaced correctly with `いい`. But `🙂` is a wide character represented with 4 bytes, so Spark wrongly treats it as 2 characters of 2 bytes each, causing `🙂` to be replaced with `いい`. That's why it finally produces 4 `い`.

Another issue is replacing with a wide character: a case like the one sketched below causes garbled output in Spark.
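A hedged Scala sketch of both failure modes. This is a hypothetical reconstruction of the per-Char approach, not the actual Spark operator; the replacement characters `Y`/`y`/`d`/`い` follow the examples in this thread.

```scala
object MaskPerCharDemo extends App {
  // Per-Char category mapping, mimicking a 16-bit-char implementation.
  def maskChar(c: Char): Char =
    if (Character.isUpperCase(c)) 'Y'
    else if (Character.isLowerCase(c)) 'y'
    else if (Character.isDigit(c)) 'd'
    else 'い' // CJK ideographs are neither upper nor lower case: "other"

  // Failure mode 1: the surrogate pair of 🙂 is masked as two characters.
  val input = "世界🙂" // 3 code points, but 4 UTF-16 code units
  println(input.map(maskChar)) // いいいい -- one extra い for the split 🙂

  // Failure mode 2: a 4-byte replacement character cannot fit in one Char.
  // Taking the first code unit leaves an unpaired surrogate in the output.
  val replacement: Char = "🙂".charAt(0) // '\uD83D' -- half of the character
  println("abc".map(_ => replacement))  // garbled: three unpaired surrogates
}
```

The PR's Velox implementation, by contrast, iterates over whole characters, so `🙂` is masked exactly once.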
Let me know if you have any remaining concerns, @rui-mo.
@gaoyangxiaozhu I understand your point. Perhaps we can follow this suggestion and ask about it in the Spark community. What do you think?
Thanks @rui-mo. I believe, from looking at the implementation of Spark's mask function, that it doesn't account for invalid UTF-8 or wide characters.

I just raised a question on Stack Overflow (https://stackoverflow.com/staging-ground/78781459) and filed a Spark JIRA issue (https://issues.apache.org/jira/browse/SPARK-48973); let's wait for the Spark community's reply.
Thanks!