feat: Add Spark from_json function #11709
Thanks. Added some initial comments.
The JSON input is variable, so how can we make sure the implementation matches Spark in all cases? Maybe we need to search for from_json in Spark and make sure the results are correct.
The current implementation supports only Spark's default behavior, and we should fall back to Spark's implementation when specific unsupported cases arise. These include situations where user-provided options are non-empty, schemas contain unsupported types, or schemas include a column with the same name as […]. The only existing unit tests in Spark related to this function are found in […].
Thanks.
Thanks for the update! Added some comments.
Thanks for iterating!
Looks basically good.
Are nested complex types supported? E.g., an array whose element is an array, struct, or map. It would be better to clarify this in the documentation and add tests if any are missing. Thanks!
Hi @zhli1142015, could you help document the limitations of the current implementation as mentioned in #11709 (comment) and #11709 (comment)?
Updated, thanks.
Added some comments on the documentation. Thanks.
I am hitting this error after using the recent change in one from_json case. @zhli1142015, any idea what might have gone wrong? I will also try to find more details.
Thanks for reporting this. I think you may have hit the case below. Because the schema names we get are all lower case, we can't get the correct mapping between fields and data. We need to fall back in this case.
OK. For this case the results also don't match: Spark returns [null,1] and [null,3], whereas Velox returns [0,1] and [0,3].
I think Velox is not involved here, since Gluten falls back in this case. I'm not sure how you got a different result.
I think it is because of the check n!=f.name. The case I tried has the fields id and id, so the condition evaluates to false. The condition will need to be modified for this case.
Does your schema contain duplicate fields? Wouldn't this cause issues for other operations?
Ideally it should not happen in a real-world scenario. I was just creating some random test cases to check for behavior differences.
Got it, thanks for the clarification. I've updated the Gluten PR.
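To make the duplicate-field discussion above concrete, here is a minimal Python sketch of the kind of check being described. It is not the actual Gluten code: the function name `needs_fallback` and the exact rule are assumptions for illustration. It flags a schema when lower-casing a field name changes it, and also when two fields end up with the same lower-cased name, which is the `id` and `id` case that a name-inequality check alone would miss.

```python
def needs_fallback(field_names):
    """Hypothetical check: True when lower-cased names cannot be mapped back to fields."""
    lowered = [name.lower() for name in field_names]
    # A name that changes when lower-cased (e.g. "Id") can no longer be
    # matched exactly against the keys in the JSON data.
    has_mixed_case = any(low != name for low, name in zip(lowered, field_names))
    # Two fields with the same lower-cased name (e.g. "id" and "id") are
    # ambiguous; the inequality test above does not catch exact duplicates.
    has_duplicates = len(set(lowered)) != len(lowered)
    return has_mixed_case or has_duplicates
```

With this rule, a schema with fields `id` and `id` is sent to the fallback path even though neither name changes when lower-cased.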
Why I Need to Reimplement JSON Parsing Logic Instead of Using CAST(JSON):
Failure Handling:
On failure, from_json(JSON) returns NULL. For instance, parsing {"a 1} would result in {NULL}.
Root Type Restrictions:
Only ROW, ARRAY, and MAP types are allowed as root types.
Boolean Handling:
Only true and false are considered valid boolean values. Numeric values or strings will result in NULL.
Integral Type Handling:
Only integral values are valid for integral types. Floating-point values and strings will produce NULL.
Float/Double Handling:
All numeric values are valid for float/double types. However, for strings, only specific values like "NaN" or "INF" are valid.
Array Handling:
Spark allows a JSON object as input for an array schema only if the array is the root type and its child type is a ROW.
Map Handling:
Keys in a MAP can only be of VARCHAR type. For example, parsing {"3": 3} results in {"3": 3} instead of {3: 3}.
Row Handling:
Spark supports partial output mode. However, it does not allow an input JSON array when parsing a ROW.
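The rules above can be sketched in plain Python to show how they differ from a permissive cast. This is only an illustration of the described semantics, not the Velox implementation: all function names are hypothetical, the set of accepted special float strings is limited to the two values named above, and returning a row of NULLs for non-object input is an assumption.

```python
import json
import math

def parse_boolean(value):
    # Only JSON true/false are valid; numbers and strings become None (NULL).
    return value if isinstance(value, bool) else None

def parse_integral(value):
    # bool is a subclass of int in Python, so it is excluded explicitly.
    # Floating-point values and strings become None.
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    return None

# Assumption: only the two special strings named in the description.
_SPECIAL_FLOATS = {"NaN": math.nan, "INF": math.inf}

def parse_floating(value):
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)  # every numeric value is valid
    if isinstance(value, str):
        return _SPECIAL_FLOATS.get(value)  # other strings become None
    return None

def from_json_row(text, field_parsers):
    """Parse a JSON object into a dict; field_parsers maps field name to a leaf parser."""
    null_row = {name: None for name in field_parsers}
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return null_row  # malformed input such as '{"a 1}' yields a row of NULLs
    if not isinstance(obj, dict):
        return null_row  # assumption: a JSON array is rejected for a ROW
    # Partial output: a field that fails to parse is None, the others are kept.
    return {
        name: parse(obj[name]) if name in obj else None
        for name, parse in field_parsers.items()
    }
```

For example, `from_json_row('{"a": "x", "b": 2}', {"a": parse_integral, "b": parse_integral})` keeps `b` and sets `a` to `None`, which mirrors the partial output mode mentioned under Row Handling.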