Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Spark from_json function #11709

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

zhli1142015
Copy link
Contributor

@zhli1142015 zhli1142015 commented Dec 2, 2024

Why I Need to Reimplement JSON Parsing Logic Instead of Using CAST(JSON):

Failure Handling:
On failure, from_json(JSON) returns NULL. For instance, parsing {"a 1} would result in {NULL}.
Root Type Restrictions:
Only ROW, ARRAY, and MAP types are allowed as root types.
Boolean Handling:
Only true and false are considered valid boolean values. Numeric values or strings will result in NULL.
Integral Type Handling:
Only integral values are valid for integral types. Floating-point values and strings will produce NULL.
Float/Double Handling:
All numeric values are valid for float/double types. However, for strings, only specific values like "NaN" or "INF" are valid.
Array Handling:
Spark allows a JSON object as input for an array schema only if the array is the root type and its child type is a ROW.
Map Handling:
Keys in a MAP can only be of VARCHAR type. For example, parsing {"3": 3} results in {"3": 3} instead of {3: 3}.
Row Handling:
Spark supports partial output mode. However, it does not allow an input JSON array when parsing a ROW.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 2, 2024
Copy link

netlify bot commented Dec 2, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit c3696df
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/675bbd0ee675b60008a4247e

@zhli1142015
Copy link
Contributor Author

cc @rui-mo and @PHILO-HE , thanks.

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Added some initial comments.

velox/functions/sparksql/specialforms/CMakeLists.txt Outdated Show resolved Hide resolved
velox/functions/sparksql/specialforms/FromJson.cpp Outdated Show resolved Hide resolved
velox/functions/sparksql/specialforms/FromJson.cpp Outdated Show resolved Hide resolved
velox/functions/sparksql/specialforms/FromJson.cpp Outdated Show resolved Hide resolved
velox/functions/sparksql/specialforms/FromJson.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@jinchengchenghh jinchengchenghh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The json input is variable, how can we make sure all the implement matches to Spark, Maybe we need to search from_json in Spark and make sure the result is correct.

velox/docs/functions/spark/json.rst Outdated Show resolved Hide resolved
velox/docs/functions/spark/json.rst Outdated Show resolved Hide resolved
velox/docs/functions/spark/json.rst Show resolved Hide resolved
velox/functions/sparksql/specialforms/FromJson.cpp Outdated Show resolved Hide resolved
velox/functions/sparksql/specialforms/FromJson.cpp Outdated Show resolved Hide resolved
velox/functions/sparksql/specialforms/FromJson.cpp Outdated Show resolved Hide resolved
@zhli1142015
Copy link
Contributor Author

The json input is variable, how can we make sure all the implement matches to Spark, Maybe we need to search from_json in Spark and make sure the result is correct.

The current implementation supports only Spark's default behavior, and we should fall back to Spark's implementation when specific unsupported cases arise. These include situations where user-provided options are non-empty, schemas contain unsupported types, schemas include a column with the same name as spark.sql.columnNameOfCorruptRecord, or the configuration spark.sql.json.enablePartialResults is disabled.

The only existing unit tests in Spark related to this function are found in JsonExpressionsSuite and JsonFunctionsSuite. I have verified that these tests pass and added missing tests to ensure the current implementation aligns with Spark's behavior. For further details, please refer to the new unit tests included in this PR.

@zhli1142015 zhli1142015 force-pushed the add_from_json branch 2 times, most recently from a284e49 to 2762885 Compare December 10, 2024 07:57
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

velox/docs/functions/spark/json.rst Outdated Show resolved Hide resolved
velox/docs/functions/spark/json.rst Outdated Show resolved Hide resolved
velox/functions/sparksql/specialforms/FromJson.cpp Outdated Show resolved Hide resolved
@rui-mo rui-mo changed the title feat: Add from_json Spark function feat: Add Spark from_json function Dec 10, 2024
@zhli1142015 zhli1142015 requested a review from rui-mo December 11, 2024 03:39
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! Added some comments.

velox/docs/functions/spark/json.rst Outdated Show resolved Hide resolved
velox/docs/functions/spark/json.rst Outdated Show resolved Hide resolved
velox/docs/functions/spark/json.rst Outdated Show resolved Hide resolved
velox/docs/functions/spark/json.rst Outdated Show resolved Hide resolved
velox/functions/sparksql/specialforms/FromJson.cpp Outdated Show resolved Hide resolved
velox/functions/sparksql/specialforms/FromJson.cpp Outdated Show resolved Hide resolved
velox/functions/sparksql/specialforms/FromJson.cpp Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants