feat: Add Spark from_json function #11709

Open: wants to merge 6 commits into base: main

Contributor

@zhli1142015 zhli1142015 commented Dec 2, 2024

Why I Need to Reimplement JSON Parsing Logic Instead of Using CAST(JSON):

Failure Handling:
On failure, from_json(JSON) returns NULL. For instance, parsing {"a 1} would result in {NULL}.
Root Type Restrictions:
Only ROW, ARRAY, and MAP types are allowed as root types.
Boolean Handling:
Only true and false are considered valid boolean values. Numeric values or strings will result in NULL.
Integral Type Handling:
Only integral values are valid for integral types. Floating-point values and strings will produce NULL.
Float/Double Handling:
All numeric values are valid for float/double types. However, for strings, only specific values like "NaN" or "INF" are valid.
Array Handling:
Spark allows a JSON object as input for an array schema only if the array is the root type and its child type is a ROW.
Map Handling:
Keys in a MAP can only be of VARCHAR type. For example, parsing {"3": 3} results in {"3": 3} instead of {3: 3}.
Row Handling:
Spark supports partial output mode. However, it does not allow an input JSON array when parsing a ROW.
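
As a rough illustration of the scalar rules listed above, the following Python sketch applies them to values that Python's json module has already parsed. It is not the Velox implementation. The helper names (to_boolean, to_integral, to_double) are hypothetical, the special-string table contains only the two values named above (the real accepted list may be longer), and rejecting JSON booleans for float/double is an assumption.

```python
import json
import math

# Only the two special strings named in the description; the full list may be longer.
SPECIAL_FLOAT_STRINGS = {"NaN": math.nan, "INF": math.inf}


def to_boolean(value):
    """Only JSON true/false are valid; numbers and strings become None (NULL)."""
    return value if isinstance(value, bool) else None


def to_integral(value):
    """Only integral JSON numbers are valid; floats and strings become None."""
    # bool is a subclass of int in Python, so it has to be excluded explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    return None


def to_double(value):
    """Any JSON number is valid; strings are valid only if they are special values."""
    if isinstance(value, bool):
        return None  # assumption: booleans are not numeric input
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        return SPECIAL_FLOAT_STRINGS.get(value)
    return None


row = json.loads('{"flag": 1, "count": 3.0, "ratio": "INF"}')
# prints: None None inf
print(to_boolean(row["flag"]), to_integral(row["count"]), to_double(row["ratio"]))
```

Python's json module also keeps object keys as strings, so json.loads('{"3": 3}') yields {"3": 3}, which mirrors the MAP key rule above.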

@facebook-github-bot facebook-github-bot added the CLA Signed label Dec 2, 2024 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)

netlify bot commented Dec 2, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit a1fd0ed
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67872756b7306c00082f7fa2

@zhli1142015
Contributor Author

cc @rui-mo and @PHILO-HE , thanks.

Collaborator

@rui-mo rui-mo left a comment

Thanks. Added some initial comments.

Resolved review threads (all outdated): velox/functions/sparksql/specialforms/CMakeLists.txt (1), velox/functions/sparksql/specialforms/FromJson.cpp (4)

Contributor

@jinchengchenghh jinchengchenghh left a comment

JSON input varies widely. How can we make sure the implementation matches Spark in all cases? Maybe we need to search for from_json usages in Spark and verify that the results are correct.

Resolved review threads: velox/docs/functions/spark/json.rst (3, of which 2 outdated), velox/functions/sparksql/specialforms/FromJson.cpp (3, all outdated)

@zhli1142015
Contributor Author

The current implementation supports only Spark's default behavior, and we should fall back to Spark's implementation when specific unsupported cases arise. These include situations where user-provided options are non-empty, schemas contain unsupported types, schemas include a column with the same name as spark.sql.columnNameOfCorruptRecord, or the configuration spark.sql.json.enablePartialResults is disabled.
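
The fallback conditions above could be checked along the following lines. This is only a Python sketch under stated assumptions, not the actual Gluten or Velox code: should_fall_back and supported_types are hypothetical names, the schema is treated as a flat list of (name, type) pairs without walking nested types, and the corrupt-record column name is compared by exact match.

```python
def should_fall_back(options, schema, conf, supported_types):
    """Return True when the query should use Spark's own from_json.

    options: dict of user-provided from_json options
    schema: list of (field_name, type_name) pairs (flat, nested types not walked)
    conf: dict of Spark SQL configuration values, as strings
    supported_types: set of type names the native path handles (hypothetical)
    """
    # Any user-provided option means non-default behavior.
    if options:
        return True
    # Schemas containing unsupported types cannot be handled natively.
    if any(type_name not in supported_types for _, type_name in schema):
        return True
    # A column named like the corrupt-record column changes Spark's output.
    corrupt_column = conf["spark.sql.columnNameOfCorruptRecord"]
    if any(name == corrupt_column for name, _ in schema):
        return True
    # The native path assumes partial results are enabled.
    if conf["spark.sql.json.enablePartialResults"].lower() != "true":
        return True
    return False


# Example configuration values, chosen for illustration.
example_conf = {
    "spark.sql.columnNameOfCorruptRecord": "_corrupt_record",
    "spark.sql.json.enablePartialResults": "true",
}
print(should_fall_back({}, [("a", "INT")], example_conf, {"INT", "STRING"}))
```

Passing the configuration in as a plain dict keeps the sketch independent of any Spark session; a real check would read these values from the active SQL configuration.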

The only existing unit tests in Spark related to this function are found in JsonExpressionsSuite and JsonFunctionsSuite. I have verified that these tests pass and added missing tests to ensure the current implementation aligns with Spark's behavior. For further details, please refer to the new unit tests included in this PR.

@zhli1142015 zhli1142015 force-pushed the add_from_json branch 2 times, most recently from a284e49 to 2762885 Compare December 10, 2024 07:57

Collaborator

@rui-mo rui-mo left a comment

Thanks.

Resolved review threads (all outdated): velox/docs/functions/spark/json.rst (2), velox/functions/sparksql/specialforms/FromJson.cpp (1)
@rui-mo rui-mo changed the title feat: Add from_json Spark function feat: Add Spark from_json function Dec 10, 2024
@zhli1142015 zhli1142015 requested a review from rui-mo December 11, 2024 03:39

Collaborator

@rui-mo rui-mo left a comment

Thanks for the update! Added some comments.

Resolved review threads (all outdated): velox/docs/functions/spark/json.rst (4), velox/functions/sparksql/specialforms/FromJson.cpp (4)
@zhli1142015 zhli1142015 requested a review from rui-mo December 12, 2024 10:56
@zhli1142015 zhli1142015 force-pushed the add_from_json branch 3 times, most recently from c3696df to d5d801b Compare December 17, 2024 04:22

Collaborator

@rui-mo rui-mo left a comment

Thanks for iterating!

Resolved review threads (all outdated): velox/docs/functions/spark/json.rst (1), velox/functions/sparksql/specialforms/FromJson.cpp (4)

Contributor

@PHILO-HE PHILO-HE left a comment

Looks basically good.
Are nested complex types supported, e.g., an array whose elements are arrays, structs, or maps? It would be better to clarify this in the documentation and add tests if any are missing. Thanks!

Resolved review threads (outdated): velox/docs/functions/spark/json.rst (1)

Collaborator

@rui-mo rui-mo left a comment

Hi @zhli1142015, could you help document the limitations of the current implementation, as mentioned in #11709 (comment) and #11709 (comment)?

@zhli1142015 zhli1142015 requested a review from rui-mo December 27, 2024 08:05
@zhli1142015
Contributor Author

Updated, thanks.

@zhli1142015
Contributor Author

Kind ping, @rui-mo and @PHILO-HE: do you have any further comments? Thanks.

Collaborator

@rui-mo rui-mo left a comment

Added some comments on the documentation. Thanks.

Resolved review threads: velox/docs/functions/spark/json.rst (4, of which 2 outdated)

@ayushi-agarwal

I am hitting the error below in one from_json case after picking up the recent change. @zhli1142015, any idea what might have gone wrong? I will also try to find more details.

Caused by: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (4294967295 vs. 75)
Retriable: False
Expression: idx < children_.size()
Context: Top-level Expression: from_json(n0_1)
Additional Context: Operator: FilterProject[1] 1 Operator: ValueStream[0] 0
Function: childAt
File: /home/cicdkey/gluten-build/incubator-gluten/dev/../ep/build-velox/build/velox_ep/velox/type/Type.h
Line: 1001
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox9functions8sparksql12_GLOBAL__N_119ExtractJsonTypeImplIN8simdjson8fallback8ondemand5valueEE14KindDispatcherILNS0_8TypeKindE32EvE5applyES8_RNS0_4exec13GenericWriterEb.isra.0
# 4  _ZN8facebook5velox9functions8sparksql12_GLOBAL__N_119ExtractJsonTypeImplIRN8simdjson8fallback8ondemand8documentEE14KindDispatcherILNS0_8TypeKindE32EvE5applyES9_RNS0_4exec13GenericWriterEb.constprop.0
# 5  _ZN8facebook5velox9functions8sparksql12_GLOBAL__N_116FromJsonFunctionILNS0_8TypeKindE32EE19extractJsonToWriterERN8simdjson8fallback8ondemand8documentERNS0_4exec12VectorWriterINS0_7GenericINS0_7AnyTypeELb0ELb0EEEvEE
# 6  _ZNK8facebook5velox9functions8sparksql12_GLOBAL__N_116FromJsonFunctionILNS0_8TypeKindE32EE5applyERKNS0_17SelectivityVectorERSt6vectorISt10shared_ptrINS0_10BaseVectorEESaISD_EERKSB_IKNS0_4TypeEERNS0_4exec7EvalCtxERSD_
# 7  _ZN8facebook5velox4exec4Expr13applyFunctionERKNS0_17SelectivityVectorERNS1_7EvalCtxERSt10shared_ptrINS0_10BaseVectorEE
# 8  _ZN8facebook5velox4exec4Expr11evalAllImplERKNS0_17SelectivityVectorERNS1_7EvalCtxERSt10shared_ptrINS0_10BaseVectorEE
# 9  _ZN8facebook5velox4exec4Expr4evalERKNS0_17SelectivityVectorERNS1_7EvalCtxERSt10shared_ptrINS0_10BaseVectorEEPKNS1_7ExprSetE
# 10 _ZN8facebook5velox4exec7ExprSet4evalEiibRKNS0_17SelectivityVectorERNS1_7EvalCtxERSt6vectorISt10shared_ptrINS0_10BaseVectorEESaISB_EE
# 11 _ZN8facebook5velox4exec13FilterProject7projectERKNS0_17SelectivityVectorERNS1_7EvalCtxE
# 12 _ZN8facebook5velox4exec13FilterProject9getOutputEv
# 13 _ZZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEEENKUlvE8_clEv
# 14 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
# 15 _ZN8facebook5velox4exec6Driver4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 16 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 17 _ZN6gluten24WholeStageResultIterator4nextEv
# 18 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 19 0x00007faa8c60fa10


	at org.apache.gluten.iterator.ClosableIterator.hasNext(ClosableIterator.java:41)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
	at org.apache.gluten.iterator.IteratorsV1$InvocationFlowProtection.hasNext(IteratorsV1.scala:159)
	at org.apache.gluten.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
	at org.apache.gluten.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
	at org.apache.gluten.iterator.IteratorsV1$LifeTimeAccumulator.hasNext(IteratorsV1.scala:100)
	at scala.collection.Iterator.isEmpty(Iterator.scala:387)
	at scala.collection.Iterator.isEmpty$(Iterator.scala:387)
	at org.apache.gluten.iterator.IteratorsV1$LifeTimeAccumulator.isEmpty(IteratorsV1.scala:90)
	at org.apache.gluten.execution.VeloxColumnarToRowExec$.toRowIterator(VeloxColumnarToRowExec.scala:121)

@zhli1142015
Contributor Author

Thanks for reporting this. I think you may have hit the case below. Because the schema names we receive are all lower case, we can't map fields to data correctly. We need to fall back in this case.

        // Write two JSON rows whose keys differ only by case.
        Seq(
          """{"id":1,"Id":2}""",
          """{"id":3,"Id":4}"""
        )
          .toDF("txt")
          .write
          .parquet(path.getCanonicalPath)

        spark.read.parquet(path.getCanonicalPath).createOrReplaceTempView("tbl")

        // The schema fields id and Id collide once names are lower-cased.
        runQueryAndCompare("select txt, from_json(txt, 'id INT, Id INT') from tbl") {
          checkSparkOperatorMatch[ProjectExec]
        }

@ayushi-agarwal

OK. The results also don't match for the case below: Spark returns [null,1] and [null,3], whereas Velox returns [0,1] and [0,3].

        runQueryAndCompare("select txt, from_json(txt, 'id INT, id INT') from tbl") {
          checkSparkOperatorMatch[ProjectExec]
        }

@zhli1142015
Contributor Author

zhli1142015 commented Jan 9, 2025

I don't think Velox is involved here, since Gluten falls back in this case. I'm not sure how you got a different result.
apache/incubator-gluten@cc68e23

@ayushi-agarwal

I think it is because of the n != f.name part of this check. The case I tried is id and id, so the condition evaluates to false. The condition needs to be modified to cover this case:
n != f.name && n.toLowerCase(Locale.ROOT) == f.name.toLowerCase(Locale.ROOT))
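
To make the gap concrete, here is a small Python sketch of one way the check could be widened. It is an illustration only, not the change made in the Gluten PR: original_check mirrors the quoted condition, and has_ambiguous_field_names is a hypothetical replacement that counts names after lower-casing, so exact duplicates are caught as well. Python's str.lower() stands in for toLowerCase(Locale.ROOT), which is equivalent for ASCII names.

```python
from collections import Counter


def original_check(field_names):
    """Mirrors the quoted condition: names differ but match case-insensitively."""
    return any(
        n != f and n.lower() == f.lower()
        for n in field_names
        for f in field_names
    )


def has_ambiguous_field_names(field_names):
    """True if any two fields are equal after lower-casing, exact duplicates included."""
    counts = Counter(name.lower() for name in field_names)
    return any(count > 1 for count in counts.values())


# prints: True False
print(original_check(["id", "Id"]), original_check(["id", "id"]))
# prints: True True
print(has_ambiguous_field_names(["id", "Id"]), has_ambiguous_field_names(["id", "id"]))
```

Counting lower-cased names avoids the pairwise comparison entirely, so the id/id case no longer slips through.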

@zhli1142015
Contributor Author

Does your schema contain duplicate fields? Wouldn't this cause issues for other operations?

@ayushi-agarwal

ayushi-agarwal commented Jan 9, 2025

Ideally this should not happen in a real-world scenario. I was just creating some random test cases to check for behaviour differences.

@zhli1142015
Contributor Author

Got it, thanks for the clarification. I've updated the Gluten PR.

address comments

address comments

address comments

address comments

address comments

minor change

address comments

minor change
7 participants