
[VL] Remaining issues for typed imperative aggregate #4763

Open
liujiayi771 opened this issue Feb 23, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@liujiayi771
Contributor

Description

  • collect_list
    • If all input values are null, vanilla Spark returns an empty array, but Velox returns null (see the sketch after this list).
  • collect_set
    • Velox does not register the companion functions of set_agg.
    • If all input values are null, vanilla Spark returns an empty array, but Velox returns null.
    • If some (but not all) of the input values are null, vanilla Spark ignores the null inputs, but Velox does not.
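A minimal sketch of the null-handling differences described above, assuming a spark-shell session with Gluten + Velox enabled (the column names and data are illustrative, not from the original report):

import org.apache.spark.sql.functions._
import spark.implicits._

// Group 1: all values are null. Vanilla Spark's collect_list returns [],
// while Velox's array_agg returns null for the same input.
val allNull = Seq[(Int, Option[Int])]((1, None), (1, None)).toDF("k", "v")
allNull.groupBy("k").agg(collect_list("v")).show()

// Group 1: a mix of null and non-null values. Vanilla Spark's collect_set
// ignores the null and returns [10]; Velox's set_agg does not ignore it.
val mixed = Seq[(Int, Option[Int])]((1, Some(10)), (1, None)).toDF("k", "v")
mixed.groupBy("k").agg(collect_set("v")).show()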

Exclude UTs:

  1. SPARK-31993: concat_ws in agg function with plenty of string/array types columns in GlutenStringFunctionsSuite
     Reason: if all input values are null, collect_list in vanilla Spark returns an empty array, but array_agg in Velox returns null.
@felipepessoto
Contributor

felipepessoto commented May 29, 2024

@liujiayi771 do you know if collect_set is not expected to work with complex types when a value is null? For example, this works with vanilla Spark but fails when Gluten is enabled:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

// The nested lastUpdated field is null, which triggers the Velox error below.
val jsonStr = """{"txn":{"appId":"txnId","version":0,"lastUpdated":null}}"""
val jsonSchema = StructType(Seq(StructField("txn",
  StructType(Seq(
    StructField("appId", StringType, true),
    StructField("lastUpdated", LongType, true),
    StructField("version", LongType, true))),
  true)))
val df = spark.read.schema(jsonSchema).json(Seq(jsonStr).toDS).select(collect_set(col("txn")))
df.head

Error:

[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (c7f5 executor driver): org.apache.gluten.exception.GlutenException: java.lang.RuntimeException: Exception: VeloxUserError
[info] Error Source: USER
[info] Error Code: INVALID_ARGUMENT
[info] Reason: ROW comparison not supported for values that contain nulls
[info] Retriable: False
[info] Expression: !decoded.base()->containsNullAt(indices[index])
[info] Function: checkNestedNulls
[info] File: /__w/1/s/Velox/velox/functions/lib/CheckNestedNulls.cpp
[info] Line: 34

@liujiayi771
Contributor Author

@felipepessoto This is a known issue; the Velox backend does not yet support this case.
