[GLUTEN-4763][VL] Add RewriteTypedImperativeAggregate rule for collect_list #4764
What changes were proposed in this pull request?
For some typed imperative aggregate functions, such as `collect_list` and `collect_set`, Spark stores the aggregation buffer as `BinaryType` in the partial and partial-merge phases, even though the output type of these two functions is `ArrayType`. Gluten has no such special handling. For the Velox backend, the aggregate function's raw data type is used for the aggregation buffer in the partial and partial-merge phases. Because a Gluten plan inherits attributes from its corresponding Spark plan, this causes a type mismatch in the subsequent shuffle operator: for `collect_list`/`collect_set`, shuffle expects `BinaryType` but gets `ArrayType`.

So we need a rule, `RewriteTypedImperativeAggregate`, to rewrite `collect_list` and `collect_set` for the Velox backend. It references the implementation in #2669.

Velox does not register the companion functions for `collect_set`, so enabling `collect_set` requires modifications in Velox.

Currently, there are still some issues, such as null semantics, where Spark and Velox differ in many ways. At present, `collect_list` supports ignoring null values, but when all inputs are null, Spark returns an empty array while Velox returns null. `collect_set` does not yet support ignoring null values, and it has the same problem when all inputs are null. Adaptations on the Velox side are still needed.

How was this patch tested?
`collect_list` in `VeloxAggregateFunctionsSuite.distinct functions`
`VeloxAggregateFunctionsSuite.collect_list null inputs`
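
For reviewers, the `collect_list` null-handling gap described in this PR can be illustrated with a small pure-Python model. This is only a sketch of the semantics as stated in the description, not Spark, Gluten, or Velox code. The function names `spark_collect_list`, `velox_collect_list_model`, and `align_with_spark` are hypothetical and exist only for this illustration.

```python
from typing import List, Optional, Sequence


def spark_collect_list(values: Sequence[Optional[int]]) -> List[int]:
    """Model of the Spark behaviour described in this PR:
    nulls are ignored, and all-null input yields an empty array."""
    return [v for v in values if v is not None]


def velox_collect_list_model(values: Sequence[Optional[int]]) -> Optional[List[int]]:
    """Model of the Velox behaviour described in this PR:
    nulls are ignored, but all-null input yields null (None here).
    Treating empty input the same way is an assumption of this sketch."""
    kept = [v for v in values if v is not None]
    return kept if kept else None


def align_with_spark(result: Optional[List[int]]) -> List[int]:
    """Hypothetical adaptation step: map a null result to an empty array
    so the final output matches the Spark model above."""
    return [] if result is None else result


if __name__ == "__main__":
    mixed = [1, None, 2]
    all_null = [None, None]

    # Both models agree when at least one non-null value is present.
    print(spark_collect_list(mixed), velox_collect_list_model(mixed))

    # They disagree when every input is null.
    print(spark_collect_list(all_null), velox_collect_list_model(all_null))

    # After the hypothetical adaptation, the results line up again.
    print(align_with_spark(velox_collect_list_model(all_null)))
```

The sketch only captures the all-null case for `collect_list`. It does not model `collect_set`, where ignoring nulls is not yet supported according to the description.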