Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-4763][VL] Add RewriteTypedImperativeAggregate rule for collect_list #4764

Merged
merged 1 commit into from
Feb 26, 2024

Conversation

liujiayi771
Copy link
Contributor

What changes were proposed in this pull request?

For some typed imperative aggregate functions, like collect_list/collect_set, Spark will make agg buffer stored as BinaryType in partial/ partial merge phase, even though these two functions' output is ArrayType. Gluten has no such special handling. For Velox backend, agg function's raw data type is used for agg buffer in partial/partial merge. Thus, there will be a type mismatch issue in the latter shuffle operator, as Gluten plan inherits attributes from its corresponding Spark plan. For collect_list/collect_set, shuffle expects BinaryType, but gets ArrayType.

So, we need a rule RewriteTypedImperativeAggregate to rewrite collect_list and collect_set for Velox backend. Referenced the implementation in #2669.

Velox does not register the companion functions for collect_set, to enable collect_set, modifications need to be made in Velox.

Currently, there are still some issues, such as the semantics of null, where Spark and Velox have many differences. At present, collect_list supports ignoring null values, but when all inputs are null, Spark returns an empty array, while Velox returns null. collect_set does not yet support ignoring null values, and there is the same problem when all inputs are null. Adaptations on the Velox side are still needed.

How was this patch tested?

  • Exists CI.
  • Add collect_list in VeloxAggregateFunctionsSuite.distinct functions
  • Add a new test case VeloxAggregateFunctionsSuite.collect_list null inputs

Copy link

#4763

Copy link

Run Gluten Clickhouse CI

@liujiayi771 liujiayi771 force-pushed the rewrite-typed-imperative-agg branch from dbaabba to 744dd73 Compare February 23, 2024 13:54
Copy link

Run Gluten Clickhouse CI

@liujiayi771 liujiayi771 force-pushed the rewrite-typed-imperative-agg branch from 744dd73 to f630c5b Compare February 23, 2024 13:56
Copy link

Run Gluten Clickhouse CI

@liujiayi771
Copy link
Contributor Author

cc @PHILO-HE, thanks.

@liujiayi771 liujiayi771 force-pushed the rewrite-typed-imperative-agg branch from f630c5b to 0bfd92b Compare February 24, 2024 03:01
Copy link

Run Gluten Clickhouse CI

@liujiayi771 liujiayi771 force-pushed the rewrite-typed-imperative-agg branch from 0bfd92b to 908cfa2 Compare February 24, 2024 05:15
Copy link

Run Gluten Clickhouse CI

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks so much for your work!
I note collect_list/set were excluded for window operator. Maybe, they can be removed now. It can be done in a separate pr. Thanks!
https://github.com/oap-project/gluten/blob/bda60be4c44621c3722ecfbe55e98fd22f06b04a/cpp/velox/substrait/SubstraitToVeloxPlanValidator.cc#L645

@PHILO-HE PHILO-HE merged commit 427cb40 into apache:main Feb 26, 2024
19 checks passed
@liujiayi771 liujiayi771 deleted the rewrite-typed-imperative-agg branch February 26, 2024 09:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants