How to implement Expand operator of Spark in Velox? #5958

JkSelf · 2023-08-02T03:36:18Z

JkSelf
Aug 2, 2023
Collaborator

Spark uses the Expand operator to implement ROLLUP, but it differs from GroupId. Expand is defined purely by a list of projections, each of which covers the aggregation inputs, the grouping expressions, and spark_grouping_id, and it describes how a single input row is expanded into multiple output rows. Unlike GroupId, however, the projections alone do not say which columns are aggregation inputs and which are grouping keys.

Take the lineitem table in TPC-H as an example, with the query select sum(l_suppkey) from lineitem group by ROLLUP(l_orderkey, l_partkey). The physical plan in Spark is shown below.

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[l_orderkey#42L, l_partkey#43L, spark_grouping_id#41L], functions=[sum(l_suppkey#2L)])
   +- Exchange hashpartitioning(l_orderkey#42L, l_partkey#43L, spark_grouping_id#41L, 200), ENSURE_REQUIREMENTS, [plan_id=26]
      +- HashAggregate(keys=[l_orderkey#42L, l_partkey#43L, spark_grouping_id#41L], functions=[partial_sum(l_suppkey#2L)])
         +- Expand [[l_suppkey#2L, l_orderkey#0L, l_partkey#1L, 0], [l_suppkey#2L, l_orderkey#0L, null, 1], [l_suppkey#2L, null, null, 3]], [l_suppkey#2L, l_orderkey#42L, l_partkey#43L, spark_grouping_id#41L]
            +- BatchScan[l_orderkey#0L, l_partkey#1L, l_suppkey#2L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/home/sparkuser/jk/projects/gluten-3.2/gluten-core/src/test/resou..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<l_orderkey:bigint,l_partkey:bigint,l_suppkey:bigint>, PushedFilters: [] RuntimeFilters: []

The inputs and outputs of the Expand and HashAggregate operators are shown below. Because it is hard to distinguish the aggregation inputs from the grouping keys in the Expand operator, we added a new Expand operator to Velox to align with Spark. See #5403.

(10) Expand
Input [3]: [l_orderkey#0L, l_partkey#1L, l_suppkey#2L]
Arguments: [[l_suppkey#2L, l_orderkey#0L, l_partkey#1L, 0], [l_suppkey#2L, l_orderkey#0L, null, 1], [l_suppkey#2L, null, null, 3]], [l_suppkey#2L, l_orderkey#59L, l_partkey#60L, spark_grouping_id#58L]

(11) HashAggregate
Input [4]: [l_suppkey#2L, l_orderkey#59L, l_partkey#60L, spark_grouping_id#58L]
Keys [3]: [l_orderkey#59L, l_partkey#60L, spark_grouping_id#58L]
Functions [1]: [partial_sum(l_suppkey#2L)]
Aggregate Attributes [1]: [sum#68L]
Results [4]: [l_orderkey#59L, l_partkey#60L, spark_grouping_id#58L, sum#69L]
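
To make the row expansion concrete, here is a minimal Python sketch of what the Expand arguments above do to one input row. It is a toy model, not Spark's or Velox's implementation: the `expand` helper, the dict-based row representation, and the sample values are assumptions made for illustration.

```python
from typing import Any, Dict, Iterator, List

# Each projection lists, per output column, either an input column name,
# None (a null literal), or an integer constant (the spark_grouping_id value).
PROJECTIONS = [
    ["l_suppkey", "l_orderkey", "l_partkey", 0],
    ["l_suppkey", "l_orderkey", None, 1],
    ["l_suppkey", None, None, 3],
]
OUTPUT_NAMES = ["l_suppkey", "l_orderkey", "l_partkey", "spark_grouping_id"]


def expand(
    rows: List[Dict[str, Any]],
    projections: List[List[Any]],
    output_names: List[str],
) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: emit one output row per (input row, projection)."""
    for row in rows:
        for projection in projections:
            out = {}
            for name, expr in zip(output_names, projection):
                # A string refers to an input column; anything else is a literal.
                out[name] = row[expr] if isinstance(expr, str) else expr
            yield out


# Sample input row (made-up values).
sample = [{"l_orderkey": 1, "l_partkey": 10, "l_suppkey": 100}]
expanded = list(expand(sample, PROJECTIONS, OUTPUT_NAMES))
for r in expanded:
    print(r)
```

Each input row yields three output rows, one per projection. Nothing in `PROJECTIONS` marks `l_suppkey` as an aggregation input rather than a grouping key; that only becomes visible from the keys of the HashAggregate that follows.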

JkSelf · 2023-08-02T03:36:44Z

JkSelf
Aug 2, 2023
Collaborator Author

@mbasmanova

0 replies

mbasmanova · 2023-08-02T10:30:05Z

mbasmanova
Aug 2, 2023
Collaborator

@JkSelf Thank you for the examples. They are very helpful. I'm hesitant to introduce this operator to Velox since it feels legacy. GroupId is a more elegant, easier to reason about, and less verbose solution for ROLLUP queries. It feels like it should be possible to identify a pattern of Expand followed by Aggregation and convert it to GroupId followed by Aggregation. Is this something you would consider implementing in Gluten?
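
One way such a rewrite might begin is by recovering the grouping sets from the Expand projections, using the keys of the downstream aggregation to separate grouping keys from aggregation inputs. The sketch below is only an illustration under that assumption: `derive_grouping_sets` and its parameters are hypothetical names, not Gluten or Velox API, and it treats a null literal in a projection as "this key is not part of this grouping set".

```python
from typing import Any, List, Tuple


def derive_grouping_sets(
    projections: List[List[Any]],
    output_names: List[str],
    aggregate_keys: List[str],
    group_id_name: str,
) -> Tuple[List[List[str]], List[str]]:
    """Hypothetical helper: infer grouping sets and aggregation inputs."""
    # Grouping keys are the aggregation keys other than the grouping id.
    grouping_keys = [k for k in aggregate_keys if k != group_id_name]
    position = {name: i for i, name in enumerate(output_names)}

    grouping_sets = []
    for projection in projections:
        # A key belongs to this grouping set if it is not nulled out here.
        grouping_sets.append(
            [k for k in grouping_keys if projection[position[k]] is not None]
        )

    # Columns that are not aggregation keys are passed through as inputs.
    aggregation_inputs = [n for n in output_names if n not in aggregate_keys]
    return grouping_sets, aggregation_inputs


grouping_sets, aggregation_inputs = derive_grouping_sets(
    projections=[
        ["l_suppkey", "l_orderkey", "l_partkey", 0],
        ["l_suppkey", "l_orderkey", None, 1],
        ["l_suppkey", None, None, 3],
    ],
    output_names=["l_suppkey", "l_orderkey", "l_partkey", "spark_grouping_id"],
    aggregate_keys=["l_orderkey", "l_partkey", "spark_grouping_id"],
    group_id_name="spark_grouping_id",
)
print(grouping_sets)
print(aggregation_inputs)
```

This fits the ROLLUP plan shown earlier in the thread. Expand plans that Spark generates for other purposes may not follow the same pattern, so a real rule would need to check that the match holds before rewriting.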

25 replies

mbasmanova Sep 14, 2023
Collaborator

@JkSelf This is not quite what I had in mind.

Let's say we construct an array from 2 values: array(a, b). If a is a vector [a1, a2, a3] and b is a vector [b1, b2, b3], then we want to create the elements vector [a1, b1, a2, b2, a3, b3].

To do that, we allocate a vector of size 6: [_, _, _, _, _, _].

Then copy values from a: [a1, _, a2, _, a3, _].

Then copy values from b: [a1, b1, a2, b2, a3, b3].

To copy values from a, we call copyValuesAndNulls with rows = [0, 2, 4] and toSourceRow = [0, 0, 1, 0, 2, 0].

To copy values from b, we call copyValuesAndNulls with rows = [1, 3, 5] and toSourceRow = [0, 0, 0, 1, 0, 2].

Hope this helps.
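
The two copy steps above can be modeled in a few lines of Python. This is a simplified stand-in that only mirrors the indexing described in the comment: `copy_values` is a hypothetical helper operating on plain lists, not Velox's copyValuesAndNulls, and it ignores null handling.

```python
from typing import Any, List


def copy_values(
    target: List[Any],
    source: List[Any],
    rows: List[int],
    to_source_row: List[int],
) -> None:
    """For each target position in `rows`, copy source[to_source_row[position]]."""
    for row in rows:
        target[row] = source[to_source_row[row]]


a = ["a1", "a2", "a3"]
b = ["b1", "b2", "b3"]

# Allocate the elements vector once, then fill it column by column.
elements: List[Any] = [None] * 6
copy_values(elements, a, rows=[0, 2, 4], to_source_row=[0, 0, 1, 0, 2, 0])
copy_values(elements, b, rows=[1, 3, 5], to_source_row=[0, 0, 0, 1, 0, 2])
print(elements)
```

Only the entries of `to_source_row` at the positions listed in `rows` are read; the zeros in between are placeholders.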

mbasmanova Sep 14, 2023
Collaborator

@JkSelf Here is what this might look like: https://github.com/facebookincubator/velox/compare/main...mbasmanova:velox-1:opt-array-constructor?expand=1

JkSelf Sep 14, 2023
Collaborator Author

@mbasmanova Thanks for your help. This approach reduces the Project CPU time from 21.04s to 10.29s. However, it is still much slower than the Expand operator, which takes 101.74ms of CPU time.

Here is the printPlanWithStats of Expand:

-- Expand[[l_orderkey, l_partkey, l_suppkey, 0], [l_orderkey, null, l_suppkey, 1], [null, null, l_suppkey, 2]] -> l_orderkey:BIGINT, l_partkey:BIGINT, l_suppkey:BIGINT, group_id_0:BIGINT
   Output: 80357184 rows (1.47GB, 8064 batches), Cpu time: 101.74ms, Blocked wall time: 0ns, Peak memory: 256B, Memory allocations: 8064, Threads: 4
  -- TableScan[table: lineitem] -> l_suppkey:BIGINT, l_orderkey:BIGINT, l_partkey:BIGINT
     Input: 26785728 rows (752.62MB, 2688 batches), Output: 26785728 rows (752.62MB, 2688 batches), Cpu time: 2.32s, Blocked wall time: 0ns, Peak memory: 308.52MB, Memory allocations: 15388, Threads: 4, Splits: 10

Here is the printPlanWithStats of Project and Unnest after copyValuesAndNulls optimization:

-- Unnest[c0, c1, c2] -> c0_e:BIGINT, c1_e:BIGINT, c2_e:BIGINT
   Output: 80357184 rows (1.98GB, 2688 batches), Cpu time: 2.82s, Blocked wall time: 0ns, Peak memory: 356.00KB, Memory allocations: 21504, Threads: 4
  -- Project[expressions: (c0:ARRAY<BIGINT>, array_constructor(ROW["l_suppkey"],ROW["l_suppkey"],ROW["l_suppkey"])), (c1:ARRAY<BIGINT>, array_constructor(ROW["l_orderkey"],ROW["l_orderkey"],null)), (c2:ARRAY<BIGINT>, array_constructor(ROW["l_partkey"],null,null))] -> c0:ARRAY<BIGINT>, c1:ARRAY<BIGINT>, c2:ARRAY<BIGINT>
     Output: 26785728 rows (2.72GB, 2688 batches), Cpu time: 10.29s, Blocked wall time: 0ns, Peak memory: 1.04MB, Memory allocations: 40332, Threads: 4
    -- TableScan[table: lineitem] -> l_suppkey:BIGINT, l_orderkey:BIGINT, l_partkey:BIGINT
       Input: 26785728 rows (752.62MB, 2688 batches), Output: 26785728 rows (752.62MB, 2688 batches), Cpu time: 2.13s, Blocked wall time: 0ns, Peak memory: 313.31MB, Memory allocations: 16519, Threads: 4, Splits: 10

Here is the printPlanWithStats of Project and Unnest before copyValuesAndNulls optimization:

-- Unnest[c0, c1, c2] -> c0_e:BIGINT, c1_e:BIGINT, c2_e:BIGINT
   Output: 80357184 rows (1.98GB, 2688 batches), Cpu time: 2.73s, Blocked wall time: 0ns, Peak memory: 356.00KB, Memory allocations: 21504, Threads: 4
  -- Project[expressions: (c0:ARRAY<BIGINT>, array_constructor(ROW["l_suppkey"],ROW["l_suppkey"],ROW["l_suppkey"])), (c1:ARRAY<BIGINT>, array_constructor(ROW["l_orderkey"],ROW["l_orderkey"],null)), (c2:ARRAY<BIGINT>, array_constructor(ROW["l_partkey"],null,null))] -> c0:ARRAY<BIGINT>, c1:ARRAY<BIGINT>, c2:ARRAY<BIGINT>
     Output: 26785728 rows (2.72GB, 2688 batches), Cpu time: 21.04s, Blocked wall time: 0ns, Peak memory: 1.04MB, Memory allocations: 40332, Threads: 4
    -- TableScan[table: lineitem] -> l_suppkey:BIGINT, l_orderkey:BIGINT, l_partkey:BIGINT
       Input: 26785728 rows (752.62MB, 2688 batches), Output: 26785728 rows (752.62MB, 2688 batches), Cpu time: 2.07s, Blocked wall time: 0ns, Peak memory: 308.52MB, Memory allocations: 16519, Threads: 4, Splits: 10
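
For readers comparing the two plans, the following Python sketch models what the Project and Unnest pair computes for a single row. It is a toy illustration with made-up sample values; `project_arrays` and `unnest` are hypothetical helpers that mirror the array_constructor expressions in the plan, not Velox code.

```python
from typing import Any, Dict, Iterator, List


def project_arrays(row: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Build one array per output column, as the Project node does."""
    return {
        "c0": [row["l_suppkey"], row["l_suppkey"], row["l_suppkey"]],
        "c1": [row["l_orderkey"], row["l_orderkey"], None],
        "c2": [row["l_partkey"], None, None],
    }


def unnest(projected: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Turn the i-th elements of all arrays into the i-th output row."""
    for c0_e, c1_e, c2_e in zip(projected["c0"], projected["c1"], projected["c2"]):
        yield {"c0_e": c0_e, "c1_e": c1_e, "c2_e": c2_e}


# Sample input row (made-up values).
row = {"l_orderkey": 1, "l_partkey": 10, "l_suppkey": 100}
unnested = list(unnest(project_arrays(row)))
for r in unnested:
    print(r)
```

Both plans produce three output rows per input row (26785728 × 3 = 80357184 in the stats above). The difference in this model is that Project first materializes an array for every output column of every row before Unnest flattens it again, which is consistent with the 2.72GB Project output reported above.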

mbasmanova Sep 14, 2023
Collaborator

@JkSelf Thank you for re-running the benchmark.

mbasmanova Sep 14, 2023
Collaborator

FYI: #6567

iwanttobepowerful · 2023-11-28T09:45:41Z

iwanttobepowerful
Nov 28, 2023

@JkSelf Hi, regarding #6566 and #6568: how much will performance improve after these two PRs are merged?

2 replies

iwanttobepowerful Nov 28, 2023

21s to 10s?

JkSelf Nov 28, 2023
Collaborator Author

@iwanttobepowerful Yes.
