[GLUTEN-7261][CORE] Support offloading partial filters to native scan #8082

zml1206 · 2024-11-28T08:37:36Z

What changes were proposed in this pull request?

Scan used to fall back if it contained an unsupported filter. This PR selects the supported filter expressions and offloads as much of the scan as possible to improve performance. Before this change, the plan was "vanilla vectorized scan + c2r + vanilla filter" when the scan contained an unsupported filter; now the plan becomes "native scan + c2r + vanilla filter".

(Fixes: #7261)
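
To illustrate the idea in the description, here is a minimal sketch of splitting a scan's filters into a natively supported subset and the rest. The expression classes, `nativeSupports`, and `split` are hypothetical stand-ins written for this example; they are not Gluten's or Spark's actual classes, and the real support check is backend-specific.

```scala
// Hypothetical, simplified expression model; not Gluten's real classes.
sealed trait Expr
final case class GreaterThan(column: String, value: Int) extends Expr
final case class IsNotNull(column: String) extends Expr
// Stands in for any filter the native backend cannot evaluate.
final case class UdfCall(name: String, column: String) extends Expr

object PartialPushDown {
  // Assumption: the backend supports simple comparisons and null checks only.
  def nativeSupports(e: Expr): Boolean = e match {
    case _: GreaterThan | _: IsNotNull => true
    case _: UdfCall => false
  }

  // Returns (filters that can be offloaded with the scan, filters that cannot).
  def split(filters: Seq[Expr]): (Seq[Expr], Seq[Expr]) =
    filters.partition(nativeSupports)
}
```

In this sketch only the first element of the pair would be handed to the native scan. The vanilla filter above the scan is assumed to keep evaluating the full condition, so the pushed subset only reduces the rows the scan emits.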

How was this patch tested?

UT

github-actions · 2024-11-28T08:37:53Z

#7261

github-actions · 2024-11-28T08:38:07Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-11-28T08:55:50Z

Run Gluten Clickhouse CI on x86

zml1206 · 2024-11-28T09:02:08Z

cc @FelixYBW it can resolve #7261

zhztheplayer · 2024-12-04T02:07:14Z

cc @rui-mo

zhztheplayer · 2024-12-04T02:12:27Z

gluten-substrait/src/main/scala/org/apache/gluten/execution/ScanTransformerFactory.scala

+          transform.copy(dataFilters = PushDownUtil.pushFilters(scanExec.dataFilters))
+        } else {
+          transform
+        }


The code in ScanTransformerFactory is used by validator and offload rules. It feels a little weird to do validation in it? Do we have better choices?

How about using only pushedFilter here and relying on PushDownFilterToScan for the subsequent pushdown?

Sounds feasible to me. Thanks.
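
The separation discussed above (the factory only builds the scan, and a later rule decides what else to push down) might look roughly like the following. This is a toy model under stated assumptions: `Plan`, `Scan`, `Filter`, and `PushFilterToScanSketch` are hypothetical names for illustration, conditions are plain strings, and the code does not reproduce the real PushDownFilterToScan rule.

```scala
// Hypothetical mini plan model; names are illustrative only.
sealed trait Plan
final case class Scan(table: String, pushed: Seq[String]) extends Plan
final case class Filter(conditions: Seq[String], child: Plan) extends Plan

object PushFilterToScanSketch {
  // `supported` is an assumed predicate for what the native scan can evaluate.
  def apply(plan: Plan, supported: String => Boolean): Plan = plan match {
    case Filter(conds, scan: Scan) =>
      // Push only supported conditions that the scan does not already carry.
      val extra = conds.filter(c => supported(c) && !scan.pushed.contains(c))
      // The filter keeps every condition; the scan receives the supported ones.
      Filter(conds, scan.copy(pushed = scan.pushed ++ extra))
    case Filter(conds, child) =>
      Filter(conds, apply(child, supported))
    case other => other
  }
}
```

Keeping the pushdown in its own rule means the code that constructs the scan transformer needs no validation logic, which is the concern raised in the review comment.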

github-actions · 2024-12-04T06:15:38Z

Run Gluten Clickhouse CI on x86

zml1206 · 2024-12-04T08:31:43Z

Test failure seems unrelated.

rui-mo

Thanks. Added some questions.

gluten-substrait/src/main/scala/org/apache/spark/sql/utils/PushDownUtil.scala

github-actions · 2024-12-06T05:19:04Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-06T07:14:13Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-06T07:26:32Z

Run Gluten Clickhouse CI on x86

rui-mo

Thanks! Looks good overall.

gluten-substrait/src/main/scala/org/apache/gluten/expression/ExpressionConverter.scala

gluten-substrait/src/main/scala/org/apache/gluten/execution/BatchScanExecTransformer.scala

rui-mo

Regarding the plan change, we used to get 'vanilla scan + vanilla filter' for unsupported filter, and now we can get 'native scan + c2r + vanilla filter'. The new plan offloads 'scan' to native while I assume when the c2r is large and time-consuming we might not get performance improvement. Can our RAS strategy cover this case? cc: @zhztheplayer Thanks.

zml1206 · 2024-12-10T09:41:40Z

Regarding the plan change, we used to get 'vanilla scan + vanilla filter' for unsupported filter, and now we can get 'native scan + c2r + vanilla filter'. The new plan offloads 'scan' to native while I assume when the c2r is large and time-consuming we might not get performance improvement. Can our RAS strategy cover this case? cc: @zhztheplayer Thanks.

RAS currently cannot solve this problem, but from our production point of view, the cost of c2r is relatively small. @zhztheplayer What do you think?

github-actions · 2024-12-10T10:09:41Z

Run Gluten Clickhouse CI on x86

zml1206 · 2024-12-10T23:02:39Z

Run Gluten Clickhouse CI on x86

rui-mo · 2024-12-11T08:35:22Z

from our production point of view, the cost of c2r is relatively small

With the previous plan, C2R is only needed for the rows that remain after the filter, whose number might be greatly reduced, while in the new plan, all scanned rows need to be converted to row format. Perhaps in some cases the speedup of native scan cannot compensate for this overhead, and we might get a performance regression. Perhaps we could optimize the plan in the future to choose between the two options with the RAS strategy.

zml1206 · 2024-12-11T08:41:09Z

from our production point of view, the cost of c2r is relatively small

With the previous plan, C2R is only needed for the rows that remain after the filter, whose number might be greatly reduced, while in the new plan, all scanned rows need to be converted to row format. Perhaps in some cases the speedup of native scan cannot compensate for this overhead, and we might get a performance regression.

Before the PR, the plan is Scan + ColumnToRow + Filter; after the PR, it is nativeScan + VeloxColumnToRow + Filter. The number of rows after nativeScan should be less than or equal to that after scan.
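
The row-count argument above can be checked with a toy model. Everything here is an assumption made for illustration: a "row" is just an `Int`, the vanilla scan is simplified to return every row it reads, and the native scan is assumed to apply the supported predicate row by row. None of this is Gluten or Spark code.

```scala
// Toy model of the two plans; a "row" is just an Int.
object C2RInputSketch {
  // Assumption: the backend can evaluate this predicate natively.
  val supported: Int => Boolean = _ > 10
  // Assumption: this predicate keeps the Filter operator in vanilla Spark.
  val unsupported: Int => Boolean = _ % 2 == 0

  final case class Result(rowsIntoC2R: Int, output: Seq[Int])

  // Before: the scan emits all rows, then C2R, then the full vanilla filter.
  def before(data: Seq[Int]): Result = {
    val scanned = data
    Result(scanned.size, scanned.filter(r => supported(r) && unsupported(r)))
  }

  // After: the native scan applies the supported predicate first,
  // then C2R, then the same full vanilla filter.
  def after(data: Seq[Int]): Result = {
    val scanned = data.filter(supported)
    Result(scanned.size, scanned.filter(r => supported(r) && unsupported(r)))
  }
}
```

Under these assumptions both plans produce the same output, and the number of rows entering C2R in the "after" plan can never exceed the number in the "before" plan.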

rui-mo · 2024-12-11T08:43:27Z

Before the PR, the plan is Scan + ColumnToRow + Filter

Why is there a C2R in between, given that unsupported filters are computed by vanilla Spark? Please note that if a filter causes the Scan to fall back, the filter must fall back as well.

zml1206 · 2024-12-11T08:45:24Z

Why is there a C2R in between, given that unsupported filters are computed by vanilla Spark?

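Vanilla Spark reads Parquet with the vectorized reader by default, so the vanilla scan outputs columnar batches and a ColumnarToRow is needed before the row-based filter.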

rui-mo · 2024-12-11T08:52:18Z

@zml1206 I see your point. Thanks for explaining!

github-actions bot added the CORE and VELOX labels Nov 28, 2024

zml1206 changed the title ~~[GLUTEN-7261][CORE] Use pushedFilters to offload scan when filter need fallbac~~ [GLUTEN-7261][CORE] Use pushedFilters to offload scan when filter need fallback Nov 28, 2024

zml1206 requested a review from zhztheplayer November 29, 2024 00:57

zhztheplayer reviewed Dec 4, 2024

zml1206 changed the title ~~[GLUTEN-7261][CORE] Use pushedFilters to offload scan when filter need fallback~~ [GLUTEN-7261][CORE] Use pushedFilters instead of dataFilters to offload scan Dec 4, 2024

rui-mo reviewed Dec 4, 2024

gluten-substrait/src/main/scala/org/apache/spark/sql/utils/PushDownUtil.scala (outdated)

zml1206 marked this pull request as draft December 6, 2024 01:12

zml1206 force-pushed the GLUTEN-7261 branch from eaeb8f2 to 2d59e00 Compare December 6, 2024 05:18

github-actions bot added DATA_LAKE and removed VELOX labels Dec 6, 2024

zml1206 changed the title ~~[GLUTEN-7261][CORE] Use pushedFilters instead of dataFilters to offload scan~~ [GLUTEN-7261][CORE] Push partial filters to offload scan when filter need fallback Dec 6, 2024

zml1206 marked this pull request as ready for review December 6, 2024 07:13

github-actions bot added the VELOX label Dec 6, 2024

zml1206 requested a review from rui-mo December 10, 2024 05:32

rui-mo approved these changes Dec 10, 2024

rui-mo reviewed Dec 10, 2024

zml1206 added 2 commits December 10, 2024 18:07

[CORE] Push partial filters to offload scan when filter need fallback

b12127a

update

41855f9

zml1206 added 2 commits December 10, 2024 18:07

add ut

bdf62b7

update

c4a7368

zml1206 force-pushed the GLUTEN-7261 branch from dbb0842 to c4a7368 Compare December 10, 2024 10:09

rui-mo approved these changes Dec 11, 2024

rui-mo changed the title ~~[GLUTEN-7261][CORE] Push partial filters to offload scan when filter need fallback~~ [GLUTEN-7261][CORE] Support offloading partial filters to native scan Dec 11, 2024

rui-mo merged commit 1036c96 into apache:main Dec 11, 2024
48 checks passed
