[GLUTEN-7261][CORE] Support offloading partial filters to native scan #8082
Conversation
Run Gluten Clickhouse CI on x86
cc @rui-mo
transform.copy(dataFilters = PushDownUtil.pushFilters(scanExec.dataFilters))
} else {
  transform
}
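The snippet above replaces the scan's data filters with only the ones that can be pushed down to the native scan. A rough illustration of that partitioning idea (Python pseudocode; the function name, the `supported` predicate, and the toy filter strings are all hypothetical, not the actual PushDownUtil API):

```python
# Hypothetical sketch of the filter-partitioning idea behind
# PushDownUtil.pushFilters: split a scan's filters into an offloadable
# subset and a remainder left for vanilla Spark. Names are illustrative.

def push_filters(data_filters, supported):
    """Split filters into (pushed, remaining) using a `supported`
    predicate that says whether a filter can run in the native scan."""
    pushed = [f for f in data_filters if supported(f)]
    remaining = [f for f in data_filters if not supported(f)]
    return pushed, remaining

# Toy example: treat plain equality filters as "supported" and anything
# involving a UDF as unsupported.
filters = ["a = 1", "udf(b) > 0", "c = 'x'"]
pushed, remaining = push_filters(
    filters, lambda f: "=" in f and "udf" not in f)
```

In the PR the remaining (unsupported) filters stay in a vanilla Spark Filter above the native scan, which is why the resulting plan still contains a C2R and a vanilla filter.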
The code in ScanTransformerFactory is used by both the validator and the offload rules. It feels a little weird to do validation in it. Do we have better choices?
How about using only pushedFilter here and relying on PushDownFilterToScan for the subsequent pushdown?
Sounds feasible to me. Thanks.
Test failure seems unrelated.
Thanks. Added some questions.
gluten-substrait/src/main/scala/org/apache/spark/sql/utils/PushDownUtil.scala
Thanks! Looks good overall.
gluten-substrait/src/main/scala/org/apache/gluten/expression/ExpressionConverter.scala
gluten-substrait/src/main/scala/org/apache/gluten/expression/ExpressionConverter.scala
gluten-substrait/src/main/scala/org/apache/gluten/execution/BatchScanExecTransformer.scala
Regarding the plan change: we used to get 'vanilla scan + vanilla filter' for an unsupported filter, and now we get 'native scan + c2r + vanilla filter'. The new plan offloads the scan to native, but when the c2r is large and time-consuming we might not get a performance improvement. Can our RAS strategy cover this case? cc: @zhztheplayer Thanks.
RAS currently cannot solve this problem, but from our production point of view the cost of c2r is relatively small. @zhztheplayer What do you think?
With the previous plan, C2R is only needed for the rows that survive the filter, whose number might be largely reduced, while in the new plan all rows need to be converted to rows. Perhaps in some cases the speedup of the native scan cannot compensate for this overhead, and we might see a performance regression. Perhaps we could optimize the plan in the future to choose between the two options with the RAS strategy.
Before the PR the plan is Scan + ColumnToRow + Filter; after the PR it is nativeScan + VeloxColumnToRow + Filter. The number of rows after the native scan should be less than or equal to that after the vanilla scan.
Why do we have the C2R in between? Because the unsupported filters have to be computed by vanilla Spark. Please note that if a filter causes the scan to fall back, the filter itself must also fall back.
Vanilla Spark reads Parquet with vectorization by default.
@zml1206 I see your point. Thanks for explaining!
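The tradeoff debated in this thread can be made concrete with a toy cost model (every constant below is invented for illustration, nothing is measured): the old plan pays C2R only on rows surviving the filter, while the new plan pays a cheaper per-row native scan but converts every row the scan emits before the unsupported filter runs.

```python
# Toy cost model for the two plans discussed above.
# All per-row cost constants are illustrative, not measurements.

def old_plan_cost(rows, selectivity, scan_cost=1.0, c2r_cost=0.5):
    """Vanilla vectorized scan + vanilla filter: C2R applies only to
    the fraction of rows that survive the filter."""
    return rows * scan_cost + rows * selectivity * c2r_cost

def new_plan_cost(rows, selectivity, native_scan_cost=0.7, c2r_cost=0.5):
    """Native scan + C2R + vanilla filter: every row emitted by the
    scan is converted before the unsupported filter runs."""
    return rows * native_scan_cost + rows * c2r_cost

# A highly selective filter favors the old plan (few rows to convert);
# a permissive filter favors the faster native scan.
rows = 1_000_000
old_wins_when_selective = old_plan_cost(rows, 0.01) < new_plan_cost(rows, 0.01)
new_wins_when_permissive = new_plan_cost(rows, 0.9) < old_plan_cost(rows, 0.9)
```

Under these made-up constants the crossover depends on filter selectivity, which is exactly the kind of decision a cost-based strategy like RAS could make if it had cardinality estimates.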
What changes were proposed in this pull request?
Scan used to fallback if there was an unsupported filter. This PR filters out the supported expressions and offloads scan as much as possible to improve the performance. Before this change, the plan was "vanilla vectorized scan + c2r + vanilla filter" when scan contained an unsupported filter, and now the plan becomes "native scan + c2r + vanilla filter".
(Fixes: #7261)
How was this patch tested?
UT