Spark already does filter pushdown as part of join optimization. If a conditional only involves one side of a join, it is pushed down below the join. Otherwise it cannot be pushed down, so the filter happens after the join. In fact, filtering after the join only works for inner joins, because for all other join types you may need to evaluate the conditional for every combination of matching keys to decide whether the condition allows or excludes the row from being created.

We are working with cudf to be able to push the conditional into the join itself, so we can reduce the amount of data that is materialized and extend conditional joins to non-inner joins. But that is not likely to show up until the 0.16 release of cudf at the earliest.
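If it helps to see where Catalyst places the filter, here is a minimal Scala sketch (table and column names are made up for illustration, not from any real workload). It contrasts a predicate that touches only one side of the join, which gets pushed below the join, with one that references columns from both sides, which can only be evaluated as part of, or after, the join:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object FilterPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("filter-pushdown-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical example tables.
    val orders = Seq((1, 100.0), (2, 250.0), (3, 75.0))
      .toDF("cust_id", "amount")
    val customers = Seq((1, "US", 120.0), (2, "DE", 300.0), (3, "US", 50.0))
      .toDF("cust_id", "country", "credit_limit")

    // Case 1: the predicate only involves the customers side,
    // so Catalyst pushes the Filter below the join.
    orders.join(customers, Seq("cust_id"), "inner")
      .filter(col("country") === "US")
      .explain()  // plan shows the Filter under the join, on the customers side

    // Case 2: the predicate references columns from both sides,
    // so it cannot be pushed below the join and is evaluated with
    // (or after) the join. For non-inner joins this post-join
    // filtering is not equivalent, which is the case the
    // conditional-join work in cudf is meant to handle.
    orders.join(customers,
        orders("cust_id") === customers("cust_id") &&
          orders("amount") > customers("credit_limit"),
        "inner")
      .explain()

    spark.stop()
  }
}
```

Running `.explain()` on both plans is the easiest way to confirm which side of the join the filter lands on in your own query.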