Question about existence join semantics #3555
-
I'm debugging Spark's UT with Gluten + Velox, and found a query has different semantics from Spark.
I built this UT in HashJoinTest.cpp to reproduce this query, but the result of PlanNode execution is empty, so it fails.
The reason is that "match" column will return null if probe side join key is null, or build side contains null keys, and results in the last There's a fix to return false for the empty build side #3275. Why non-empty build side use different semantic? |
Beta Was this translation helpful? Give feedback.
Replies: 11 comments 7 replies
-
@mbasmanova Could you take a look? |
Beta Was this translation helpful? Give feedback.
-
CC: @rui-mo |
Beta Was this translation helpful? Give feedback.
-
The logic in HashProbe::fillOutput for semi join project w/ filter is incorrect. If build side it not empty, current logic emits NULL for all rows where probe-side join key is null. Instead, it needs to apply the filter and return NULL if at least one row passes the filter and return false if no row passes the filter. Similarly, when there is no match and build side has rows with null join keys, we need to evaluate the filter to null rows and return NULL if at least one row passes and return false if no rows pass. This logic is similar what we do for anti join. We need to refactor and unify semi join project and anti join code paths to avoid code duplication.
|
Beta Was this translation helpful? Give feedback.
-
The semantics of IN and EXISTS subqueries are different, hence, we need 2 types of semi join project: null-aware (IN) and regular (EXISTS). Null-aware (IN) left semi join project w/o extra filter:
Null-aware (IN) left semi join project w/ extra filter:
Regular (EXISTS) left semi join project w/o extra filter:
Regular (EXISTS) left semi join project w/ extra filter:
The value of 'match' column in the output of regular semi join is either TRUE or FALSE, but never NULL. The value of 'match' column in the output of null-aware semi join can be TRUE, FALSE, or NULL. When evaluating regular semi join, build side can ignore rows with null join keys. LeftSemiProjectJoin(a, b) = RightSemiProjectJoin(b, a) |
Beta Was this translation helpful? Give feedback.
-
Thank you @mbasmanova, based on your explanation, I think the semantic of Spark's Existence join should match Velox's regular semi-join. We simply added a projection to map null to false in Gluten as workaround, and can be removed when regular semi-join is supported in Velox. I also encountered another unmatched SQL, which should be AntiJoin w/ filter. Is this also the same bug mentioned in #3343 that need to be fixed?
UT to reproduce:
The query only outputs 2 rows whose join key is null, and omits other rows.
|
Beta Was this translation helpful? Give feedback.
-
@marin-ma For the anti join, NOT IN would map to kNullAwareAnti join type, not kAnti. See https://facebookincubator.github.io/velox/develop/anti-join.html#null-aware-anti-join Would you try to use kNullAwareAnti join type and let us know if you still see incorrect results? |
Beta Was this translation helpful? Give feedback.
-
@marin-ma @rui-mo Rong, Rui, I do see a bug in anti join with filter. Here is a fix: #3571. |
Beta Was this translation helpful? Give feedback.
-
Here is a side-by-side comparison of null-aware (IN) and regular (EXISTS) semantics. |
Beta Was this translation helpful? Give feedback.
-
@marin-ma @rui-mo I have a draft PR for adding support for null-aware semi join project w/ filter: #3599 I'll be out till the end of the year. Will pick up this work when I'm back. |
Beta Was this translation helpful? Give feedback.
-
@marin-ma @rui-mo Happy New Year! PR #3599 adds support for null-aware semi join project with filter. Would you try it out to see whether it works for your use cases? |
Beta Was this translation helpful? Give feedback.
@marin-ma @rui-mo Happy New Year! PR #3599 adds support for null-aware semi join project with filter. Would you try it out to see whether it works for your use cases?