-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-50838][SQL]Performs additional checks inside recursive CTEs to throw an error if forbidden case is encountered #49518
Conversation
…ing of recursive CTEs
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for working on this feature, it will be very useful!
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
Outdated
Show resolved
Hide resolved
…es inlining of recursive CTEs" This reverts commit 23f65d5.
…lationRef and hasItsOwnUnionLoopRef
The whole checkRecursion logic is now rewritten and placed in ResolveWithCTE as discussed offline. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveWithCTE.scala
Outdated
Show resolved
Hide resolved
…tes checks now to be applied to unionLoop instead of cteDef node
…er recursive cteDef
As discussed offline,
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the approach generally LGTM :) we would need thorough test coverage.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveWithCTE.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveWithCTE.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveWithCTE.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveWithCTE.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/cteOperators.scala
Outdated
Show resolved
Hide resolved
// Also, if recursion is allowed, we should check that there is no self-reference within | ||
// subqueries inside the CTE definition. | ||
if (allowRecursion) { | ||
checkForSelfReferenceInSubquery(name, relation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we should do the check in CheckAnalysis
because refs can be lazily resolved due to PlanWithUnresolvedIdentifier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole offline discussion we had on this is explained in the comment below.
…ursive refs inside subqueries as well. Remove check for subquery in CTESubstitution.
As discussed offline about subqueries:
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
Outdated
Show resolved
Hide resolved
Check about self references in subqueries moved back to checkAnalysis and performed in different way, in order not to have to check it reapeatedly in ResolveWithCTE (fixedPoint rule, in contrast checkAnalysis performed only once) |
case unionLoop: UnionLoop => | ||
// Recursive CTEs have already substituted Union to UnionLoop at this stage. | ||
// Here we perform additional checks for them. | ||
checkIfSelfReferenceIsPlacedCorrectly(unionLoop) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's match CTERelationDef
here and call checkForSelfReferenceInSubquery
. The function should use the cte id to find recursive references in subquery expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there we come to the problem that we discussed. Union
won't be substituted to UnionLoop
in case we have a self-reference in subquery.
Should I just leave it in ResolveWithCTE
? Because anyway, the part of ResolveWithCTE
I placed it will always be executed at most once.
messageParameters = Map.empty) | ||
case other => | ||
} | ||
unionLoop.foreach { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's not use .foreach
here which causes O(n^2) time complexity. When matching certain nodes we will call unionLoopRefNotAllowedUnderCurrentNode
which also traverse the tree.
how about this:
def checkIfSelfReferenceIsPlacedCorrectly(plan: LogicalPlan, allowRecursiveRef: Boolean = true) {
case Join(left, right, LeftOuter, _, _) =>
checkIfSelfReferenceIsPlacedCorrectly(left, allowRecursiveRef)
checkIfSelfReferenceIsPlacedCorrectly(right, allowRecursiveRef = false)
...
case _: UnionLoopRef if !allowRecursiveRef => fail ...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes completely sense. It should be fixed now. Thanks a lot
thanks, merging to master/4.0! |
…o throw an error if forbidden case is encountered ### What changes were proposed in this pull request? Performs additional checks inside recursive CTEs to throw an error if forbidden case is encountered: 1. Recursive term can contain one recursive reference only. 2. Recursive reference can't be used in some kinds of joins and aggregations. 3. Recursive references are not allowed in subqueries In addition, the name of `recursive` function inside `CTERelationDef` is rewritten to `hasRecursiveCTERelationRef` and adds `hasItsOwnUnionLoopRef` function as it is also needed to check if cteDef is recursive after substitution. A small bug in `CTESubstitution` is fixed which now enables substitution of self-references within subqueries as well (but not its resolution, as they are not allowed). ### Why are the changes needed? Support for the recursive CTE. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? will be tested in #49571 ### Was this patch authored or co-authored using generative AI tooling? No. Closes #49518 from milanisvet/checkRecursion. Authored-by: Milan Cupac <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 4021d91) Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Performs additional checks inside recursive CTEs to throw an error if forbidden case is encountered:
In addition, the name of
recursive
function insideCTERelationDef
is rewritten tohasRecursiveCTERelationRef
and addshasItsOwnUnionLoopRef
function as it is also needed to check if cteDef is recursive after substitution.A small bug in
CTESubstitution
is fixed which now enables substitution of self-references within subqueries as well (but not its resolution, as they are not allowed).Why are the changes needed?
Support for the recursive CTE.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
will be tested in #49571
Was this patch authored or co-authored using generative AI tooling?
No.