[SPARK-50838][SQL]Performs additional checks inside recursive CTEs to throw an error if forbidden case is encountered #49518

Closed

@milanisvet milanisvet commented Jan 15, 2025

What changes were proposed in this pull request?

Performs additional checks inside recursive CTEs and throws an error when a forbidden case is encountered:

  1. The recursive term can contain only one recursive reference.
  2. A recursive reference can't be used in some kinds of joins and aggregations.
  3. Recursive references are not allowed in subqueries.
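
To illustrate the first rule, here is a minimal sketch on a toy plan type. Every name in it (`Plan`, `Table`, `SelfRef`, `Filter`, `InnerJoin`, and this simplified `checkNumberOfSelfReferences`) is a hypothetical stand-in chosen for the example, not the Catalyst classes or the code in this PR.

```scala
object SelfReferenceCount {
  // Toy plan nodes; hypothetical stand-ins, not Spark's Catalyst classes.
  sealed trait Plan { def children: Seq[Plan] }
  case class Table(name: String) extends Plan { val children: Seq[Plan] = Nil }
  case class SelfRef(cteName: String) extends Plan { val children: Seq[Plan] = Nil }
  case class Filter(condition: String, child: Plan) extends Plan {
    val children: Seq[Plan] = Seq(child)
  }
  case class InnerJoin(left: Plan, right: Plan) extends Plan {
    val children: Seq[Plan] = Seq(left, right)
  }

  // Counts references to the CTE named `cteName` anywhere in the plan.
  def countSelfRefs(plan: Plan, cteName: String): Int = {
    val here = plan match {
      case SelfRef(`cteName`) => 1
      case _ => 0
    }
    here + plan.children.map(countSelfRefs(_, cteName)).sum
  }

  // Rule 1 (simplified): more than one self-reference in the recursive term is an error.
  // Left carries an illustrative message; Right means the term is accepted.
  def checkNumberOfSelfReferences(recursiveTerm: Plan, cteName: String): Either[String, Unit] = {
    val n = countSelfRefs(recursiveTerm, cteName)
    if (n > 1) Left(s"recursive term of '$cteName' has $n self-references, at most one is allowed")
    else Right(())
  }
}
```

With these toy types, `Filter("n < 10", SelfRef("t"))` is accepted, while a self-join such as `InnerJoin(SelfRef("t"), SelfRef("t"))` is rejected.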

In addition, the `recursive` function inside CTERelationDef is renamed to hasRecursiveCTERelationRef, and a hasItsOwnUnionLoopRef function is added, since we also need to check whether a cteDef is recursive after substitution.

A small bug in CTESubstitution is fixed, which now enables substitution of self-references within subqueries as well (but not their resolution, as they are not allowed).

Why are the changes needed?

Support for the recursive CTE.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Will be tested in #49571.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jan 15, 2025
@milanisvet milanisvet changed the title Add checkAnalysis and corresponding errors [WIP][SPARK-50838][SQL]Add checkRecursion to check if all the rules about recursive queries are fulfilled. Adjust optimizer with UnionLoop cases. Jan 16, 2025
@dtenedor dtenedor left a comment

Thanks again for working on this feature, it will be very useful!

@milanisvet milanisvet changed the title [WIP][SPARK-50838][SQL]Add checkRecursion to check if all the rules about recursive queries are fulfilled. Adjust optimizer with UnionLoop cases. [WIP][SPARK-50838][SQL]Performs additional checks inside recursive CTEs to throw an error if forbidden case is encountered Jan 20, 2025
@milanisvet
Contributor Author

The whole checkRecursion logic has been rewritten and moved into ResolveWithCTE, as discussed offline.
One note: I am still not sure whether we should keep the data type check, since an error is already thrown before this part of the code is reached when the data types of the anchor and the recursive term differ. I am also not sure which equality check between data types should be used.

@milanisvet
Contributor Author

As discussed offline, the definitions of checkIfSelfReferenceIsPlacedCorrectly and checkDataTypesAnchorAndRecursiveTerm stay in the ResolveWithCTE singleton but are invoked from checkAnalysis, so that they run only once during analysis rather than multiple times, as they would if we invoked them in ResolveWithCTE as well.

checkNumberOfSelfReferences is moved earlier, to the CTESubstitution stage, following the "fail early" approach.
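
The "once versus multiple times" point can be shown with a small sketch. It assumes a toy fixed-point executor (`runToFixedPoint`) and a toy string "plan"; both are hypothetical and only mimic how a rule in a fixed-point batch is re-applied until the plan stops changing.

```scala
object FixedPointDemo {
  // Applies `rule` until the value stops changing (or the iteration cap is hit).
  // Returns the final value and the number of times the rule was applied.
  def runToFixedPoint[A](start: A, maxIterations: Int = 100)(rule: A => A): (A, Int) = {
    var current = start
    var applications = 0
    var changed = true
    while (changed && applications < maxIterations) {
      val next = rule(current)
      applications += 1
      changed = next != current
      current = next
    }
    (current, applications)
  }

  // Returns (final plan, rule applications, check invocations).
  def demo(): (String, Int, Int) = {
    var checkCalls = 0
    // Toy rule: removes one "unresolved:" marker per application.
    val rule: String => String = { plan =>
      checkCalls += 1 // a check embedded in the rule runs on every application
      plan.stripPrefix("unresolved:")
    }
    val (resolved, applications) = runToFixedPoint("unresolved:unresolved:cte")(rule)
    (resolved, applications, checkCalls)
  }
}
```

In this toy run the rule, and therefore the embedded check, executes three times (two applications that change the plan plus one that confirms the fixed point), whereas a separate validation pass after the loop would run once.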

@dtenedor dtenedor left a comment

The approach generally LGTM :) We will need thorough test coverage.

// Also, if recursion is allowed, we should check that there is no self-reference within
// subqueries inside the CTE definition.
if (allowRecursion) {
  checkForSelfReferenceInSubquery(name, relation)
Contributor

Actually, we should do the check in CheckAnalysis, because references can be resolved lazily due to PlanWithUnresolvedIdentifier.

Contributor Author

The whole offline discussion we had on this is explained in the comment below.

@milanisvet
Contributor Author

As discussed offline about subqueries:

  1. The bug in CTESubstitution where self-references within subqueries were not substituted is now fixed.
  2. We agreed that we don't want to resolve self-references within subqueries at all; therefore, we cannot perform this check in checkAnalysis and do it in ResolveWithCTE instead.
  3. A more thorough explanation: a check in checkAnalysis would require the Union -> UnionLoop substitution, so that we still have the cteId even if the CTE is inlined. However, substituting only Union -> UnionLoop and not CTERelationRef -> UnionLoopRef would make the current ResolveWithCTE code behave in unintended ways. We don't want to perform the second substitution, since we agreed not to resolve self-references within subqueries.

The check for self-references in subqueries has been moved back to checkAnalysis and is now performed in a different way, so that it does not have to be repeated in ResolveWithCTE (a fixedPoint rule, whereas checkAnalysis runs only once).

case unionLoop: UnionLoop =>
  // Recursive CTEs have already substituted Union to UnionLoop at this stage.
  // Here we perform additional checks for them.
  checkIfSelfReferenceIsPlacedCorrectly(unionLoop)

Contributor

nit: let's match CTERelationDef here and call checkForSelfReferenceInSubquery. The function should use the cte id to find recursive references in subquery expressions.
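
A rough sketch of what "use the cte id to find recursive references in subquery expressions" could look like, written against toy plan and expression types. All names here (`Expr`, `ScalarSubquery`, `RecursiveRef`, `hasSelfReferenceInSubquery`, and so on) are hypothetical and simplified for illustration; they are not the Catalyst API or the function proposed in this PR.

```scala
object SubqueryCheck {
  // Toy expressions: a subquery expression wraps a whole plan.
  sealed trait Expr
  case class Literal(value: Int) extends Expr
  case class ScalarSubquery(plan: Plan) extends Expr

  // Toy plan nodes; only Filter carries an expression in this sketch.
  sealed trait Plan
  case class Table(name: String) extends Plan
  case class RecursiveRef(cteId: Long) extends Plan
  case class Filter(condition: Expr, child: Plan) extends Plan
  case class Union(left: Plan, right: Plan) extends Plan

  // True if `plan` references `cteId` anywhere, including inside nested subqueries.
  def containsRef(plan: Plan, cteId: Long): Boolean = plan match {
    case RecursiveRef(id)    => id == cteId
    case Filter(cond, child) => exprHasRef(cond, cteId) || containsRef(child, cteId)
    case Union(l, r)         => containsRef(l, cteId) || containsRef(r, cteId)
    case Table(_)            => false
  }

  // True if the expression is a subquery whose plan references `cteId`.
  def exprHasRef(expr: Expr, cteId: Long): Boolean = expr match {
    case ScalarSubquery(p) => containsRef(p, cteId)
    case Literal(_)        => false
  }

  // True only if a reference to `cteId` sits inside a subquery expression;
  // references in the main plan tree are ignored here.
  def hasSelfReferenceInSubquery(plan: Plan, cteId: Long): Boolean = plan match {
    case Filter(cond, child) =>
      exprHasRef(cond, cteId) || hasSelfReferenceInSubquery(child, cteId)
    case Union(l, r) =>
      hasSelfReferenceInSubquery(l, cteId) || hasSelfReferenceInSubquery(r, cteId)
    case _ => false
  }
}
```

Keying the search on the id rather than the name means a reference to a different CTE inside a subquery is not flagged.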

Contributor Author

But then we run into the problem we discussed: Union won't be substituted with UnionLoop when there is a self-reference in a subquery.
Should I just leave it in ResolveWithCTE? The part of ResolveWithCTE where I placed it will always be executed at most once anyway.

messageParameters = Map.empty)
case other =>
}
unionLoop.foreach {
Contributor

nit: let's not use .foreach here, as it causes O(n^2) time complexity: when matching certain nodes we call unionLoopRefNotAllowedUnderCurrentNode, which also traverses the tree.

How about this:

def checkIfSelfReferenceIsPlacedCorrectly(
    plan: LogicalPlan,
    allowRecursiveRef: Boolean = true): Unit = plan match {
  // The right side of a left outer join must not contain the recursive reference.
  case Join(left, right, LeftOuter, _, _) =>
    checkIfSelfReferenceIsPlacedCorrectly(left, allowRecursiveRef)
    checkIfSelfReferenceIsPlacedCorrectly(right, allowRecursiveRef = false)
  ...
  case _: UnionLoopRef if !allowRecursiveRef => fail ...
}
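
For reference, a self-contained version of this single-pass idea on toy types. Only the left-outer-join case comes from the snippet above; the other join and aggregate cases, and all type names (`Plan`, `RecursiveRef`, `Aggregate`, the `JoinType` objects), are assumptions made for the example and may not match the rules the PR actually enforces.

```scala
import scala.util.Try

object PlacementCheck {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object FullOuter extends JoinType

  // Toy plan nodes; hypothetical stand-ins for Catalyst's LogicalPlan nodes.
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class RecursiveRef(cteId: Long) extends Plan
  case class Project(child: Plan) extends Plan
  case class Aggregate(child: Plan) extends Plan
  case class Join(left: Plan, right: Plan, joinType: JoinType) extends Plan

  // Single top-down pass: the flag records whether a recursive reference is still
  // allowed below the current node, so no subtree is traversed more than once.
  def check(plan: Plan, allowRecursiveRef: Boolean = true): Unit = plan match {
    case Join(left, right, LeftOuter) => // from the review snippet
      check(left, allowRecursiveRef)
      check(right, allowRecursiveRef = false)
    case Join(left, right, RightOuter) => // assumed mirror case
      check(left, allowRecursiveRef = false)
      check(right, allowRecursiveRef)
    case Join(left, right, FullOuter) => // assumed: forbidden on both sides
      check(left, allowRecursiveRef = false)
      check(right, allowRecursiveRef = false)
    case Join(left, right, Inner) =>
      check(left, allowRecursiveRef)
      check(right, allowRecursiveRef)
    case Aggregate(child) => // assumed: forbidden under aggregation
      check(child, allowRecursiveRef = false)
    case Project(child) =>
      check(child, allowRecursiveRef)
    case RecursiveRef(id) if !allowRecursiveRef =>
      throw new IllegalStateException(s"recursive reference to CTE $id is not allowed here")
    case _ => ()
  }

  // Convenience wrapper that reports the outcome instead of throwing.
  def isValid(plan: Plan): Boolean = Try(check(plan)).isSuccess
}
```

Because the "allowed" state is passed down as an argument, each node is visited once, giving O(n) work instead of re-traversing subtrees from every matched node.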

Contributor Author

Makes complete sense. It should be fixed now. Thanks a lot!

@cloud-fan cloud-fan changed the title [WIP][SPARK-50838][SQL]Performs additional checks inside recursive CTEs to throw an error if forbidden case is encountered [SPARK-50838][SQL]Performs additional checks inside recursive CTEs to throw an error if forbidden case is encountered Jan 24, 2025
@cloud-fan
Contributor

thanks, merging to master/4.0!

@cloud-fan cloud-fan closed this in 4021d91 Jan 24, 2025
cloud-fan pushed a commit that referenced this pull request Jan 24, 2025
…o throw an error if forbidden case is encountered

Closes #49518 from milanisvet/checkRecursion.

Authored-by: Milan Cupac <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 4021d91)
Signed-off-by: Wenchen Fan <[email protected]>