-
Notifications
You must be signed in to change notification settings - Fork 444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[VL] Compatible lag function in spark-3.0 #7168
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on Github Issues? https://github.com/apache/incubator-gluten/issues Then could you also rename commit message and pull request title in the following format?
See also: |
Run Gluten Clickhouse CI |
@PHILO-HE Can you help review this PR? |
Run Gluten Clickhouse CI |
@@ -31,7 +31,7 @@ object WindowFunctionsBuilder { | |||
val substraitFunc = windowFunc match { | |||
// Handle lag with negative inputOffset, e.g., converts lag(c1, -1) to lead(c1, 1). | |||
// Spark uses `-inputOffset` as `offset` for Lag function. | |||
case lag: Lag if lag.offset.eval(EmptyRow).asInstanceOf[Int] > 0 => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kecookier, seems offset > 0 implies inputOffset < 0, so the original handling is expected, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PHILO-HE This PR addresses the lag result mismatch issue in my current version of Spark. Thank you for your input. I have reviewed the Spark code, and now I understand the differences.
In Spark 3.0, the LAG function calculates the bound using both the offset and the direction. However, in versions post Spark 3.1, the function does not consider the direction. Instead, it uses a single literal expression that includes the offset of the current row to calculate the bound. For example, in lag(), the offset will be wrapped with UnaryMinus(offset).
lag in spark 3.0 https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L474
lag in spark 3.2 https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L538
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kecookier, I see. Thanks for your clarification! Seems there is no way to let gluten adapt to the two implementations. Should you fix it in a forked gluten on your side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will close this PR.
I will make it compatible with Spark 3.0 in the internal repo. |
What changes were proposed in this pull request?
This PR addresses the lag result mismatch issue in my current version of Spark. Thank you for your input. I have reviewed the Spark code, and now I understand the differences.
In Spark 3.0, the LAG function calculates the bound using both the offset and the direction. However, in versions post Spark 3.1, the function does not consider the direction. Instead, it uses a single literal expression that includes the offset of the current row to calculate the bound. For example, in lag(), the offset will be wrapped with UnaryMinus(offset).
lag in spark 3.0 https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L474
lag in spark 3.2 https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L538
How was this patch tested?
Exist UT.