Add support for sum(decimal) Spark aggregate function #5372
Conversation
@majetideepak Could you help review this PR?
@majetideepak Please take a look.
@majetideepak Deepak, would you help review this PR?
Hi @mbasmanova, I think it would be better for Spark's decimal sum and decimal avg to inherit DecimalAggregate. When this PR was proposed, AverageAggregateBase had not been moved to …
> Currently, Gluten only supports non-ANSI mode, so we need to return null on overflow.

Can we rephrase this part a bit to describe Spark's behavior more precisely? ANSI is a Spark config, and this PR only supports ANSI OFF mode. Suggested wording:

> Currently, only ANSI OFF mode is supported in this PR, which means null is returned on overflow.
As quoted from Spark:

> For decimal type, the initial value of `sum` is 0. We need to keep `sum` unchanged if the input is null, as the SUM function ignores null input. The `sum` can only be null if overflow happens under non-ansi mode.
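To make the quoted semantics concrete, here is a minimal C++ sketch of that update step. This is illustrative only, not the PR's code: the struct name `DecimalSumAccumulator`, the use of `__int128` as the unscaled decimal value, and the `kMaxUnscaled` bound are all assumptions.

```cpp
// Minimal sketch of the quoted Spark semantics (not Velox's actual API).
#include <optional>

using int128_t = __int128;

struct DecimalSumAccumulator {
  // Per the quote above: the initial value of sum is 0, not null.
  std::optional<int128_t> sum{0};

  // kMaxUnscaled stands for the largest unscaled value the result
  // precision can hold, e.g. 10^38 - 1 for DECIMAL(38, s).
  void add(const std::optional<int128_t>& input, int128_t kMaxUnscaled) {
    if (!input.has_value()) {
      return;  // SUM ignores null input: keep sum unchanged.
    }
    if (!sum.has_value()) {
      return;  // Sum already overflowed under non-ANSI mode; it stays null.
    }
    int128_t result;
    if (__builtin_add_overflow(*sum, *input, &result) ||
        result > kMaxUnscaled || result < -kMaxUnscaled) {
      sum = std::nullopt;  // Overflow: non-ANSI mode records it as null.
    } else {
      sum = result;
    }
  }
};
```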
Can we add a link to Spark's Sum implementation to the PR description? That would help readers understand this issue.
Also, we may need to document this behavior at https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/spark/aggregate.rst.
@kagamiori Thank you for your patience. I have addressed the latest comments.
LGTM. Thank you for adding this function!
@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@kagamiori Thank you very much. I will also reimplement Spark's decimal average based on the Simple Aggregate Function Interface.
Hi @kagamiori @mbasmanova. Did anything go wrong in the Facebook-internal tests?
No worries. We were just waiting for internal tests to pass. Landing it now.
@kagamiori merged this pull request in a38891a.
Hi @liujiayi771, I found that velox/functions/sparksql/aggregates/tests/SumAggregationTest.cpp throws errors under UBSAN in two unit tests, SumAggregationTest.hookLimits and SumAggregationTest.overflow. Could you help fix them? I attached the error log below.
@kagamiori OK. I will take a look at it later this week.
Thanks! |
Resolves #5226.

Spark SQL needs an `isEmpty` attribute in the decimal sum aggregate, so we need to implement a new decimal sum aggregate with `isEmpty` semantics.

Currently, Gluten only supports non-ANSI mode, so we need to return null on overflow. Spark detects overflow when `isEmpty` is false but `sum` is null: `isEmpty` cooperates with `sum` to express that overflow occurred in an intermediate result.
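As a rough illustration of that interplay, here is a sketch of the merge and final-extraction steps under assumed types; this is not the code in this PR, and the names `PartialSum`, `mergeInto`, and `extractFinal` are hypothetical.

```cpp
// Sketch of how isEmpty and a nullable sum together distinguish
// "no input" from "overflow" under non-ANSI mode.
#include <optional>

using int128_t = __int128;

struct PartialSum {
  std::optional<int128_t> sum{0};  // Null only after an overflow.
  bool isEmpty{true};              // True until a non-null input is seen.
};

// Merging intermediate results: overflow in either input, or in the
// addition itself, propagates as a null sum.
void mergeInto(PartialSum& acc, const PartialSum& other) {
  acc.isEmpty = acc.isEmpty && other.isEmpty;
  if (!acc.sum.has_value() || !other.sum.has_value()) {
    acc.sum = std::nullopt;
    return;
  }
  int128_t result;
  if (__builtin_add_overflow(*acc.sum, *other.sum, &result)) {
    acc.sum = std::nullopt;
  } else {
    acc.sum = result;
  }
}

// Final extraction. Because sum starts at 0, a null sum with
// isEmpty == false can only mean overflow; non-ANSI Spark returns null
// here (ANSI mode would raise an error instead).
std::optional<int128_t> extractFinal(const PartialSum& acc) {
  if (acc.isEmpty) {
    return std::nullopt;  // No non-null input: SUM is null, not overflow.
  }
  return acc.sum;  // Null here means overflow; otherwise the sum value.
}
```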
Spark's implementation:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Sum.scala