-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minor: Improve documentation for AggregateUDFImpl::accumulator and AccumulatorArgs
#9920
Conversation
datafusion/expr/src/function.rs
Outdated
|
||
/// Return a not yet implemented error if IGNORE NULLs is true | ||
pub fn check_ignore_nulls(&self, name: &str) -> Result<()> { | ||
if self.ignore_nulls { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check !self.ignore_nulls
?
I think checkXXX
should be added for the user if they think they need to enable it.
In this case, when the user chooses to enable ignore_nulls
, they need to add the check. If ignore_nulls
is false, it means they should fix their query to contains ignore nulls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is confusing for the user to understand whether they need to check or not.
I think rename it to disable_xxx
help or change the logic an rename it to enable_xxx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also a problem because if they forget to add check_ignore_nulls/check_order_by
in the accumulator, they can still run the function successfully. This approach does not force the user to check their options because datafusion implements them, not the user.
To enforce they specified the options for their functions, I think we can add the checking function in AggregateUDFImpl
. So the user needs to set the true/false for their options
Either
// We will check if the AccumulatorArgs meet the requirement or not.
fn options() -> AccumulatorArgs {
AccumulatorArgs {
ignore_nulls: true/ false,
has_ordering: true / false
}
}
or
// The same with this one, but separate each option
fn support_ignore_nulls() -> bool
fn support_ordering() -> bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check
!self.ignore_nulls
?
I don't think we can do this, because the ignore_nulls is true for the following queries
SELECT avg(x) FROM ...;
SELECT avg(x) RESPECT NULLS FROM ...;
In other words, it is the default even when the user doesn't explicitly specify the handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is confusing for the user to understand whether they need to check or not. I think rename it to
disable_xxx
help or change the logic an rename it toenable_xxx
I don't quite understand this suggestion -- are you suggesting rename check_ignore_nulls
to disable_ignore_nulls
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also a problem because if they forget to add
check_ignore_nulls/check_order_by
in the accumulator, they can still run the function successfully. This approach does not force the user to check their options because datafusion implements them, not the user.
That is a good point --- in fact it actually affects built in aggregates too today
❯ select count(*) from (values (1), (null), (2));
+----------+
| COUNT(*) |
+----------+
| 3 |
+----------+
1 row in set. Query took 0.039 seconds.
❯ select count(*) IGNORE NULLS from (values (1), (null), (2));
+----------+
| COUNT(*) |
+----------+
| 3 |
+----------+
1 row in set. Query took 0.001 seconds.
I think this is a sepate issue, and not made worse by this PR -- I filed #9924 to track. I suggest we work on improving it as a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is confusing for the user to understand whether they need to check or not. I think rename it to
disable_xxx
help or change the logic an rename it toenable_xxx
I don't quite understand this suggestion -- are you suggesting rename
check_ignore_nulls
todisable_ignore_nulls
?
yes, that is what I suggest, so we know exactly whether it is disabled or not. But I think the comment here also helps, rename is not neccessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a sepate issue, and not made worse by this PR -- I filed #9924 to track. I suggest we work on improving it as a follow on PR
But I think if we implement these for UDFImpl,
fn support_ignore_nulls() -> bool
fn support_ordering() -> bool
we probably don't need the check_ignore_nulls
, because we can check it for them!
create_aggregate_expr
is the earliest place we know ignore_nulls
and sort_exprs
, and we can call fun.support_ignore_nulls()
to check for them, so do ordering
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea and I think it should be done in #9924
I will update this PR to remove the check_*
functions and only update the docs
AccumulatorArgs
and examplesAccumulatorArgs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed the code changes / tests from this PR and it now has only examples
@@ -526,7 +526,6 @@ impl Accumulator for TimeSum { | |||
let arr = arr.as_primitive::<TimestampNanosecondType>(); | |||
|
|||
for v in arr.values().iter() { | |||
println!("Adding {v}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive by cleanups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thank you for the very insightful review @jayzhan211 🙏 |
Which issue does this PR close?
Follow on to #9874
part of #7013
Rationale for this change
In thinking through the implications of #9874 https://github.com/apache/arrow-datafusion/pull/9874/files#r1549419111 I think the AccumulatorArgs may be confusing
Thus I wanted to add some docs, examples, and tests to better explain how to use them
What changes are included in this PR?
Are these changes tested?
Just docs
Are there any user-facing changes?
Better docs