Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: Improve documentation for AggregateUDFImpl::accumulator and AccumulatorArgs #9920

Merged
merged 7 commits into from
Apr 5, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Apr 3, 2024

Which issue does this PR close?

Follow on to #9874
part of #7013

Rationale for this change

In thinking through the implications of #9874 https://github.com/apache/arrow-datafusion/pull/9874/files#r1549419111 I think the AccumulatorArgs may be confusing

Thus I wanted to add some docs, examples, and tests to better explain how to use them

What changes are included in this PR?

  1. Improved documentation

Are these changes tested?

Just docs

Are there any user-facing changes?

Better docs


/// Return a not yet implemented error if IGNORE NULLs is true
pub fn check_ignore_nulls(&self, name: &str) -> Result<()> {
if self.ignore_nulls {
Copy link
Contributor

@jayzhan211 jayzhan211 Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check !self.ignore_nulls?

I think checkXXX should be added for the user if they think they need to enable it.
In this case, when the user chooses to enable ignore_nulls, they need to add the check. If ignore_nulls is false, it means they should fix their query to contains ignore nulls

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is confusing for the user to understand whether they need to check or not.
I think rename it to disable_xxx help or change the logic an rename it to enable_xxx

Copy link
Contributor

@jayzhan211 jayzhan211 Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also a problem because if they forget to add check_ignore_nulls/check_order_by in the accumulator, they can still run the function successfully. This approach does not force the user to check their options because datafusion implements them, not the user.

To enforce they specified the options for their functions, I think we can add the checking function in AggregateUDFImpl. So the user needs to set the true/false for their options

Either

// We will check if the AccumulatorArgs meet the requirement or not.
fn options() -> AccumulatorArgs {
   AccumulatorArgs {
     ignore_nulls: true/ false,
     has_ordering: true / false
    }
}

or

// The same with this one, but separate each option
fn support_ignore_nulls() -> bool
fn support_ordering() -> bool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check !self.ignore_nulls?

I don't think we can do this, because the ignore_nulls is true for the following queries

SELECT avg(x) FROM ...;
SELECT avg(x) RESPECT NULLS FROM ...;

In other words, it is the default even when the user doesn't explicitly specify the handling

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is confusing for the user to understand whether they need to check or not. I think rename it to disable_xxx help or change the logic an rename it to enable_xxx

I don't quite understand this suggestion -- are you suggesting rename check_ignore_nulls to disable_ignore_nulls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also a problem because if they forget to add check_ignore_nulls/check_order_by in the accumulator, they can still run the function successfully. This approach does not force the user to check their options because datafusion implements them, not the user.

That is a good point --- in fact it actually affects built in aggregates too today

select count(*) from (values (1), (null), (2));
+----------+
| COUNT(*) |
+----------+
| 3        |
+----------+
1 row in set. Query took 0.039 seconds.

❯ select count(*) IGNORE NULLS from (values (1), (null), (2));
+----------+
| COUNT(*) |
+----------+
| 3        |
+----------+
1 row in set. Query took 0.001 seconds.

I think this is a sepate issue, and not made worse by this PR -- I filed #9924 to track. I suggest we work on improving it as a follow on PR

Copy link
Contributor

@jayzhan211 jayzhan211 Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is confusing for the user to understand whether they need to check or not. I think rename it to disable_xxx help or change the logic an rename it to enable_xxx

I don't quite understand this suggestion -- are you suggesting rename check_ignore_nulls to disable_ignore_nulls?

yes, that is what I suggest, so we know exactly whether it is disabled or not. But I think the comment here also helps, rename is not neccessary

Copy link
Contributor

@jayzhan211 jayzhan211 Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a sepate issue, and not made worse by this PR -- I filed #9924 to track. I suggest we work on improving it as a follow on PR

But I think if we implement these for UDFImpl,

fn support_ignore_nulls() -> bool
fn support_ordering() -> bool

we probably don't need the check_ignore_nulls, because we can check it for them!

create_aggregate_expr is the earliest place we know ignore_nulls and sort_exprs, and we can call fun.support_ignore_nulls() to check for them, so do ordering.

https://github.com/apache/arrow-datafusion/blob/daf182dc789230dbd9cf21ca2e975789213a5365/datafusion/physical-plan/src/udaf.rs#L38-L46

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea and I think it should be done in #9924

I will update this PR to remove the check_* functions and only update the docs

@alamb alamb added the documentation Improvements or additions to documentation label Apr 3, 2024
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Apr 4, 2024
@alamb alamb changed the title Minor: Improve documentation for AggregateUDFImpl::accumulator and AccumulatorArgs and examples Minor: Improve documentation for AggregateUDFImpl::accumulator and AccumulatorArgs Apr 4, 2024
Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the code changes / tests from this PR and it now has only examples

@@ -526,7 +526,6 @@ impl Accumulator for TimeSum {
let arr = arr.as_primitive::<TimestampNanosecondType>();

for v in arr.values().iter() {
println!("Adding {v}");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive by cleanups

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor Author

alamb commented Apr 5, 2024

Thank you for the very insightful review @jayzhan211 🙏

@alamb alamb merged commit 2dad904 into apache:main Apr 5, 2024
25 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants