Minor: Improve documentation for AggregateUDFImpl::accumulator and `AccumulatorArgs` #9920

alamb · 2024-04-03T11:21:25Z

Which issue does this PR close?

Follow on to #9874
part of #7013

Rationale for this change

While thinking through the implications of #9874 (https://github.com/apache/arrow-datafusion/pull/9874/files#r1549419111), I realized that `AccumulatorArgs` may be confusing.

Thus I wanted to add some docs, examples, and tests to better explain how to use it.

What changes are included in this PR?

Improved documentation

Are these changes tested?

Just docs

Are there any user-facing changes?

Better docs

jayzhan211 · 2024-04-03T11:39:13Z

datafusion/expr/src/function.rs

+
+    /// Return a not yet implemented error if IGNORE NULLs is true
+    pub fn check_ignore_nulls(&self, name: &str) -> Result<()> {
+        if self.ignore_nulls {


Should we check !self.ignore_nulls?

I think a check_xxx function should be something the user adds when they think they need to enable the option.
In this case, when the user chooses to enable ignore_nulls, they need to add the check. If ignore_nulls is false, it means they should fix their query to contain IGNORE NULLS.

I think it is confusing for the user to understand whether they need to check or not.
I think renaming it to disable_xxx would help, or we could change the logic and rename it to enable_xxx.

There is also a problem: if they forget to add check_ignore_nulls/check_order_by in the accumulator, they can still run the function successfully. This approach does not force the user to check their options, because DataFusion implements the checks, not the user.

To enforce that they specify the options for their functions, I think we can add the checking function to AggregateUDFImpl, so the user needs to set true/false for each option.

Either

    // We will check if the AccumulatorArgs meet the requirement or not.
    fn options() -> AccumulatorArgs {
        AccumulatorArgs {
            ignore_nulls: true / false,
            has_ordering: true / false,
        }
    }

or

    // The same as this one, but with each option separate
    fn support_ignore_nulls() -> bool
    fn support_ordering() -> bool

> Should we check !self.ignore_nulls?

I don't think we can do this, because ignore_nulls is false for the following queries:

    SELECT avg(x) FROM ...;
    SELECT avg(x) RESPECT NULLS FROM ...;

In other words, it is the default even when the user doesn't explicitly specify the handling
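As a minimal sketch of the point above (not the actual DataFusion code): the struct below is a hypothetical, stripped-down stand-in for `AccumulatorArgs`, `String` stands in for the real error type, and `check_inverted` is an illustrative name that appears nowhere in the PR. It assumes that a query without an explicit null-handling clause arrives with `ignore_nulls == false`.

```rust
/// Hypothetical stand-in for `AccumulatorArgs`; only the `ignore_nulls`
/// flag discussed here is modeled.
pub struct AccumulatorArgs {
    pub ignore_nulls: bool,
}

impl AccumulatorArgs {
    /// Mirrors the check in the diff above: error only when IGNORE NULLS
    /// was requested. `String` stands in for the real error type.
    pub fn check_ignore_nulls(&self, name: &str) -> Result<(), String> {
        if self.ignore_nulls {
            return Err(format!("IGNORE NULLS not implemented for {name}"));
        }
        Ok(())
    }

    /// The inverted check (illustrative name), shown only to make the
    /// problem visible: it rejects the default case.
    pub fn check_inverted(&self, name: &str) -> Result<(), String> {
        if !self.ignore_nulls {
            return Err(format!("{name} requires IGNORE NULLS"));
        }
        Ok(())
    }
}
```

With the default arguments (`ignore_nulls: false`), `check_ignore_nulls("avg")` succeeds while `check_inverted("avg")` fails, which is why checking `!self.ignore_nulls` would break plain `SELECT avg(x)` queries.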

> I think it is confusing for the user to understand whether they need to check or not. I think renaming it to disable_xxx would help, or we could change the logic and rename it to enable_xxx.

I don't quite understand this suggestion -- are you suggesting renaming check_ignore_nulls to disable_ignore_nulls?

> There is also a problem because if they forget to add check_ignore_nulls/check_order_by in the accumulator, they can still run the function successfully. This approach does not force the user to check their options because datafusion implements them, not the user.

That is a good point --- in fact it actually affects built in aggregates too today

    ❯ select count(*) from (values (1), (null), (2));
    +----------+
    | COUNT(*) |
    +----------+
    | 3        |
    +----------+
    1 row in set. Query took 0.039 seconds.

    ❯ select count(*) IGNORE NULLS from (values (1), (null), (2));
    +----------+
    | COUNT(*) |
    +----------+
    | 3        |
    +----------+
    1 row in set. Query took 0.001 seconds.

I think this is a separate issue, and not made worse by this PR -- I filed #9924 to track it. I suggest we work on improving it in a follow-on PR.
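The silent behavior shown in the output above can be sketched with a toy example. Everything here is a hypothetical stand-in rather than DataFusion code: `AccumulatorArgs` is reduced to a single flag, and `count_rows` is an illustrative row-counting function that never reads it.

```rust
/// Hypothetical, simplified stand-in for the arguments an accumulator
/// receives when it is created.
pub struct AccumulatorArgs {
    pub ignore_nulls: bool,
}

/// A toy row-counting aggregate. It never inspects `ignore_nulls`, so the
/// IGNORE NULLS request is accepted and then silently has no effect.
pub fn count_rows(values: &[Option<i64>], _args: &AccumulatorArgs) -> usize {
    values.len()
}
```

Because nothing forces the implementation to look at the flag, both settings produce the same count for the same input, and no error is raised.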

> > I think it is confusing for the user to understand whether they need to check or not. I think renaming it to disable_xxx would help, or we could change the logic and rename it to enable_xxx.

> I don't quite understand this suggestion -- are you suggesting renaming check_ignore_nulls to disable_ignore_nulls?

Yes, that is what I suggest, so we know exactly whether it is disabled or not. But I think the comment here also helps, so the rename is not necessary.

> I think this is a separate issue, and not made worse by this PR -- I filed #9924 to track it. I suggest we work on improving it in a follow-on PR.

But I think if we implement these for UDFImpl,

    fn support_ignore_nulls() -> bool
    fn support_ordering() -> bool

we probably don't need the check_ignore_nulls, because we can check it for them!

create_aggregate_expr is the earliest place where we know ignore_nulls and sort_exprs, so we can call fun.support_ignore_nulls() there to check for them, and do the same for ordering.

https://github.com/apache/arrow-datafusion/blob/daf182dc789230dbd9cf21ca2e975789213a5365/datafusion/physical-plan/src/udaf.rs#L38-L46
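A rough sketch of how such a planning-time check might look. This is not the DataFusion API: the trait `AggregateUdf`, the function `validate_options`, and the two example types are hypothetical, the `&self` receivers are an assumption (the signatures above omit them), and `String` stands in for the real error type.

```rust
/// Hypothetical trait; the real `AggregateUDFImpl` has many more methods.
pub trait AggregateUdf {
    fn name(&self) -> &str;

    /// Opt-in: implementations that honor IGNORE NULLS override this.
    fn support_ignore_nulls(&self) -> bool {
        false
    }

    /// Opt-in: implementations that honor ORDER BY override this.
    fn support_ordering(&self) -> bool {
        false
    }
}

/// Hypothetical check that a planner could run once it knows which
/// options the query requested.
pub fn validate_options(
    fun: &dyn AggregateUdf,
    ignore_nulls: bool,
    has_ordering: bool,
) -> Result<(), String> {
    if ignore_nulls && !fun.support_ignore_nulls() {
        return Err(format!("IGNORE NULLS is not supported by {}", fun.name()));
    }
    if has_ordering && !fun.support_ordering() {
        return Err(format!("ORDER BY is not supported by {}", fun.name()));
    }
    Ok(())
}

/// Example aggregate that keeps the defaults (supports neither option).
pub struct Avg;
impl AggregateUdf for Avg {
    fn name(&self) -> &str {
        "avg"
    }
}

/// Example aggregate that opts in to both options.
pub struct FirstValue;
impl AggregateUdf for FirstValue {
    fn name(&self) -> &str {
        "first_value"
    }
    fn support_ignore_nulls(&self) -> bool {
        true
    }
    fn support_ordering(&self) -> bool {
        true
    }
}
```

Defaulting both methods to `false` means a function that forgets to declare support gets an error instead of silently ignoring the option, which is the failure mode described earlier in this thread.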

This is a good idea and I think it should be done in #9924

I will update this PR to remove the check_* functions and only update the docs

alamb

I have removed the code changes / tests from this PR and it now has only examples

alamb · 2024-04-04T19:00:19Z

datafusion/core/tests/user_defined/user_defined_aggregates.rs

@@ -526,7 +526,6 @@ impl Accumulator for TimeSum {
        let arr = arr.as_primitive::<TimestampNanosecondType>();

        for v in arr.values().iter() {
-            println!("Adding {v}");


drive by cleanups

jayzhan211

👍

alamb · 2024-04-05T11:16:29Z

Thank you for the very insightful review @jayzhan211 🙏

alamb added 3 commits April 3, 2024 06:50

Minor: Improve documentation for AggregateUDFImpl::accumulator and `A…

67be22a

…ccumulatorArgs`

Add test and helper functions

315a9a4

Improve docs and examples

203c83f

github-actions bot added the logical-expr (Logical plan and expressions) and core (Core DataFusion crate) labels Apr 3, 2024

alamb mentioned this pull request Apr 3, 2024

Make FirstValue an UDAF, Change AggregateUDFImpl::accumulator signature, support ORDER BY for UDAFs #9874

Merged

jayzhan211 reviewed Apr 3, 2024

Fix CI

0f59565

alamb mentioned this pull request Apr 3, 2024

Some aggregates silently ignore IGNORE NULLS and ORDER BY on arguments #9924

Open

alamb added the documentation (Improvements or additions to documentation) label Apr 3, 2024

alamb added 3 commits April 4, 2024 10:44

Merge remote-tracking branch 'apache/main' into alamb/udaf_better_api

56199fc

Merge remote-tracking branch 'apache/main' into alamb/udaf_better_api

f3dc5db

Remove checks for ORDER BY and IGNORE NULLS

b3add2d

github-actions bot removed the documentation (Improvements or additions to documentation) label Apr 4, 2024

alamb changed the title ~~Minor: Improve documentation for AggregateUDFImpl::accumulator and AccumulatorArgs and examples~~ Minor: Improve documentation for AggregateUDFImpl::accumulator and AccumulatorArgs Apr 4, 2024

alamb commented Apr 4, 2024

alamb mentioned this pull request Apr 4, 2024

Add test for for AggregateUDFImpl with ORDER BY and IGNORE NULLS #9953

Closed

jayzhan211 approved these changes Apr 5, 2024

alamb merged commit 2dad904 into apache:main Apr 5, 2024
25 of 26 checks passed
