Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datafusion-functions-aggregate): add support for lists and other nested types in min and max #13991

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

rluvaton
Copy link
Contributor

@rluvaton rluvaton commented Jan 2, 2025

Which issue does this PR close?

Closes #13987.

Rationale for this change

You are now able to run min and max on lists

What changes are included in this PR?

Added GenericMinMaxAccumulator and GenericSlidingMinMaxAccumulator which calculate the min and max based on the arrow-row crate

Are these changes tested?

For the min and max yes, for the moving min max - no (fuzzy testing will help here a lot to cover the large range of tests)

Are there any user-facing changes?

Yes, users are now allowed to sort lists and every type that is supported by the arrow-row

@rluvaton rluvaton changed the title add support for lists in min feat(datafusion-functions-aggregate): add support for lists and other nested types in min and max Jan 2, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 2, 2025
@rluvaton rluvaton marked this pull request as ready for review January 2, 2025 20:35
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @rluvaton -- I reviewed this code and it looks quite nice. Thank you 👌 🏆

The only thing I think is missing is "end to end" query coverage. In addition to the empty input test, I think we need tests specifically for the sliding window min/max

Can you please add such tests as .slt tests -- it would be queries something like

SELECT min(list_col) OVER (ORDER BY other_col 1 ROW PRECEDING 1 ROW FOLLOWING) FROM t

where list_col had nulls in it

Bonus points for tests of:

  1. struct Min/Max

Let mek nw

acc_args.return_type,
)?))
} else {
Ok(Box::new(MaxAccumulator::try_new(acc_args.return_type)?))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern used in other accumulators is slightly different I think -- It would have a function on MaxAccumulator that explicitly declared the types it supported and then fallback to the GenericMaxAccumulator if not.

For example, something like

if MaxAccumulator::supports_type(&acc_args.return_type) {
   Ok(Box::new(MaxAccumulator::try_new(acc_args.return_type)?))
} else {
   Ok(Box::new(GenericMaxAccumulator::try_new(
                acc_args.return_type,
            )?))
}

This is a minor detail and not required, I just wanted to point it out

Comment on lines +239 to +241
Ok(Box::new(GenericMaxAccumulator::try_new(
acc_args.return_type,
)?))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wanted to make this less verbose you could probably do something like

Suggested change
Ok(Box::new(GenericMaxAccumulator::try_new(
acc_args.return_type,
)?))
GenericMaxAccumulator::try_new(
acc_args.return_type,
).map(Box::new)

}

fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
// TODO - should assert getting only one column?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to either add the assert or remove the TODO as part of this PR

@@ -108,11 +108,15 @@ SELECT * FROM data WHERE column2 is not distinct from null;
# Aggregates
###########

query error Internal error: Min/Max accumulator not implemented for type List
query ?
SELECT min(column1) FROM data;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add tests for empty input on min/max as well to test boundary conditions? Something like

SELECT MIN(column1) FROM data WHERE false

Which should return a null value?


// Create a null row to use for filtering out nulls from the input
let null_row = {
let null_array = ScalarValue::try_from(datatype)?.to_array_of_size(1)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let null_array = ScalarValue::try_from(datatype)?.to_array_of_size(1)?;
let null_array = new_null_array(datatype, 1);

}

#[test]
fn basic_i32_list() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


statement count 0
create table t1 (a int[]) as values ([1,2,3]), ([1,2,1]);

query ?
select min(a) from t1;
----
[1, 2, 1]

statement count 0
drop table t1;

You can create test like this, it is easier to maintain than this rust test

@alamb alamb marked this pull request as draft January 16, 2025 22:22
@alamb
Copy link
Contributor

alamb commented Jan 16, 2025

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

Specifically I think this PR just needs some more tests and it will be good to go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

min and max should support lists as well
3 participants