-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(datafusion-functions-aggregate): add support for lists and other nested types in min
and max
#13991
base: main
Are you sure you want to change the base?
Conversation
min
and max
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rluvaton -- I reviewed this code and it looks quite nice. Thank you 👌 🏆
The only thing I think is missing is "end to end" query coverage. In addition to the empty input test, I think we need tests specifically for the sliding window min/max
Can you please add such tests as .slt
tests -- it would be queries something like
SELECT min(list_col) OVER (ORDER BY other_col 1 ROW PRECEDING 1 ROW FOLLOWING) FROM t
where list_col had nulls in it
Bonus points for tests of:
struct
Min/Max
Let mek nw
acc_args.return_type, | ||
)?)) | ||
} else { | ||
Ok(Box::new(MaxAccumulator::try_new(acc_args.return_type)?)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern used in other accumulators is slightly different I think -- It would have a function on MaxAccumulator
that explicitly declared the types it supported and then fallback to the GenericMaxAccumulator
if not.
For example, something like
if MaxAccumulator::supports_type(&acc_args.return_type) {
Ok(Box::new(MaxAccumulator::try_new(acc_args.return_type)?))
} else {
Ok(Box::new(GenericMaxAccumulator::try_new(
acc_args.return_type,
)?))
}
This is a minor detail and not required, I just wanted to point it out
Ok(Box::new(GenericMaxAccumulator::try_new( | ||
acc_args.return_type, | ||
)?)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wanted to make this less verbose you could probably do something like
Ok(Box::new(GenericMaxAccumulator::try_new( | |
acc_args.return_type, | |
)?)) | |
GenericMaxAccumulator::try_new( | |
acc_args.return_type, | |
).map(Box::new) |
} | ||
|
||
fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> { | ||
// TODO - should assert getting only one column? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to either add the assert or remove the TODO as part of this PR
@@ -108,11 +108,15 @@ SELECT * FROM data WHERE column2 is not distinct from null; | |||
# Aggregates | |||
########### | |||
|
|||
query error Internal error: Min/Max accumulator not implemented for type List | |||
query ? | |||
SELECT min(column1) FROM data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add tests for empty input on min/max as well to test boundary conditions? Something like
SELECT MIN(column1) FROM data WHERE false
Which should return a null value?
|
||
// Create a null row to use for filtering out nulls from the input | ||
let null_row = { | ||
let null_array = ScalarValue::try_from(datatype)?.to_array_of_size(1)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let null_array = ScalarValue::try_from(datatype)?.to_array_of_size(1)?; | |
let null_array = new_null_array(datatype, 1); |
} | ||
|
||
#[test] | ||
fn basic_i32_list() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
statement count 0
create table t1 (a int[]) as values ([1,2,3]), ([1,2,1]);
query ?
select min(a) from t1;
----
[1, 2, 1]
statement count 0
drop table t1;
You can create test like this, it is easier to maintain than this rust test
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look Specifically I think this PR just needs some more tests and it will be good to go |
Which issue does this PR close?
Closes #13987.
Rationale for this change
You are now able to run min and max on lists
What changes are included in this PR?
Added
GenericMinMaxAccumulator
andGenericSlidingMinMaxAccumulator
which calculate the min and max based on thearrow-row
crateAre these changes tested?
For the min and max yes, for the moving min max - no (fuzzy testing will help here a lot to cover the large range of tests)
Are there any user-facing changes?
Yes, users are now allowed to sort lists and every type that is supported by the
arrow-row