feat(udf): POC faster min max accumulator #12677
let input_array = &values[0];

for (i, &group_index) in group_indices.iter().enumerate() {
I think much of the filter logic is handled by accumulate_indices
datafusion/datafusion/functions-aggregate/src/count.rs
Lines 417 to 424 in f54712d
accumulate_indices(
    group_indices,
    values.logical_nulls().as_ref(),
    opt_filter,
    |group_index| {
        self.counts[group_index] += 1;
    },
);
You could likely avoid much of this repetition (and likely it would be faster)
It would also be nice to avoid the duplication between min/max by using generics. Here is how the primitive one does it (it passes in a comparison function):
https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/min_max.rs#L119
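As a rough illustration of the "pass in a comparison function" idea, here is a minimal standard-library-only sketch. It is not the code at the link above: `StringExtremes`, `update_batch`, and `should_replace` are hypothetical names, and plain slices stand in for Arrow arrays, null buffers, and filters.

```rust
/// Hypothetical per-group state: one optional string per group.
/// `None` means "no value seen yet", so an empty string is still a valid value.
struct StringExtremes {
    states: Vec<Option<String>>,
}

impl StringExtremes {
    fn new(num_groups: usize) -> Self {
        Self {
            states: vec![None; num_groups],
        }
    }

    /// Visit every (group, value) pair and overwrite the stored value
    /// whenever `should_replace(new, current)` returns true.
    /// Min and max differ only in the closure that is passed in.
    fn update_batch<F>(
        &mut self,
        group_indices: &[usize],
        values: &[Option<&str>],
        should_replace: F,
    ) where
        F: Fn(&str, &str) -> bool,
    {
        for (&group_index, value) in group_indices.iter().zip(values) {
            // Null inputs never change the state.
            let Some(new) = *value else { continue };
            let slot = &mut self.states[group_index];
            let replace = match slot.as_deref() {
                None => true,
                Some(current) => should_replace(new, current),
            };
            if replace {
                *slot = Some(new.to_string());
            }
        }
    }
}
```

With this shape, min would call `update_batch(.., |new, cur| new < cur)` and max would call it with `|new, cur| new > cur`. Using `Option<String>` rather than an empty-string sentinel is a choice made for the sketch only.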
Okay so I would probably want to do something like this?
accumulate_indices(group_indices, input_array.logical_nulls().as_ref(), opt_filter, |group_index| {
    let value = input_array.as_binary_view().value(i);
    let value_str = std::str::from_utf8(value).map_err(|e| {
        DataFusionError::Execution(format!(
            "could not build utf8 from binary view {}",
            e
        ))
    }).unwrap();
    if self.states[group_index].is_empty() {
        self.states[group_index] = value_str.to_string();
    } else {
        let curr_value_bytes = self.states[group_index].as_bytes();
        if value < curr_value_bytes {
            self.states[group_index] = value_str.parse().unwrap();
        }
    }
});
And then make this generic, i.e. I can pass a generic function in instead of:
if self.states[group_index].is_empty() {
    self.states[group_index] = value_str.to_string();
} else {
    let curr_value_bytes = self.states[group_index].as_bytes();
    if value < curr_value_bytes {
        self.states[group_index] = value_str.parse().unwrap();
    }
}
Afterwards I can likely use a const generic for deciding how to down-cast here with string array or string view?
let value = input_array.as_binary_view().value(i);
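To make the const-generic question concrete, here is a minimal sketch of selecting a downcast with a `const VIEW: bool` parameter. It uses only the standard library: `OffsetStrings`, `ViewStrings`, and `value_at` are hypothetical stand-ins, and `&dyn Any` plays the role that a type-erased Arrow array plays in the PR.

```rust
use std::any::Any;

/// Hypothetical stand-in for an offsets-based string array:
/// all bytes in one buffer, with `offsets[i]..offsets[i + 1]` marking value `i`.
struct OffsetStrings {
    data: Vec<u8>,
    offsets: Vec<usize>,
}

impl OffsetStrings {
    fn value(&self, i: usize) -> &[u8] {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Hypothetical stand-in for a view-based string array.
struct ViewStrings {
    values: Vec<Vec<u8>>,
}

impl ViewStrings {
    fn value(&self, i: usize) -> &[u8] {
        &self.values[i]
    }
}

/// `VIEW` is a compile-time constant, so each instantiation keeps only
/// one of the two branches after monomorphization.
fn value_at<const VIEW: bool>(array: &dyn Any, i: usize) -> &[u8] {
    if VIEW {
        array
            .downcast_ref::<ViewStrings>()
            .expect("expected ViewStrings")
            .value(i)
    } else {
        array
            .downcast_ref::<OffsetStrings>()
            .expect("expected OffsetStrings")
            .value(i)
    }
}
```

A caller would pick the variant once, for example `value_at::<true>(&view_array, i)`, rather than matching on the array type for every row.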
Thanks @devanbenz -- I think this is on the right track
What I suggest is:
- We start with StringArray (rather than StringViewArray)
- Run some benchmarks (clickbench) -- I can do this if it would help
All in all, thanks again
@alamb thanks for taking a peek. Will go ahead and implement this for
BTW here is an example of using a const generic: #12703
let value = input_array.as_binary_view().value(i);

let value_str = std::str::from_utf8(value).map_err(|e| {
    DataFusionError::Execution(format!(
This could be replaced with the exec_err macro
    return;
}

let value: &[u8] = if VIEW {
TODO: Optimize downcasts (remove them)
    input_array.as_binary::<i32>().value(i)
};

let value_str = std::str::from_utf8(value)
TODO: Optimize from_utf8 and remove this validation
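One possible way to drop the validation, sketched with the standard library only: Rust orders `str` values by their bytes, so comparing the raw `&[u8]` slices gives the same min/max result as comparing the validated strings. `update_min` is a hypothetical helper, not code from this PR.

```rust
/// Hypothetical helper: keep the smaller of `current` and `candidate`,
/// comparing raw bytes so no UTF-8 validation is needed on the hot path.
fn update_min(current: &mut Option<Vec<u8>>, candidate: &[u8]) {
    let replace = match current.as_deref() {
        // Nothing stored yet: the first value always wins.
        None => true,
        // Byte-wise lexicographic comparison, the same order `str` uses.
        Some(cur) => candidate < cur,
    };
    if replace {
        *current = Some(candidate.to_vec());
    }
}
```

If the input is already known to be valid UTF-8, the stored bytes can be turned back into a string once, when the final result is emitted, instead of once per row.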
let mut builder = BinaryViewBuilder::new();

for i in 0..values.len() {
    let value = input_array.as_binary_view().value(i);
TODO: Remove downcasts
Some other notes I had from our talk:
Should be a hoot. Thanks again
@devanbenz -- I am interested in pushing this one along. If I found some time, would you mind if I spent some time tweaking it?
Please feel free to, I've been spending some time this morning trying to get it working with just
This is certainly a task that has made me realize how much more I need to learn about Arrow semantics 🫡 😆 and just memory operations in general. I definitely take on more of a 'hacker' way of coding like just throwing things at the editor until they work in instances like this where I don't fully understand the underlying mechanisms. I have so much more to learn 😮
Why replace Vec with Vec? 🤔 Doesn't
I started hacking on this in #12792 -- not quite done yet but we are getting close
BTW I think #12792 has all the right code -- I need to write a few more tests, but it has the basic idea completed
Given #12792 has been merged, let's close this one for now. Thank you @devanbenz and everyone else for the help.
Which issue does this PR close?
Closes #6906
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?