-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement fast min/max accumulator for binary / strings (now it uses the slower path) #6906
Comments
One observation here is that Maybe there is some binary usecase where it is important (e.g. embeddings 🤔 ) |
@alamb is there anyone working on this + is this issue still relevant? I would love to tackle it as it seems like an interesting feature/optimization. |
I dont know of anyone working on this @devanbenz -- but I also don't know of any benchmarks or actual queries that use |
It actually turns out that datafusion/benchmarks/queries/clickbench/queries.sql Lines 22 to 23 in a08f923
I don't think However, while working on #12092 it turns out that |
Background(I will make a PR shortly to add this to the actual datafusion docs)
Today, String / Binary min/max values are implemented using
|
Potential DesignOne high level idea is to build a data structure that uses the same internal format (views/buffers) as
In this design, the current value for each group is stored in two parts (as described on arrow docs)
As new batches are updated, each Benefits of design:
I believe (though we will have to verify it) that the conversion from MutableStringViewBuilder to just StringArray (not StringViewArray) should also be better than the current Potential challengesI think the trickiest part of this code, other the low level code optimizations is that as min/max values are replaced, data in the variable length buffer will become "garbage" (not reachable) thus consuming more memory than necessary:
I think this means the code will need something Random ThoughtsThoughts: maybe this structure (MutableStringViewBuilder??) could be upstreamed eventually |
@devanbenz I think this would be a fun and interesting project as well as valuable to DataFusion. However, I also think it is pretty advanced -- I would enjoy helping with it, but also maybe @Rachelint or @jayzhan211 are interested in helping out 🤔 |
Great! I'll get started on this later in the week/over the weekend :) Will likely bug folks as I require help haha 😆 |
take |
In terms of implementation, what I suggest is:
For the POC here is the reproducer I recommend: Step 1. Get
|
Seems really interesting, I am reading the related disscusions. |
BTW in case anyone is interested, I recorded a short video on how to make these flamegraphs: https://youtu.be/2z11xtYw_xs I will add a link to that in the docs later |
The challenge of
|
I think the overhead is actually mostly that there is an additional (small) allocation for each
I suppose we could potentially update the values as long as the new strings were shorter 🤔
|
@Rachelint your implicit idea of using It would at least avoid calling |
🤔 Can we still use the string view like approach to store states #6906 (comment), but for the uninlined state, we use a single For example, if all states are uninlined(len > 12), the output
I am not familiar enough with |
Yes... At least it will be better than now, even we just use |
I think using a single |
Ok, for And when converting it to For the short strings(<=12), it can avoid allocating |
Seems like a reasonable place to start in my opinion. If we want to get more sophisticated at a later time we can try something more exotic. I suspect there will be times when individual allocations will be faster and times when a buffer will be faster and there will be memory consumption tradeoffs as well. TLDR we should implement something and as long as it is better than what is currently going on that will be good. |
Cool, I'm assuming this change didn't impact the
|
BTW I was thinking more about this issue -- while a native Min/Max for strings / stringview will help, I have an idea that might make it simply it an added bonus: pushing the cast down into the parquet reader #12509 (comment) |
This would only work for parquet correct? It would effectively be a part of the TableScan within the logical plan? |
So this would effectively be using a |
I have had more time to take a look at this and sort of just wrap my head around how datafusion/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator.rs Line 475 in a35d007
slice_and_maybe_filter I'm under the assumption that this is happening due to the difference between the BinaryView and the Binary Scalar values. Taking a look at https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/ I understand that BinaryView is effectively a non-contiguous structure where as Binary is contiguous so it is easily sliced. So the idea here is to effectively change the underlying state in which the accumulator structure receives that data thus making it either easier to call slice or remove the calling of slice entirely?
|
I think what @alamb means is that just simply using And after improving it to be not worse than |
I guess the current goal is to remove the calling of slice, and get an at least not worse performance than |
Yes indeed. Thank you
💯 |
Awesome thank you @Rachelint. I've begun working on a small POC locally with this information. It took me a bit of reading up on context to get going for sure. After reading through https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html and https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html, tracing through the code locally, and reading through the discussion a few times I now have a better understanding of where to start, the expected outcome and a path to get there. |
Nice! If you want to put up a POC / draft PR I would be more than happy to help work on it with you |
Sounds good I pushed up what I have so far as a draft. There are a few things I would like to do including making the accumulator more generic so that I use a single method for min or max and pass an aggregate fn similar to how other UDFs do it. |
I will check out #12677 -- I hoped to do it today but I appear to be almost out of time |
I think we can store (len, prefix, index) for string/binary view cases. We then compare the value based on len + inline or prefix first, if they are the same we look up the actual value with index. In this way, we don't have the garbage collect issue, since we store the index instead of the whole string. |
I am a bit confused about where we store actual string values? We still store it using I think what @Dandandan memtioned in #6800 (comment) seems to be performance best impl? |
I think the garbage collection issue happens when the current "min" for a group is updated to a new value. We have to copy the new min value into the buffer and adjust the offset, but the old min value is still in the string view buffer |
My approach only works within batch not across batch. The garbage collection issue here is that whenever we got a better "min", we need to append the bytes and the old bytes is now "garbage". But given a |
this makes a lot of sense (and is similar to what the current non groups accumulator does -- finds the min per batch and then updates the current overall min from that, rather than updating the min for each input row). |
I started hacking on this today: #12792. I am fighting with the generics, but I think I am pretty close. |
Ok, I think #12792 is ready to review. It has a nice 10% improvement on Q28 and avoids the slowdown when string view is enabled |
Is your feature request related to a problem or challenge?
#6904 introduces some fancy new hashing and ways to implement aggregates
See related blog post: https://datafusion.apache.org/blog/2023/08/05/datafusion_fast_grouping/
min/max for strings (
StringArray
/LargeStringArray
, etc) now uses the slowerAccumulator
implementation which could be made much fasterDescribe the solution you'd like
I would like to implement a fast
GroupsAccumulator
for Min/MaxDescribe alternatives you've considered
here is one potential way to implement it:
We could store the current minimum for all groups in the same Rows 🤔 and track an index into that Rows for the current minimum for each group.
This would require an extra copy of the input values, but it could probably be vectorized pretty well, as shown in the following diagram.
Sorry what I meant was something like the following where the accumulator only stored the current minimum values.
This approach would potentially end up with
min_storage
being full of "garbage" if many batches had new minumums, but I think we could heuristically "compact"min_storage
(if it had2*num_groups
, for example) if it got too largeSee #6800 (comment) for more details
Additional context
No response
The text was updated successfully, but these errors were encountered: