Improve performance for grouping by variable length columns (strings) #9403
Comments
String/binary prefix stored in place similar to |
Yes, indeed -- I think an approach similar to or maybe even using |
BTW I may play around with this approach as a fun side project if/when I have time. In general, my high level strategy would be to hack up If it did, then I would spend the time obsessing over / optimizing the corner cases |
BTW the DuckDB paper (which I have not yet read) seems to describe a very similar layout for variable length strings: https://duckdb.org/2024/03/29/external-aggregation |
I will take a look at this first 👀 |
I tried an approach that takes multiple columns into consideration, but found that the time spent on
self.map.insert_accounted(
    new_header,
    |header| header.hash,
    &mut self.map_size,
);
The hash entry includes vectors, like
struct Entry<O>
where
    O: OffsetSizeTrait,
{
    /// hash of the value (stored to avoid recomputing it in hash table check)
    hash: u64,
    /// each variable length column's offset or inline (short string)
    offset_or_inline: Vec<usize>,
    /// each variable length column's length. None for null.
    len: Vec<Option<O>>,
    group_id: usize,
}
It seems that we need to avoid adding
Ref: #10937
I will try converting the variable length column info (maybe the group values index) to ArrayRef, and convert them to Rows together 🤔 . |
#10976 is a cool approach. I have been thinking about this more, especially in the context of @jayzhan211's comment here: #10918 (comment). Here is one potential (intermediate) phase that might be worth exploring. Specifically, change the output type of the first phase of grouping to be StringView. The reason is that this might help performance but would keep the required changes localized to HashAggregateExec 🤔
|
I want to share some of my findings here. My approach is very similar to #10976, except that it will emit The performance is similar to (slightly slower than) first converting to row-group and then converting back. The flamegraph shows that building the
Update: I tried longer strings, e.g. "URL", and my approach is ~10% faster than the baseline |
One crazy idea I've been thinking about: if the string view is loaded from dictionary-encoded parquet, then the underlying buffers are unique (i.e., no duplicated values), so the view values are essentially the hash of the string: if two strings are the same, they share the same view value; if they are different, they must have different view values. I'm not sure how much faster we can get from this; my guess is that the performance gain may not justify the changes it requires. |
That is probably correct for arrays that share the same data page. But once the next page is emitted, I think the dictionary changes, so the dictionary values may be different and the views are no longer unique 🤔 |
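To make the two comments above concrete, here is a minimal sketch of a 16-byte string view. The layout follows the Arrow string view format (a 4-byte length, then either up to 12 inline bytes, or a 4-byte prefix, a 4-byte buffer index and a 4-byte offset). `make_view` is a hypothetical helper written for illustration and is not an arrow-rs API.

```rust
/// Hypothetical helper: packs a 16-byte view for `s`, assuming `s` is stored
/// at `offset` inside data buffer number `buffer_index`.
fn make_view(s: &str, buffer_index: u32, offset: u32) -> u128 {
    let data = s.as_bytes();
    let mut view = [0u8; 16];
    // Bytes 0..4: string length.
    view[..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= 12 {
        // Short strings are stored inline; the buffer location is unused.
        view[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Long strings: 4-byte prefix, then buffer index and offset.
        view[4..8].copy_from_slice(&data[..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(view)
}
```

Within one deduplicated buffer, two long strings are equal exactly when their views are equal, which is the shortcut the first comment describes. As soon as the same string lives in a second buffer (for example, after a new dictionary is loaded), `make_view(url, 0, 0)` and `make_view(url, 1, 0)` differ even though the strings are identical, which is the objection raised in the reply.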
FWIW I filed #11680 to track some ideas of reducing hash overhead |
I believe that @jayzhan211's work in #12269 effectively closes this item. There is clearly (always) more to make better, but I think this ticket is done now. Thanks again |
Is your feature request related to a problem or challenge?
As always I would like faster aggregation performance
Describe the solution you'd like
clickbench, Q17 and Q18 include
This is an Int64 and a string
In some profiling of Q19,
SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
I found that 20-30% of the time is spent going from Array --> Row or Row --> Array. Thus I think adding some special handling for variable length data vs fixed length data in the group management may help
Background
GroupValuesRows, used for the queries above, is here: https://github.com/apache/arrow-datafusion/blob/edec4189242ab07ac65967490537d77e776aad5c/datafusion/physical-plan/src/aggregates/group_values/row.rs#L32
Given a query like SELECT ... GROUP BY i1, i2, s1, where i1 and i2 are integer columns and s1 is a string column
For input that looks like this:
GroupValuesRows will do
One downside of this approach is that for "large" strings, a substantial amount of copying is required simply to check if the group is already present
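To illustrate the copying cost described above, here is a minimal sketch using only the standard library. `encode_row` is a hypothetical, simplified stand-in for a row encoding and does not reproduce the real Arrow row format. The sketch shows that every input row is fully encoded, string bytes included, before the hash table can tell whether the group already exists.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified row encoding: both integers as fixed-width bytes,
/// followed by the string length and the string bytes.
fn encode_row(i1: i64, i2: i64, s1: &str) -> Vec<u8> {
    let mut row = Vec::with_capacity(16 + 4 + s1.len());
    row.extend_from_slice(&i1.to_be_bytes());
    row.extend_from_slice(&i2.to_be_bytes());
    row.extend_from_slice(&(s1.len() as u32).to_be_bytes());
    // The whole string is copied here, even if the group is already present.
    row.extend_from_slice(s1.as_bytes());
    row
}

/// Assigns a group id to each input row. Returns the ids and the total number
/// of bytes copied while encoding rows.
fn group_ids_via_rows(input: &[(i64, i64, &str)]) -> (Vec<usize>, usize) {
    let mut groups: HashMap<Vec<u8>, usize> = HashMap::new();
    let mut ids = Vec::with_capacity(input.len());
    let mut bytes_copied = 0;
    for &(i1, i2, s1) in input {
        let row = encode_row(i1, i2, s1);
        bytes_copied += row.len();
        // New groups get the next id; existing groups reuse theirs.
        let next_id = groups.len();
        let id = *groups.entry(row).or_insert(next_id);
        ids.push(id);
    }
    (ids, bytes_copied)
}
```

In this sketch the bytes copied grow with the total input size rather than with the number of distinct groups, so long, frequently repeated strings are copied again on every lookup.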
Describe alternatives you've considered
The idea is to use a modified version of the group keys where the fixed length part still uses row format, but the variable length columns use an approach like in GroupValuesByes
Something like
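A minimal sketch of this idea, assuming two integer columns and one string column. `HybridGroups` and all of its fields are hypothetical names chosen for illustration; this is not the DataFusion implementation. The fixed length part is kept as a small row-like array, the string bytes of each distinct group are appended once to a shared buffer with offsets, and lookups compare against the stored bytes in place.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical group store for keys of the form (i1, i2, s1).
struct HybridGroups {
    fixed: Vec<[u8; 16]>,            // encoded (i1, i2) per group
    bytes: Vec<u8>,                  // distinct string values, back to back
    offsets: Vec<usize>,             // offsets[g]..offsets[g + 1] is group g's string
    table: HashMap<u64, Vec<usize>>, // hash -> candidate group ids
}

impl HybridGroups {
    fn new() -> Self {
        Self { fixed: Vec::new(), bytes: Vec::new(), offsets: vec![0], table: HashMap::new() }
    }

    /// Returns the stored string bytes of group `g` without copying.
    fn string_of(&self, g: usize) -> &[u8] {
        &self.bytes[self.offsets[g]..self.offsets[g + 1]]
    }

    /// Returns the group id for the key, creating a new group if needed.
    fn intern(&mut self, i1: i64, i2: i64, s1: &str) -> usize {
        let mut fixed = [0u8; 16];
        fixed[..8].copy_from_slice(&i1.to_be_bytes());
        fixed[8..].copy_from_slice(&i2.to_be_bytes());

        let mut hasher = DefaultHasher::new();
        fixed.hash(&mut hasher);
        s1.as_bytes().hash(&mut hasher);
        let hash = hasher.finish();

        // Existing group: compare in place, no string bytes are copied.
        if let Some(candidates) = self.table.get(&hash) {
            for &g in candidates {
                if self.fixed[g] == fixed && self.string_of(g) == s1.as_bytes() {
                    return g;
                }
            }
        }

        // New group: the only place where the string bytes are copied.
        let g = self.fixed.len();
        self.fixed.push(fixed);
        self.bytes.extend_from_slice(s1.as_bytes());
        self.offsets.push(self.bytes.len());
        self.table.entry(hash).or_default().push(g);
        g
    }
}
```

With this layout, string bytes are copied once per distinct group instead of once per input row. A production version would also need null handling, multiple variable length columns, and memory accounting, all of which are omitted here.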
Additional context
No response