[Website] Aggregating Millions of Groups Fast in Apache Arrow DataFusion 28.0.0 #386

alamb · 2023-08-05T18:15:21Z

Note: This describes work that @tustvold, @Dandandan, and I did in DataFusion 28.0.0. This content was originally published on the InfluxData Blog, but since it is generally applicable to Apache Arrow DataFusion I would like to syndicate it here because:

This is a forum where the community can comment / keep it up to date via PR
It is hosted on a platform with a different lifetime than a company blog

This is the same model we followed with https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/, which was also republished on the Arrow blog after it appeared on the InfluxData blog.

It also gives me an example to use my original ASCII art diagrams :)

alamb · 2023-08-09T17:17:30Z

I plan to publish this early next week (2023-08-14 or so), so that anyone who is interested has had at least a week to review.

Here is the discussion on the mailing list: https://lists.apache.org/thread/4lyk9jycr0o6qv5zo5bsw2q9mvvdsp7z

Please let me know if anyone would like additional time to review.

yjshen · 2023-09-10T19:59:19Z

_posts/2023-08-05-datafusion_fast_grouping.md

+allocation using the arrow Row format
+```
+
+**Figure 5**: Hash group operator structure in DataFusion `28.0.0`. Group values are stored either directly in the hash table, or in a single allocation using the arrow Row format. The hash table contains group indexes. A single `GroupsAccumulator` stores the per-aggregate state for _all_ groups.


Aren't primitive group values also stored in a single allocation, a `Vec<T::Native>`, rather than directly in the hash table?

https://github.com/apache/arrow-datafusion/blob/63ccd4ab8b5852a7c7928b7d41209c57ef5e1af4/datafusion/core/src/physical_plan/aggregates/group_values/primitive.rs#L88-L89

This was a later modification - apache/datafusion#7043
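
For readers following this thread, here is a minimal sketch of the layout being discussed: group values held in one contiguous `Vec`, a hash table that resolves each input value to a group index, and a single accumulator holding the state for all groups. This is not DataFusion's code. The names `PrimitiveGroupValues`, `intern`, and `SumAccumulator` are hypothetical, the key type is fixed to `i64`, and `std::collections::HashMap` stands in for the real hash table (it also keeps a copy of each key, which the structure in Figure 5 does not need).

```rust
use std::collections::HashMap;

/// Hypothetical, simplified group storage for one primitive (`i64`) column.
struct PrimitiveGroupValues {
    /// Resolves a group value to its group index.
    /// Assumption: std HashMap is used for brevity; it stores the key as well.
    map: HashMap<i64, usize>,
    /// Every distinct group value in one allocation; position == group index.
    values: Vec<i64>,
}

impl PrimitiveGroupValues {
    fn new() -> Self {
        Self { map: HashMap::new(), values: Vec::new() }
    }

    /// Returns the group index for each input row, creating groups as needed.
    fn intern(&mut self, input: &[i64]) -> Vec<usize> {
        let mut indices = Vec::with_capacity(input.len());
        for &v in input {
            let next = self.values.len();
            let idx = *self.map.entry(v).or_insert(next);
            if idx == next {
                // First time this value is seen: append it to the value store.
                self.values.push(v);
            }
            indices.push(idx);
        }
        indices
    }
}

/// Hypothetical SUM accumulator: one struct holds the state of all groups.
struct SumAccumulator {
    /// sums[i] is the running sum for group index i.
    sums: Vec<i64>,
}

impl SumAccumulator {
    fn new() -> Self {
        Self { sums: Vec::new() }
    }

    /// Adds each input value to the state of the group it belongs to.
    fn update_batch(&mut self, values: &[i64], group_indices: &[usize], total_groups: usize) {
        // Make room for any groups created since the previous batch.
        self.sums.resize(total_groups, 0);
        for (&v, &g) in values.iter().zip(group_indices) {
            self.sums[g] += v;
        }
    }
}
```

Calling `intern` on a batch of keys and passing the resulting indices to `update_batch` along with the matching aggregate inputs yields one sum per entry in `values`, with no per-group heap allocation.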

alamb added 2 commits August 5, 2023 14:12

[Website] Aggregating Millions of Groups Fast in Apache Arrow DataFus…

ed35da6

…ion 28.0.0

Fix formatting, add charts

e39d614

alamb marked this pull request as ready for review August 5, 2023 18:45

alamb mentioned this pull request Aug 5, 2023

Write a blog post fast Vectorized grouping for high cardinality apache/datafusion#6988

Closed

order authors consistently

a865540

tustvold approved these changes Aug 8, 2023

Dandandan approved these changes Aug 9, 2023

Merge remote-tracking branch 'origin/main' into alamb/df_fast_grouping

8e16d83

alamb merged commit 9b0e78a into apache:main Aug 14, 2023

alamb deleted the alamb/df_fast_grouping branch August 14, 2023 10:36

yjshen reviewed Sep 10, 2023
