[Website] Aggregating Millions of Groups Fast in Apache Arrow DataFusion 28.0.0 #386

Merged: 4 commits, Aug 14, 2023

alamb (Contributor) commented Aug 5, 2023

Closes apache/datafusion#6988

Note: This describes work @tustvold, @Dandandan, and I did in DataFusion 28.0.0. This content was originally published on the InfluxData Blog, but since it is generally applicable to Apache Arrow DataFusion, I would like to syndicate it here because:

  1. This is a forum where the community can comment / keep it up to date via PR
  2. It is hosted on a platform with a different lifetime than a company blog

This is the same model we followed with https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/ which was also republished on the Arrow blog after first appearing on the InfluxData blog.

It also gives me a chance to use my original ASCII art diagrams :)

alamb (Contributor, Author) commented Aug 9, 2023

I plan to publish this early next week (2023-08-14 or so), so that anyone who is interested has had at least a week to review.

Here is the discussion on the mailing list: https://lists.apache.org/thread/4lyk9jycr0o6qv5zo5bsw2q9mvvdsp7z

Please let me know if anyone would like additional time to review.

alamb merged commit 9b0e78a into apache:main Aug 14, 2023
alamb deleted the alamb/df_fast_grouping branch August 14, 2023 10:36

**Figure 5**: Hash group operator structure in DataFusion `28.0.0`. Group values are stored either directly in the hash table, or in a single allocation using the arrow Row format. The hash table contains group indexes. A single `GroupsAccumulator` stores the per-aggregate state for _all_ groups.
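
The layout in the caption can be illustrated with a minimal sketch. It is not DataFusion's implementation: `GroupValues`, `SumAccumulator`, and `sum_by_group` are hypothetical names, the keys are fixed to `i64`, and `std::collections::HashMap` stands in for the real hash table (so each key is duplicated in the map here, which the real design avoids). It only shows the idea that the table resolves a key to a group index, the group values live in one contiguous allocation, and a single accumulator holds the state for all groups in one `Vec`.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of the group-value store (not DataFusion's type).
struct GroupValues {
    /// Resolves a key to its group index. Simplification: a std HashMap
    /// keeps its own copy of the key, unlike a table that stores only indexes.
    map: HashMap<i64, usize>,
    /// One contiguous allocation of distinct group keys, ordered by group index.
    values: Vec<i64>,
}

impl GroupValues {
    fn new() -> Self {
        Self { map: HashMap::new(), values: Vec::new() }
    }

    /// Returns the group index of every input key, creating groups as needed.
    fn intern(&mut self, keys: &[i64]) -> Vec<usize> {
        let mut group_indexes = Vec::with_capacity(keys.len());
        for &key in keys {
            // A new group always receives the next unused index.
            let next = self.values.len();
            let idx = *self.map.entry(key).or_insert(next);
            if idx == next {
                self.values.push(key);
            }
            group_indexes.push(idx);
        }
        group_indexes
    }
}

/// Hypothetical accumulator: one Vec holds the SUM state of all groups.
struct SumAccumulator {
    sums: Vec<i64>,
}

impl SumAccumulator {
    /// Adds each value to the slot of its group, growing the state first.
    fn update_batch(&mut self, values: &[i64], group_indexes: &[usize], total_groups: usize) {
        self.sums.resize(total_groups, 0);
        for (&value, &group) in values.iter().zip(group_indexes) {
            self.sums[group] += value;
        }
    }
}

/// Computes SUM(values) GROUP BY keys for one batch, in first-seen key order.
fn sum_by_group(keys: &[i64], values: &[i64]) -> Vec<(i64, i64)> {
    let mut groups = GroupValues::new();
    let mut acc = SumAccumulator { sums: Vec::new() };
    let group_indexes = groups.intern(keys);
    acc.update_batch(values, &group_indexes, groups.values.len());
    groups.values.iter().copied().zip(acc.sums.iter().copied()).collect()
}
```

With keys `[1, 2, 1, 3, 2]` and values `[10, 20, 30, 40, 50]`, `sum_by_group` returns `[(1, 40), (2, 70), (3, 40)]`. The per-row work is one table lookup plus one indexed add into a flat `Vec`, with no per-group allocation.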
Review comment (Member):

Aren't primitive group values also stored in a single allocation, using `Vec<T::Native>`, rather than directly in the hash table?

https://github.com/apache/arrow-datafusion/blob/63ccd4ab8b5852a7c7928b7d41209c57ef5e1af4/datafusion/core/src/physical_plan/aggregates/group_values/primitive.rs#L88-L89

Review reply (Contributor):

This was a later modification - apache/datafusion#7043

Linked issue: Write a blog post fast Vectorized grouping for high cardinality