+ + +

I am extremely excited to announce that Apache DataFusion is the +fastest engine for querying Apache Parquet files in ClickBench. It is faster +than DuckDB, chDB and Clickhouse using the same hardware. It also marks +the first time a Rust-based engine holds the top spot, which has previously +been held by traditional C/C++-based engines.

+ +

Apache DataFusion Logo

+ +

ClickBench performance for DataFusion 43.0.0

+ +

Figure 1: 2024-11-16 ClickBench Results for the ‘hot’1 run against the +partitioned 14 GB Parquet dataset (100 files, each ~140MB) on a c6a.4xlarge (16 +CPU / 32 GB RAM) VM. Measurements are relative (1.x) to results using +different hardware.

+ +

Best in class performance on Parquet is now available to anyone. DataFusion’s +open design lets you start quickly with a full featured Query Engine, including +SQL, data formats, catalogs, and more, and then customize any behavior you need. +I predict the continued emergence of new classes of data systems now that +creators can focus the bulk of their innovation on areas such as query +languages, system integrations, and data formats rather than trying to play +catchup with core engine performance.

+ +

ClickBench also includes results for proprietary storage formats, which require +costly load / export steps, making them useful in fewer use cases and thus much +less important than open formats (though the idea of use case specific formats +is interesting2).

+ +

This blog post highlights some of the techniques we used to achieve this +performance, and celebrates the teamwork involved.

+ +

A Strong History of Performance Improvements

+ +

Performance has long been a core focus for DataFusion’s community, and +speed attracts users and contributors. Recently, we seem to have been +even more focused on performance, including in July, 2024 when Mehmet Ozan +Kabak, CEO of Synnada, again suggested focusing on performance. This +got many of us excited (who doesn’t love a challenge!), and we have subsequently +rallied to steadily improve the performance release on release as shown in +Figure 2.

+ +

ClickBench performance results over time for DataFusion

+ +

Figure 2: ClickBench performance improved over 30% between DataFusion 34 +(released Dec. 2023) and DataFusion 43 (released Nov. 2024).

+ +

Like all good optimization efforts, ours took sustained effort as DataFusion ran +out of single 2x performance improvements several years ago. Working together our +community of engineers from around the world3 and all experience levels4 +pulled it off (check out this discussion to get a sense). It may be a “hobo +sandwich5, but it is a tasty one!

+ +

Of course, most of these techniques have been implemented and described before, +but until now they were only available in proprietary systems such as +Vertica, DataBricks +Photon, or +Snowflake or in tightly integrated open source +systems such as DuckDB or +ClickHouse which were not designed to be extended.

+ +

StringView

+ +

Performance improved for all queries when DataFusion switched to using Arrow +StringView. Using StringView “just” saves some copies and avoids one memory +access for certain comparisons. However, these copies and comparisons happen to +occur in many of the hottest loops during query processing, so optimizing them +resulted in measurable performance improvements.

+ +

Illustration of how take works with StringView

+ +

Figure 3: Figure from Using StringView / German Style Strings to Make +Queries Faster: Part 1 showing how StringView saves copying data in many cases.

+ +

Using StringView to make DataFusion faster for ClickBench required substantial +careful, low level optimization work described in Using StringView / German +Style Strings to Make Queries Faster: Part 1 and Part 2. However, it also +required extending the rest of DataFusion’s operations to support the new type. +You can get a sense of the magnitude of the work required by looking at the 100+ +pull requests linked to the epic in arrow-rs +(here) and three major epics +(here, +here and +here) in DataFusion.

+ +

Here is a partial list of people involved in the project (I am sorry to those whom I forgot)

+ + + +

Parquet

+ +

Part of the reason for DataFusion’s speed in ClickBench is reading Parquet files (really) quickly, +which reflects invested effort in the Parquet reading system (see Querying +Parquet with Millisecond Latency )

+ +

The DataFusion ParquetExec (built on the Rust Parquet Implementation) is now the most +sophisticated open source Parquet reader I know of. It has every optimization we +can think of for reading Parquet, including projection pushdown, predicate +pushdown (row group metadata, page index, and bloom filters), limit pushdown, +parallel reading, interleaved I/O, and late materialized filtering (coming soon ™️ +by default). Some recent work from June +recently unblocked a remaining hurdle for enabling late materialized +filtering, and conveniently Xiangpeng Hao is +working on the final piece (no pressure😅)

+ +

Skipping Partial Aggregation When It Doesn’t Help

+ +

Many ClickBench queries are aggregations that summarize millions of rows, a +common task for reporting and dashboarding. DataFusion uses state of the art +two phase aggregation plans. Normally, two phase aggregation works well as the +first phase consolidates many rows immediately after reading, while the data is +still in cache. However, for certain “high cardinality” aggregate queries (that +have large numbers of groups), the two phase aggregation strategy used in +DataFusion was inefficient, +manifesting in relatively slower performance compared to other engines for +ClickBench queries such as

+ +
SELECT "WatchID", "ClientIP", COUNT(*) AS c, ... 
+FROM hits 
+GROUP BY "WatchID", "ClientIP" /* <----- 13M Distinct Groups!!! */
+ORDER BY c DESC 
+LIMIT 10;
+
+ +

For such queries, the first aggregation phase does not significantly +reduce the number of rows, which wastes significant effort. Eduard +Karacharov contributed a dynamic strategy to +bypass the first phase when it is not working efficiently, shown in Figure 4.

+ +

Two phase aggregation diagram from DataFusion API docs annotated to show first phase not helping

+ +

Figure 4: Diagram from DataFusion API docs showing when the multi-phase +grouping is not effective

+ +

Optimized Multi-Column Grouping

+ +

Another method for improving analytic database performance is specialized (aka +highly optimized) versions of operations for different data types, which the +system picks at runtime based on the query. Like other systems, DataFusion has +specialized code for handling different types of group columns. For example, +there is special code that handles GROUP BY int_id and different special +code that handles GROUP BY string_id .

+ +

When a query groups by multiple columns, it is tricker to apply this technique. +For example GROUP BY string_id, int_id and GROUP BY int_id, string_id have +different optimal structures, but it is not possible to include specialized +versions for all possible combinations of group column types.

+ +

DataFusion includes a general Row based mechanism that works for any +combination of column types, but this general mechanism copies each value twice +as shown in Figure 5. The cost of this copy is especially high for variable +length strings and binary data.

+ +

Row based storage for multiple group columns

+ +

Figure 5: Prior to DataFusion 43.0.0, queries with multiple group columns +used Row based group storage and copied each group value twice. This copy +consumes a substantial amount of the query time for queries with many distinct +groups, such as several of the queries in ClickBench.

+ +

Many optimizations in Databases boil down to simply avoiding copies, and this +was no exception. The trick was to figure out how to avoid copies without +causing per-column comparison overhead to dominate or complexity to get out of +hand. In a great example of diligent and disciplined engineering, Jay +Zhan tried several, different approaches until arriving +at the one shipped in DataFusion 43.0.0, shown in Figure 6.

+ +

Column based storage for multiple group columns

+ +

Figure 6: DataFusion 43.0.0’s new columnar group storage copies each group +value exactly once, which is significantly faster when grouping by multiple +columns.

+ +

Huge thanks as well to Emil Ejbyfeldt and +Daniël Heres for their help reviewing and to +Rachelint (kamille) for reviewing and +contributing a faster vectorized append and compare for multiple groups which +will be released in DataFusion 44. The discussion on the ticket is another +great example of the power of the DataFusion community working together to build +great software.

+ +

What’s Next 🚀

+ +

Just as I expect the performance of other engines to improve, DataFusion has +several more performance improvements lined up itself:

+ +
    +
  1. Intermediate results blocked management (thanks again Rachelint (kamille)
  2. +
  3. Enable parquet filter pushdown by default
  4. +
+ +

We are also talking about what to focus on over the next three +months and are always +looking for people to help! If you want to geek out (obsess??) about performance +and other features with engineers from around the world, we would love you to +join us.

+ +

Additional Thanks

+ +

In addition to the people called out above, thanks:

+ +
    +
  1. Patrick McGleenon for running ClickBench and gathering this data (source).
  2. +
  3. Everyone I missed in the shoutouts – there are so many of you. We appreciate everyone.
  4. +
+ +

Conclusion

+ +

I have dreamed about DataFusion being on top of the ClickBench leaderboard for +several years. I often watched with envy improvements in systems backed by large +VC investments, internet companies, or world class research institutions, and +doubted that we could pull off something similar in an open source project with +always limited time.

+ +

The fact that we have now surpassed those other systems in query performance I +think speaks to the power and possibility of focusing on community and aligning +our collective enthusiasm and skills towards a common goal. Of course, being on +the top in any particular benchmark is likely fleeting as other engines will +improve, but so will DataFusion!

+ +

I love working on DataFusion – the people, the quality of the code, my +interactions and the results we have achieved together far surpass my +expectations as well as most of my other software development experiences. I +can’t wait to see what people will build next, and hope to see you +online.

+ +

Notes

+ +
+
    +
  1. +

    Note that DuckDB is slightly faster on the ‘cold’ run. 

    +
  2. +
  3. +

    Want to try your hand at a custom format for ClickBench fame / glory?: Make DataFusion the fastest engine in ClickBench with custom file format 

    +
  4. +
  5. +

    We have contributors from North America, South American, Europe, Asia, Africa and Australia 

    +
  6. +
  7. +

    Undergraduates, PhD, Junior engineers, and getting-kind-of-crotchety experienced engineers 

    +
  8. +
  9. +

    Thanks to Andy Pavlo, I love that nomenclature 

    +
  10. +
+
+ +