Update ClickBench benchmarks with DataFusion `43.0.0` #13099

alamb · 2024-10-24T20:03:00Z

Is your feature request related to a problem or challenge?

Requires

Release DataFusion 43.0.0 #12470

Once DataFusion 43.0.0 is released, it would be great to update ClickBench (https://benchmark.clickhouse.com/) with runs from the latest version. It looks like we are still reporting numbers for DataFusion 40, and there have been significant improvements since then. For more details, see:

[DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821

Describe the solution you'd like

Perhaps we can follow the model of ClickHouse/ClickBench#210 (thanks @pmcgleenon )

We will also need to update the DataFusion scripts in ClickBench to apply the new binary_as_string option added by @goldmedal in #12816. TL;DR: we need to update the create table statements to include the OPTIONS ('binary_as_string' 'true') clause.

https://github.com/ClickHouse/ClickBench/blob/main/datafusion/create_partitioned.sql

CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'partitioned'
OPTIONS ('binary_as_string' 'true');

Note this is the same as the DuckDB runner, as explained in #12788

Describe alternatives you've considered

No response

Additional context

No response


pmcgleenon · 2024-11-12T20:53:05Z

Hi @alamb are we good to proceed with this now that 43.0 has been released (#13254)?

alamb · 2024-11-13T07:00:43Z

Hi @pmcgleenon

I just double-checked, and this should be good to go now: the ClickBench runner uses datafusion-cli rather than the Python bindings (which haven't yet released version 43.0.0).

Thank you so much

pmcgleenon · 2024-11-13T07:56:43Z

take

pmcgleenon · 2024-11-15T09:18:22Z

Hi @alamb here are the initial results.

I've created a composite html file with results for DataFusion releases 33, 34, 36, 40 and 43 so we can compare across the previous releases.

clickbench.html.zip

Single results

Partitioned results

Before running the tests, I updated the SQL to include OPTIONS ('binary_as_string' 'true'):
ClickHouse/ClickBench@30fa096

It looks like we've seen some significant gains across a number of queries. Are these results in line with your expectations?

If so, I can update ClickBench with the latest results.

alamb · 2024-11-15T10:01:09Z

> It looks like we've seen some significant gains across a number of queries. Are these results in line with your expectations?

Yes, indeed, @korowa @Rachelint @jayzhan211 @XiangpengHao and many others have been hard at work improving the speed. For an idea of what was done: #12821 (comment)

It is sad that we are not yet the fastest engine, but interestingly, it looks like many of the places where we are farthest from the top are the really fast queries (those that take 100s of ms).

Thank you again @pmcgleenon

Dandandan · 2024-11-15T10:14:25Z

Looks like we're the fastest for Parquet on c6a.4xlarge? The difference is 16 cores vs 192 cores (and not shared/virtualized).

alamb · 2024-11-15T10:37:15Z

YES! you are totally right @Dandandan

With some more finagling, I filtered for only c6a.metal 500gb gp2:

WOOHOO -- once this gets published it will be time to write a blog post !

Rachelint · 2024-11-15T10:50:20Z

After profiling, I believe we can still get a clear improvement after finishing #11943 and its related epic!

pmcgleenon · 2024-11-15T11:04:23Z

> It is sad that we are not yet the fastest engine, but interestingly, it looks like many of the places where we are farthest from the top are the really fast queries (those that take 100s of ms).

That's an interesting observation. Maybe we can look at benchmarking on c6a.metal at some point - it might highlight some more areas which could be improved and would show how DataFusion scales out to 192 vCPUs

pmcgleenon · 2024-11-15T11:55:20Z

I've created a PR on ClickBench to update the results ClickHouse/ClickBench#251

Congratulations everyone on another amazing project milestone!

alamb · 2024-11-15T11:58:11Z

> After profiling, I believe we can still get a clear improvement after finishing #11943 and its related epic!

Indeed. I also wanted to point out that the following item did not make it into 43.0.0 (it was merged shortly after we made the release candidate), so I expect the 44.0.0 release to be even better:

Support vectorized append and compare for multi group by #12996

pmcgleenon · 2024-11-15T12:24:16Z

The ClickBench PR has been merged, so the v43 results are available at https://benchmark.clickhouse.com/

alamb · 2024-11-15T13:05:22Z

Thank you so much @pmcgleenon ! I'll file some follow on tickets as well

link

Bam!

I am going to write up a blog post about this, as I think it is an excellent example of what it takes to make queries fast as well as a great example of teamwork (ticket for tracking the blog: #13436).

alamb · 2024-11-18T18:52:27Z

Thanks again @pmcgleenon ❤️

rluvaton · 2024-11-22T07:31:36Z

How does it compare to Polars in performance on this benchmark?

alamb · 2024-11-22T11:54:57Z

> How does it compare to Polars in performance on this benchmark?

I don't think Polars can run SQL so it can't run this benchmark

Dandandan · 2024-11-22T12:46:58Z

Polars is included in the benchmark (via its in-memory DataFrame API) and runs on the biggest machine.
It seems DataFusion is faster on quite a few queries even though it reads Parquet from disk and runs on a smaller machine.

Dandandan · 2024-11-22T12:49:36Z

See results

alamb · 2024-11-22T16:02:52Z

> See results

This is quite interesting. It seems like Polars may also compute min/max/count/sum during load as some queries take 0 time (but there is 275s of load time).

Still 🎣 this would be an excellent optimization for anyone else working on

[DISCUSSION] Challenge: Make DataFusion the fastest engine in ClickBench with custom file format #13448
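
To illustrate the kind of load-time precomputation speculated about above, here is a minimal sketch. It is not how Polars is known to work, and it is not DataFusion code; the `PrecomputedTable` class and its method names are hypothetical, chosen only for illustration. The idea is to pay for the aggregates once during load so that later whole-column min/max/count/sum queries become lookups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Aggregates computed once, while the column is being loaded."""
    count: int
    total: int
    minimum: int
    maximum: int


class PrecomputedTable:
    """Hypothetical in-memory table that trades load time for query time."""

    def __init__(self, columns: dict[str, list[int]]):
        self.columns = columns
        # The expensive pass over the data happens here, during "load".
        self.stats = {
            name: ColumnStats(len(values), sum(values), min(values), max(values))
            for name, values in columns.items()
            if values  # min()/max() are undefined for empty columns
        }

    # Each "query" below is a dictionary lookup, independent of the row count.
    def count(self, column: str) -> int:
        return self.stats[column].count

    def sum(self, column: str) -> int:
        return self.stats[column].total

    def min(self, column: str) -> int:
        return self.stats[column].minimum

    def max(self, column: str) -> int:
        return self.stats[column].maximum


# Illustrative values only; these are not rows from the ClickBench dataset.
table = PrecomputedTable({"ResolutionWidth": [1024, 1920, 1366, 1920]})
print(table.count("ResolutionWidth"), table.sum("ResolutionWidth"))
```

This only helps queries without filters or grouping; anything with a WHERE clause or GROUP BY still has to scan the data.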

rluvaton · 2024-11-23T17:27:00Z

The Polars test is in Python, so I don't think that is a fair comparison.

Dandandan · 2024-11-23T19:02:07Z

> The Polars test is in Python, so I don't think that is a fair comparison.

I don't think that's very relevant, as the execution is done natively (Polars is written in Rust). Likewise, DataFusion also would perform about the same in Python.

Running on the biggest machine and/or reading everything into memory actually gives Polars an advantage, so I think it's even more impressive to see the difference here.

I think it mainly shows the different optimizations done over the last ~2 years to make (tricky/expensive) aggregations fast. It may also show some problems in Polars, e.g. with compiling regexes.

waruto210 · 2024-12-03T12:53:44Z

I'm very pleased to see DataFusion achieving such results. However, I encountered some anomalies while trying to reproduce the benchmark, so I'd like to ask for some guidance.
Following the scripts in the ClickBench repository, I ran ClickBench on partitioned parquet files.
During the cold run phase, DataFusion was about 20% faster than ClickHouse, but in the hot run phase, DataFusion was about 20% slower than ClickHouse.
We used a machine with specifications similar to c6a.4xlarge, featuring 16 vCPUs, 32GB of memory, and an SSD with 2GB/s bandwidth. Additionally, we ran ClickBench on a machine with similar specifications but using HDD, and the results were consistent - DataFusion was slower than ClickHouse in the hot run phase.
This was quite unexpected, and I'd like to know if there might be some configuration/compilation parameters that could be causing this issue.

@alamb I would really appreciate any advice you could give when you have a moment.

alamb · 2024-12-03T12:59:02Z

> @alamb I would really appreciate any advice you could give when you have a moment.

I think we would need some detailed profiling to know for sure, but I suspect that ClickHouse has non-trivial caches (buffer caching, page caches, etc.).

DataFusion, as a serverless engine, does not have any such caching. The only difference between the cold and hot runs is that on the hot run, data from disk will be in the Linux page cache (so it may not do any actual IO).

It might also help to break down which queries showed the biggest discrepancy. Were they queries that already ran in 100ms (in which case caching and avoiding re-reading metadata might be a bigger part of processing)?
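
The per-query breakdown suggested above could be sketched as follows. This is a minimal illustration and not part of the ClickBench tooling: it assumes each engine's timings are available as a list with one `[cold, hot1, hot2]` triple of seconds per query (the shape ClickBench result files appear to use), and the function names and sample numbers are hypothetical.

```python
from typing import Optional

Runs = list[Optional[float]]  # [cold, hot1, hot2]; None marks a failed run


def hot_time(runs: Runs) -> Optional[float]:
    """Best hot-run time: the fastest of the runs after the first (cold) one."""
    hot = [t for t in runs[1:] if t is not None]
    return min(hot) if hot else None


def biggest_discrepancies(ours: list[Runs], theirs: list[Runs], top: int = 5):
    """Return (query_index, our_hot, their_hot, ratio), largest ratio first."""
    rows = []
    for index, (our_runs, their_runs) in enumerate(zip(ours, theirs)):
        a, b = hot_time(our_runs), hot_time(their_runs)
        if a is None or b is None or b == 0:
            continue  # skip queries that failed or have no usable baseline
        rows.append((index, a, b, a / b))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows[:top]


# Hypothetical timings in seconds, for illustration only (not measured results).
datafusion = [[0.50, 0.05, 0.04], [2.00, 1.00, 1.10], [3.00, 0.90, 0.80]]
clickhouse = [[0.40, 0.02, 0.02], [2.50, 1.20, 1.30], [2.00, 1.00, 1.00]]

for index, ours_s, theirs_s, ratio in biggest_discrepancies(datafusion, clickhouse):
    print(f"Q{index}: {ours_s:.3f}s vs {theirs_s:.3f}s ({ratio:.2f}x)")
```

Queries that show a large ratio but a very small absolute time are the ones where fixed costs such as metadata reading are most likely to dominate.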

waruto210 · 2024-12-03T13:06:57Z

> > @alamb I would really appreciate any advice you could give when you have a moment.

> I think we would need some detailed profiling to know for sure, but I suspect that ClickHouse has non-trivial caches (buffer caching, page caches, etc.).

> DataFusion, as a serverless engine, does not have any such caching. The only difference between the cold and hot runs is that on the hot run, data from disk will be in the Linux page cache (so it may not do any actual IO).

> It might also help to break down which queries showed the biggest discrepancy. Were they queries that already ran in 100ms (in which case caching and avoiding re-reading metadata might be a bigger part of processing)?

For parquet files, ClickHouse uses local mode. In my understanding, in local mode, ClickHouse, like DataFusion, is a stateless query engine with only Linux page cache available. So I'm very surprised by these results. I will conduct more experiments to try to find out the reason.

waruto210 · 2024-12-04T08:11:00Z

> > @alamb I would really appreciate any advice you could give when you have a moment.

> I think we would need some detailed profiling to know for sure, but I suspect that ClickHouse has non-trivial caches (buffer caching, page caches, etc.).

> DataFusion, as a serverless engine, does not have any such caching. The only difference between the cold and hot runs is that on the hot run, data from disk will be in the Linux page cache (so it may not do any actual IO).

> It might also help to break down which queries showed the biggest discrepancy. Were they queries that already ran in 100ms (in which case caching and avoiding re-reading metadata might be a bigger part of processing)?

After conducting more experiments, I made some unexpected discoveries:

In the public ClickBench results, ClickHouse was using a version newer than 24.11, while our server had 24.1/24.3 installed. Therefore, I re-ran the benchmark using the latest version, 24.12, and this time the results were similar to those on the ClickBench website: DataFusion was faster than ClickHouse in both the cold run and hot run phases, and these results were consistently reproducible. This suggests that recent updates to ClickHouse have led to a decline in its query performance for Parquet files; in the earlier versions, ClickHouse still had better performance during the hot run phase.

@alamb FYI

alamb · 2024-12-06T20:57:14Z

Thank you for the update @waruto210

jayzhan211 · 2024-12-08T07:51:15Z

> > > @alamb I would really appreciate any advice you could give when you have a moment.

> > I think we would need some detailed profiling to know for sure, but I suspect that ClickHouse has non-trivial caches (buffer caching, page caches, etc.).
> > DataFusion, as a serverless engine, does not have any such caching. The only difference between the cold and hot runs is that on the hot run, data from disk will be in the Linux page cache (so it may not do any actual IO).
> > It might also help to break down which queries showed the biggest discrepancy. Were they queries that already ran in 100ms (in which case caching and avoiding re-reading metadata might be a bigger part of processing)?

> After conducting more experiments, I made some unexpected discoveries:

> In the public ClickBench results, ClickHouse was using a version newer than 24.11, while our server had 24.1/24.3 installed. Therefore, I re-ran the benchmark using the latest version, 24.12, and this time the results were similar to those on the ClickBench website: DataFusion was faster than ClickHouse in both the cold run and hot run phases, and these results were consistently reproducible. This suggests that recent updates to ClickHouse have led to a decline in its query performance for Parquet files; in the earlier versions, ClickHouse still had better performance during the hot run phase.

> @alamb FYI

Do you know on which queries we still lag behind the old version of ClickHouse?

waruto210 · 2024-12-09T03:08:24Z

> Do you know on which queries we still lag behind the old version of ClickHouse?

On our test machines, roughly the following queries are slower:

Q4: SELECT count(DISTINCT UserID) FROM hits LIMIT 20000
Q8: SELECT RegionID, count(DISTINCT UserID) AS u FROM hits GROUP BY RegionID ORDER BY u DESC LIMIT 10
Q13: SELECT SearchPhrase, count(DISTINCT UserID) AS u FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY u DESC LIMIT 10
Q29: SELECT sum(ResolutionWidth), sum(ResolutionWidth + 1), sum(ResolutionWidth + 2), sum(ResolutionWidth + 3), sum(ResolutionWidth + 4), sum(ResolutionWidth + 5), sum(ResolutionWidth + 6), sum(ResolutionWidth + 7), sum(ResolutionWidth + 8), sum(ResolutionWidth + 9), sum(ResolutionWidth + 10), sum(ResolutionWidth + 11), sum(ResolutionWidth + 12), sum(ResolutionWidth + 13), sum(ResolutionWidth + 14), sum(ResolutionWidth + 15), sum(ResolutionWidth + 16), sum(ResolutionWidth + 17), sum(ResolutionWidth + 18), sum(ResolutionWidth + 19), sum(ResolutionWidth + 20), sum(ResolutionWidth + 21), sum(ResolutionWidth + 22), sum(ResolutionWidth + 23), sum(ResolutionWidth + 24), sum(ResolutionWidth + 25), sum(ResolutionWidth + 26), sum(ResolutionWidth + 27), sum(ResolutionWidth + 28), sum(ResolutionWidth + 29), sum(ResolutionWidth + 30), sum(ResolutionWidth + 31), sum(ResolutionWidth + 32), sum(ResolutionWidth + 33), sum(ResolutionWidth + 34), sum(ResolutionWidth + 35), sum(ResolutionWidth + 36), sum(ResolutionWidth + 37), sum(ResolutionWidth + 38), sum(ResolutionWidth + 39), sum(ResolutionWidth + 40), sum(ResolutionWidth + 41), sum(ResolutionWidth + 42), sum(ResolutionWidth + 43), sum(ResolutionWidth + 44), sum(ResolutionWidth + 45), sum(ResolutionWidth + 46), sum(ResolutionWidth + 47), sum(ResolutionWidth + 48), sum(ResolutionWidth + 49), sum(ResolutionWidth + 50), sum(ResolutionWidth + 51), sum(ResolutionWidth + 52), sum(ResolutionWidth + 53), sum(ResolutionWidth + 54), sum(ResolutionWidth + 55), sum(ResolutionWidth + 56), sum(ResolutionWidth + 57), sum(ResolutionWidth + 58), sum(ResolutionWidth + 59), sum(ResolutionWidth + 60), sum(ResolutionWidth + 61), sum(ResolutionWidth + 62), sum(ResolutionWidth + 63), sum(ResolutionWidth + 64), sum(ResolutionWidth + 65), sum(ResolutionWidth + 66), sum(ResolutionWidth + 67), sum(ResolutionWidth + 68), sum(ResolutionWidth + 69), sum(ResolutionWidth + 70), sum(ResolutionWidth + 71), sum(ResolutionWidth + 72), sum(ResolutionWidth + 73), 
sum(ResolutionWidth + 74), sum(ResolutionWidth + 75), sum(ResolutionWidth + 76), sum(ResolutionWidth + 77), sum(ResolutionWidth + 78), sum(ResolutionWidth + 79), sum(ResolutionWidth + 80), sum(ResolutionWidth + 81), sum(ResolutionWidth + 82), sum(ResolutionWidth + 83), sum(ResolutionWidth + 84), sum(ResolutionWidth + 85), sum(ResolutionWidth + 86), sum(ResolutionWidth + 87), sum(ResolutionWidth + 88), sum(ResolutionWidth + 89) FROM hits LIMIT 20000
Q35: SELECT ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3, count(*) AS c FROM hits GROUP BY ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3 ORDER BY c DESC LIMIT 10

alamb added the enhancement New feature or request label Oct 24, 2024

alamb changed the title ~~Update ClickBench benchmarks with DataFusion 43~~ Update ClickBench benchmarks with DataFusion `43.0.0 Oct 24, 2024

alamb changed the title ~~Update ClickBench benchmarks with DataFusion `43.0.0~~ Update ClickBench benchmarks with DataFusion 43.0.0 Oct 24, 2024

alamb mentioned this issue Oct 24, 2024

Introduce binary_as_string parquet option, upgrade to arrow/parquet 53.2.0 #12816

Merged

github-actions bot assigned pmcgleenon Nov 13, 2024

alamb mentioned this issue Nov 15, 2024

[DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821

Closed

alamb closed this as completed Nov 18, 2024

alamb mentioned this issue Nov 20, 2024

Nov 20. 2024: This week in DataFusion #13503

Closed

3 tasks
