Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ClickBench benchmarks with DataFusion 43.0.0 #13099

Closed
1 task done
alamb opened this issue Oct 24, 2024 · 28 comments
Closed
1 task done

Update ClickBench benchmarks with DataFusion 43.0.0 #13099

alamb opened this issue Oct 24, 2024 · 28 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Oct 24, 2024

Is your feature request related to a problem or challenge?

Like #11567

Requires

Once DataFusion 43.0.0 is released, It would be great to update ClickBench https://benchmark.clickhouse.com/ with runs from the latest version. It looks like we are still reporting numbers for DataFusion 40 and there have been significant improvements since then. See for more details:

Describe the solution you'd like

Perhaps we can follow the model of ClickHouse/ClickBench#210 (thanks @pmcgleenon )

We will also need to update DataFusion to apply the new binary_as_string option added by @goldmedal in #12816. TLDR is that we need to update the create table statements to have the OPTIONS ('binary_as_string' 'true') clause

https://github.com/ClickHouse/ClickBench/blob/main/datafusion/create_partitioned.sql

CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'partitioned'
OPTIONS ('binary_as_string' 'true');

Note this is the same as the DuckDB runner, as explained in #12788

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Oct 24, 2024
@alamb alamb changed the title Update ClickBench benchmarks with DataFusion 43 Update ClickBench benchmarks with DataFusion `43.0.0 Oct 24, 2024
@alamb alamb changed the title Update ClickBench benchmarks with DataFusion `43.0.0 Update ClickBench benchmarks with DataFusion 43.0.0 Oct 24, 2024
@pmcgleenon
Copy link
Contributor

Hi @alamb are we good to proceed with this now that 43.0 has been released (#13254)?

@alamb
Copy link
Contributor Author

alamb commented Nov 13, 2024

Hi @pmcgleenon

I just double checked and this should be good to go now -- the clickbench runner uses datafusion-cli rather than the python bindings (which haven't yet released a version 43.0.0)

Thank you so much

@pmcgleenon
Copy link
Contributor

take

@pmcgleenon
Copy link
Contributor

Hi @alamb here are the initial results.

I've created a composite html file with results for DataFusion releases 33, 34, 36, 40 and 43 so we can compare across the previous releases.

clickbench.html.zip

Single results

Screenshot 2024-11-15 at 09 08 45

Partitioned results

Screenshot 2024-11-15 at 09 12 14

Before running the tests I've updated the SQL to include OPTIONS ('binary_as_string' 'true')
ClickHouse/ClickBench@30fa096

It looks like we've seen some significant gains across a number of queries. Are these results in line with your expectations?

If so I can update the Clickbench to update it with the latest results.

@alamb
Copy link
Contributor Author

alamb commented Nov 15, 2024

It looks like we've seen some significant gains across a number of queries. Are these results in line with your expectations?

Yes, indeed, @korowa @Rachelint @jayzhan211 @XiangpengHao and many others have been hard at work improving the speed. For an idea of what was done: #12821 (comment)

It is sad that we are not yet the fastest engine but interestingly it looks like many places where we are the farthest from the top are the really fast queries (that take 100s of ms)

Thank you again @pmcgleenon

@Dandandan
Copy link
Contributor

Looks like we're the fastest for Parquet on c6a.4xlarge? The difference is 16 cores vs 192 cores (and not shared/virtualized).
image

@alamb
Copy link
Contributor Author

alamb commented Nov 15, 2024

YES! you are totally right @Dandandan

With some more finagling, I filtered for only c6a.metal 500gb gp2:

Screenshot 2024-11-15 at 5 35 26 AM

WOOHOO -- once this gets published it will be time to write a blog post !

@Rachelint
Copy link
Contributor

After profiling, I believe we still can get obvious improvement after finishing #11943 and its related epic!

@pmcgleenon
Copy link
Contributor

It is sad that we are not yet the fastest engine but interestingly it looks like many places where we are the farthest from the top are the really fast queries (that take 100s of ms)

That's an interesting observation. Maybe we can look at benchmarking on c6a.metal at some point - it might highlight some more areas which could be improved and would show how DataFusion scales out to 192 vCPUs

@pmcgleenon
Copy link
Contributor

I've created a PR on ClickBench to update the results ClickHouse/ClickBench#251

Congratulations everyone on another amazing project milestone!

@alamb
Copy link
Contributor Author

alamb commented Nov 15, 2024

After profiling, I believe we still can get obvious improvement after finishing #11943 and its related epic!

Indeed. I also wanted to point out that item did not make it into 43.0.0 (it was merged shortly after we made the release candidate) so I expect the 44.0.0 release to be even better

@pmcgleenon
Copy link
Contributor

The ClickBench PR has been merged, so the v43 results are available at https://benchmark.clickhouse.com/

@alamb
Copy link
Contributor Author

alamb commented Nov 15, 2024

Thank you so much @pmcgleenon ! I'll file some follow on tickets as well

link

Screenshot 2024-11-15 at 8 04 23 AM

Bam!

I am going to write up a blog post about this as i thnik it is an excellent example of what it takes to make queries fast as well as a great example of team work (ticket for tracking blog: #13436)

@alamb
Copy link
Contributor Author

alamb commented Nov 18, 2024

Thanks again @pmcgleenon ❤️

@rluvaton
Copy link
Contributor

How is it compared to polars in performance in this benchmark?

@alamb
Copy link
Contributor Author

alamb commented Nov 22, 2024

How is it compared to polars in performance in this benchmark?

I don't think Polars can run SQL so it can't run this benchmark

@Dandandan
Copy link
Contributor

Polars is included in the benchmark (via in memory DataFrame API) and runs on the biggest machine.
It seems DataFusion is faster on quite some queries even though it reads parquet from disk and runs on a smaller machine.

@Dandandan
Copy link
Contributor

See results

@alamb
Copy link
Contributor Author

alamb commented Nov 22, 2024

See results

This is quite interesting. It seems like Polars may also compute min/max/count/sum during load as some queries take 0 time (but there is 275s of load time).

Still 🎣 this would be an excellent optimization for anyone else working on

@rluvaton
Copy link
Contributor

rluvaton commented Nov 23, 2024

The polars test is in python so I don't think that is a fair comparison

@Dandandan
Copy link
Contributor

Dandandan commented Nov 23, 2024

The polars test is in python so I don't think that is a fair comparison

I don't think that's very relevant, as the execution is done natively (Polars is written in Rust). Likewise, DataFusion also would perform about the same in Python.

Running on the biggest machine / and / or reading everything in memory actually gives Polars an advantage, so I think it’s even more impressive to see the difference here.

I think it mainly shows the different optimizations done last ~2 years to make (tricky/expensive) aggregations fast. It may also show some problems in Polars, e.g. with compiling regexes.

@waruto210
Copy link
Contributor

I'm very pleased to see DataFusion achieving such results. However, I encountered some anomalies while trying to reproduce the benchmark, so I'd like to ask for some guidance.
Following the scripts in the ClickBench repository, I ran ClickBench on partitioned parquet files.
During the cold run phase, DataFusion was about 20% faster than ClickHouse, but in the hot run phase, DataFusion was about 20% slower than ClickHouse.
We used a machine with specifications similar to c6a.4xlarge, featuring 16 vCPUs, 32GB of memory, and an SSD with 2GB/s bandwidth. Additionally, we ran ClickBench on a machine with similar specifications but using HDD, and the results were consistent - DataFusion was slower than ClickHouse in the hot run phase.
This was quite unexpected, and I'd like to know if there might be some configuration/compilation parameters that could be causing this issue.

@alamb I would really appreciate any advice you could give when you have a moment.

@alamb
Copy link
Contributor Author

alamb commented Dec 3, 2024

@alamb I would really appreciate any advice you could give when you have a moment.

I think we would have to get some detailed profiling to really know for sure, but I suspect that ClickBench has non trivial caches (buffer caching, page caches, etc)

DataFusion, as a serverless engine, does not have any such caching (the only difference between cold/hot run is that on the hot run, data from disk will be in the Linux page cache (so may not do any actual IO)

It might also help to break down which queries showed the biggest discrepancy -- were they queries that already ran in 100ms (in which case caching , avoiding re-reading metadata might be a bigger part of processing)

@waruto210
Copy link
Contributor

@alamb I would really appreciate any advice you could give when you have a moment.

I think we would have to get some detailed profiling to really know for sure, but I suspect that ClickBench has non trivial caches (buffer caching, page caches, etc)

DataFusion, as a serverless engine, does not have any such caching (the only difference between cold/hot run is that on the hot run, data from disk will be in the Linux page cache (so may not do any actual IO)

It might also help to break down which queries showed the biggest discrepancy -- were they queries that already ran in 100ms (in which case caching , avoiding re-reading metadata might be a bigger part of processing)

For parquet files, ClickHouse uses local mode. In my understanding, in local mode, ClickHouse, like DataFusion, is a stateless query engine with only Linux page cache available. So I'm very surprised by these results. I will conduct more experiments to try to find out the reason.

@waruto210
Copy link
Contributor

@alamb I would really appreciate any advice you could give when you have a moment.

I think we would have to get some detailed profiling to really know for sure, but I suspect that ClickBench has non trivial caches (buffer caching, page caches, etc)

DataFusion, as a serverless engine, does not have any such caching (the only difference between cold/hot run is that on the hot run, data from disk will be in the Linux page cache (so may not do any actual IO)

It might also help to break down which queries showed the biggest discrepancy -- were they queries that already ran in 100ms (in which case caching , avoiding re-reading metadata might be a bigger part of processing)

After conducting more experiments, I made some unexpected discoveries:

In the public clickbench results, Clickhouse was using a version newer than 24.11, while our server had 24.1/24.3 installed. Therefore, I re-ran the benchmark using the latest version 24.12, and this time, the results were similar to those on the clickbench website - Datafusion was faster than Clickhouse in both cold run and hot run phases, and these results were consistently reproducible. This means that recent updates to Clickhouse have led to a decline in its query performance for parquet files. In the earlier versions, Clickhouse still had better performance during the hot run phase.

@alamb FYI

@alamb
Copy link
Contributor Author

alamb commented Dec 6, 2024

Thank you for the update @waruto210

@jayzhan211
Copy link
Contributor

@alamb I would really appreciate any advice you could give when you have a moment.

I think we would have to get some detailed profiling to really know for sure, but I suspect that ClickBench has non trivial caches (buffer caching, page caches, etc)
DataFusion, as a serverless engine, does not have any such caching (the only difference between cold/hot run is that on the hot run, data from disk will be in the Linux page cache (so may not do any actual IO)
It might also help to break down which queries showed the biggest discrepancy -- were they queries that already ran in 100ms (in which case caching , avoiding re-reading metadata might be a bigger part of processing)

After conducting more experiments, I made some unexpected discoveries:

In the public clickbench results, Clickhouse was using a version newer than 24.11, while our server had 24.1/24.3 installed. Therefore, I re-ran the benchmark using the latest version 24.12, and this time, the results were similar to those on the clickbench website - Datafusion was faster than Clickhouse in both cold run and hot run phases, and these results were consistently reproducible. This means that recent updates to Clickhouse have led to a decline in its query performance for parquet files. In the earlier versions, Clickhouse still had better performance during the hot run phase.

@alamb FYI

Do you know which queries are we still lag behind in the old version of clickhouse?

@waruto210
Copy link
Contributor

waruto210 commented Dec 9, 2024

Do you know which queries are we still lag behind in the old version of clickhouse?

On our test machines, roughly the following queries are slower:

Q4: SELECT count(DISTINCT UserID) FROM hits LIMIT 20000
Q8: SELECT RegionID, count(DISTINCT UserID) AS u FROM hits GROUP BY RegionID ORDER BY u DESC LIMIT 10
Q13:SELECT SearchPhrase, count(DISTINCT UserID) AS u FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY u DESC LIMIT 10
Q29: SELECT sum(ResolutionWidth), sum(ResolutionWidth + 1), sum(ResolutionWidth + 2), sum(ResolutionWidth + 3), sum(ResolutionWidth + 4), sum(ResolutionWidth + 5), sum(ResolutionWidth + 6), sum(ResolutionWidth + 7), sum(ResolutionWidth + 8), sum(ResolutionWidth + 9), sum(ResolutionWidth + 10), sum(ResolutionWidth + 11), sum(ResolutionWidth + 12), sum(ResolutionWidth + 13), sum(ResolutionWidth + 14), sum(ResolutionWidth + 15), sum(ResolutionWidth + 16), sum(ResolutionWidth + 17), sum(ResolutionWidth + 18), sum(ResolutionWidth + 19), sum(ResolutionWidth + 20), sum(ResolutionWidth + 21), sum(ResolutionWidth + 22), sum(ResolutionWidth + 23), sum(ResolutionWidth + 24), sum(ResolutionWidth + 25), sum(ResolutionWidth + 26), sum(ResolutionWidth + 27), sum(ResolutionWidth + 28), sum(ResolutionWidth + 29), sum(ResolutionWidth + 30), sum(ResolutionWidth + 31), sum(ResolutionWidth + 32), sum(ResolutionWidth + 33), sum(ResolutionWidth + 34), sum(ResolutionWidth + 35), sum(ResolutionWidth + 36), sum(ResolutionWidth + 37), sum(ResolutionWidth + 38), sum(ResolutionWidth + 39), sum(ResolutionWidth + 40), sum(ResolutionWidth + 41), sum(ResolutionWidth + 42), sum(ResolutionWidth + 43), sum(ResolutionWidth + 44), sum(ResolutionWidth + 45), sum(ResolutionWidth + 46), sum(ResolutionWidth + 47), sum(ResolutionWidth + 48), sum(ResolutionWidth + 49), sum(ResolutionWidth + 50), sum(ResolutionWidth + 51), sum(ResolutionWidth + 52), sum(ResolutionWidth + 53), sum(ResolutionWidth + 54), sum(ResolutionWidth + 55), sum(ResolutionWidth + 56), sum(ResolutionWidth + 57), sum(ResolutionWidth + 58), sum(ResolutionWidth + 59), sum(ResolutionWidth + 60), sum(ResolutionWidth + 61), sum(ResolutionWidth + 62), sum(ResolutionWidth + 63), sum(ResolutionWidth + 64), sum(ResolutionWidth + 65), sum(ResolutionWidth + 66), sum(ResolutionWidth + 67), sum(ResolutionWidth + 68), sum(ResolutionWidth + 69), sum(ResolutionWidth + 70), sum(ResolutionWidth + 71), sum(ResolutionWidth + 72), sum(ResolutionWidth + 73), sum(ResolutionWidth + 74), sum(ResolutionWidth + 75), sum(ResolutionWidth + 76), sum(ResolutionWidth + 77), sum(ResolutionWidth + 78), sum(ResolutionWidth + 79), sum(ResolutionWidth + 80), sum(ResolutionWidth + 81), sum(ResolutionWidth + 82), sum(ResolutionWidth + 83), sum(ResolutionWidth + 84), sum(ResolutionWidth + 85), sum(ResolutionWidth + 86), sum(ResolutionWidth + 87), sum(ResolutionWidth + 88), sum(ResolutionWidth + 89) FROM hits LIMIT 20000
Q35:SELECT ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3, count(*) AS c FROM hits GROUP BY ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3 ORDER BY c DESC LIMIT 10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

7 participants