-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update ClickBench benchmarks with DataFusion 43.0.0
#13099
Comments
43
43.0.0
Hi @pmcgleenon I just double checked and this should be good to go now -- the clickbench runner uses Thank you so much |
take |
Hi @alamb here are the initial results. I've created a composite html file with results for DataFusion releases 33, 34, 36, 40 and 43 so we can compare across the previous releases. Single results Partitioned results Before running the tests I've updated the SQL to include It looks like we've seen some significant gains across a number of queries. Are these results in line with your expectations? If so I can update the Clickbench to update it with the latest results. |
Yes, indeed, @korowa @Rachelint @jayzhan211 @XiangpengHao and many others have been hard at work improving the speed. For an idea of what was done: #12821 (comment) It is sad that we are not yet the fastest engine but interestingly it looks like many places where we are the farthest from the top are the really fast queries (that take 100s of ms) Thank you again @pmcgleenon |
YES! you are totally right @Dandandan With some more finagling, I filtered for only WOOHOO -- once this gets published it will be time to write a blog post ! |
After profiling, I believe we still can get obvious improvement after finishing #11943 and its related epic! |
That's an interesting observation. Maybe we can look at benchmarking on |
I've created a PR on ClickBench to update the results ClickHouse/ClickBench#251 Congratulations everyone on another amazing project milestone! |
Indeed. I also wanted to point out that item did not make it into 43.0.0 (it was merged shortly after we made the release candidate) so I expect the 44.0.0 release to be even better |
The ClickBench PR has been merged, so the v43 results are available at https://benchmark.clickhouse.com/ |
Thank you so much @pmcgleenon ! I'll file some follow on tickets as well Bam! I am going to write up a blog post about this as i thnik it is an excellent example of what it takes to make queries fast as well as a great example of team work (ticket for tracking blog: #13436) |
Thanks again @pmcgleenon ❤️ |
How is it compared to polars in performance in this benchmark? |
I don't think Polars can run SQL so it can't run this benchmark |
Polars is included in the benchmark (via in memory DataFrame API) and runs on the biggest machine. |
This is quite interesting. It seems like Polars may also compute min/max/count/sum during load as some queries take 0 time (but there is 275s of load time). Still 🎣 this would be an excellent optimization for anyone else working on |
The polars test is in python so I don't think that is a fair comparison |
I don't think that's very relevant, as the execution is done natively (Polars is written in Rust). Likewise, DataFusion also would perform about the same in Python. Running on the biggest machine / and / or reading everything in memory actually gives Polars an advantage, so I think it’s even more impressive to see the difference here. I think it mainly shows the different optimizations done last ~2 years to make (tricky/expensive) aggregations fast. It may also show some problems in Polars, e.g. with compiling regexes. |
I'm very pleased to see DataFusion achieving such results. However, I encountered some anomalies while trying to reproduce the benchmark, so I'd like to ask for some guidance. @alamb I would really appreciate any advice you could give when you have a moment. |
I think we would have to get some detailed profiling to really know for sure, but I suspect that ClickBench has non trivial caches (buffer caching, page caches, etc) DataFusion, as a serverless engine, does not have any such caching (the only difference between cold/hot run is that on the hot run, data from disk will be in the Linux page cache (so may not do any actual IO) It might also help to break down which queries showed the biggest discrepancy -- were they queries that already ran in 100ms (in which case caching , avoiding re-reading metadata might be a bigger part of processing) |
For parquet files, ClickHouse uses local mode. In my understanding, in local mode, ClickHouse, like DataFusion, is a stateless query engine with only Linux page cache available. So I'm very surprised by these results. I will conduct more experiments to try to find out the reason. |
After conducting more experiments, I made some unexpected discoveries: In the public clickbench results, Clickhouse was using a version newer than 24.11, while our server had 24.1/24.3 installed. Therefore, I re-ran the benchmark using the latest version 24.12, and this time, the results were similar to those on the clickbench website - Datafusion was faster than Clickhouse in both cold run and hot run phases, and these results were consistently reproducible. This means that recent updates to Clickhouse have led to a decline in its query performance for parquet files. In the earlier versions, Clickhouse still had better performance during the hot run phase. @alamb FYI |
Thank you for the update @waruto210 |
Do you know which queries are we still lag behind in the old version of clickhouse? |
On our test machines, roughly the following queries are slower: Q4: SELECT count(DISTINCT UserID) FROM hits LIMIT 20000
Q8: SELECT RegionID, count(DISTINCT UserID) AS u FROM hits GROUP BY RegionID ORDER BY u DESC LIMIT 10
Q13:SELECT SearchPhrase, count(DISTINCT UserID) AS u FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY u DESC LIMIT 10
Q29: SELECT sum(ResolutionWidth), sum(ResolutionWidth + 1), sum(ResolutionWidth + 2), sum(ResolutionWidth + 3), sum(ResolutionWidth + 4), sum(ResolutionWidth + 5), sum(ResolutionWidth + 6), sum(ResolutionWidth + 7), sum(ResolutionWidth + 8), sum(ResolutionWidth + 9), sum(ResolutionWidth + 10), sum(ResolutionWidth + 11), sum(ResolutionWidth + 12), sum(ResolutionWidth + 13), sum(ResolutionWidth + 14), sum(ResolutionWidth + 15), sum(ResolutionWidth + 16), sum(ResolutionWidth + 17), sum(ResolutionWidth + 18), sum(ResolutionWidth + 19), sum(ResolutionWidth + 20), sum(ResolutionWidth + 21), sum(ResolutionWidth + 22), sum(ResolutionWidth + 23), sum(ResolutionWidth + 24), sum(ResolutionWidth + 25), sum(ResolutionWidth + 26), sum(ResolutionWidth + 27), sum(ResolutionWidth + 28), sum(ResolutionWidth + 29), sum(ResolutionWidth + 30), sum(ResolutionWidth + 31), sum(ResolutionWidth + 32), sum(ResolutionWidth + 33), sum(ResolutionWidth + 34), sum(ResolutionWidth + 35), sum(ResolutionWidth + 36), sum(ResolutionWidth + 37), sum(ResolutionWidth + 38), sum(ResolutionWidth + 39), sum(ResolutionWidth + 40), sum(ResolutionWidth + 41), sum(ResolutionWidth + 42), sum(ResolutionWidth + 43), sum(ResolutionWidth + 44), sum(ResolutionWidth + 45), sum(ResolutionWidth + 46), sum(ResolutionWidth + 47), sum(ResolutionWidth + 48), sum(ResolutionWidth + 49), sum(ResolutionWidth + 50), sum(ResolutionWidth + 51), sum(ResolutionWidth + 52), sum(ResolutionWidth + 53), sum(ResolutionWidth + 54), sum(ResolutionWidth + 55), sum(ResolutionWidth + 56), sum(ResolutionWidth + 57), sum(ResolutionWidth + 58), sum(ResolutionWidth + 59), sum(ResolutionWidth + 60), sum(ResolutionWidth + 61), sum(ResolutionWidth + 62), sum(ResolutionWidth + 63), sum(ResolutionWidth + 64), sum(ResolutionWidth + 65), sum(ResolutionWidth + 66), sum(ResolutionWidth + 67), sum(ResolutionWidth + 68), sum(ResolutionWidth + 69), sum(ResolutionWidth + 70), sum(ResolutionWidth + 71), sum(ResolutionWidth + 72), sum(ResolutionWidth + 73), sum(ResolutionWidth + 74), sum(ResolutionWidth + 75), sum(ResolutionWidth + 76), sum(ResolutionWidth + 77), sum(ResolutionWidth + 78), sum(ResolutionWidth + 79), sum(ResolutionWidth + 80), sum(ResolutionWidth + 81), sum(ResolutionWidth + 82), sum(ResolutionWidth + 83), sum(ResolutionWidth + 84), sum(ResolutionWidth + 85), sum(ResolutionWidth + 86), sum(ResolutionWidth + 87), sum(ResolutionWidth + 88), sum(ResolutionWidth + 89) FROM hits LIMIT 20000
Q35:SELECT ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3, count(*) AS c FROM hits GROUP BY ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3 ORDER BY c DESC LIMIT 10 |
Is your feature request related to a problem or challenge?
Like #11567
Requires
Once DataFusion 43.0.0 is released, It would be great to update ClickBench https://benchmark.clickhouse.com/ with runs from the latest version. It looks like we are still reporting numbers for DataFusion 40 and there have been significant improvements since then. See for more details:
Describe the solution you'd like
Perhaps we can follow the model of ClickHouse/ClickBench#210 (thanks @pmcgleenon )
We will also need to update DataFusion to apply the new
binary_as_string
option added by @goldmedal in #12816. TLDR is that we need to update the create table statements to have theOPTIONS ('binary_as_string' 'true')
clausehttps://github.com/ClickHouse/ClickBench/blob/main/datafusion/create_partitioned.sql
Note this is the same as the DuckDB runner, as explained in #12788
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: