Update ClickBench benchmarks with DataFusion 40 #11567
Comments
FWIW I think the benchmarks will improve dramatically once we have completed the StringView work that @XiangpengHao is leading (#10918)
take
Sorry @alamb, I just found out that I cannot create an AWS account at this time. Is it fine to use an Azure VM, e.g. F16sv2, instead? If not, please unassign me.
I am not sure -- maybe @pmcgleenon could comment? I don't know if this is equivalent to the AWS machines
From what I understand, the previous datafusion ClickBench runs have been on this AWS EC2 instance:
If we want to compare DataFusion performance across different DataFusion versions (and spot any improvements or degradations), then sticking to the same machine spec will allow us to do this. If you check the ClickBench results many other databases also publish their results for the IMO we should stick with the same AWS instance for future runs. @alamb @xinlifoobar I'm happy to help out here if required!
by the way this is what I found comparing AWS
That would be amazing 🙏
@alamb @xinlifoobar Here are the results for df40 (attached is a file comparing 33, 34, 36 and 40). Are these results in line with your expectations? If so, I can create a PR on ClickBench to update the DataFusion results.
I would not expect much performance difference between 40 and the other versions, as we haven't done much on query performance recently. It appears that the really low latency queries have gotten slower (perhaps due to increasing overhead in planning or runtime somewhere). If these results are reproducible, I do think we should publish them to ClickBench. Thank you @pmcgleenon and @xinlifoobar
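One way to check which queries slowed down between two versions is to compare per-query timings directly. The sketch below is only an illustration: it assumes each query has a list of run times in seconds with the first run treated as the cold run, and all timings and names (`hot_time`, `slowdowns`, the 1.2x threshold) are made up for the example rather than taken from the attached results.

```python
from typing import List, Sequence, Tuple


def hot_time(runs: Sequence[float]) -> float:
    """Best time among the runs after the first (cold) one.

    Falls back to the only run if just one timing is available.
    """
    hot = runs[1:] or runs
    return min(hot)


def slowdowns(
    old: List[Sequence[float]],
    new: List[Sequence[float]],
    threshold: float = 1.2,  # hypothetical cutoff: flag queries >= 20% slower
) -> List[Tuple[int, float, float, float]]:
    """Return (query_index, old_time, new_time, ratio) for slower queries."""
    flagged = []
    for i, (o, n) in enumerate(zip(old, new)):
        o_t, n_t = hot_time(o), hot_time(n)
        if o_t > 0 and n_t / o_t >= threshold:
            flagged.append((i, o_t, n_t, n_t / o_t))
    return flagged


# Made-up timings for three queries, NOT real benchmark numbers.
old_results = [[0.10, 0.02, 0.02], [1.50, 1.20, 1.21], [3.0, 2.5, 2.6]]
new_results = [[0.12, 0.04, 0.05], [1.48, 1.19, 1.20], [3.1, 2.4, 2.5]]

for q, o, n, r in slowdowns(old_results, new_results):
    print(f"Q{q}: {o:.3f}s -> {n:.3f}s ({r:.2f}x)")
```

With a relative threshold like this, very short queries are flagged easily because a small absolute overhead (for example in planning) is a large fraction of their runtime.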
Thanks @alamb, I'll create a PR on the ClickBench repo to update the results
I've opened this PR for the DataFusion 40 results: ClickHouse/ClickBench#210
BTW I plan to spend some time tomorrow organizing / profiling the results to see if I can find some additional improvements to make.
I expect StringView to help, as well as @korowa's #11627. In case anyone else is interested, I did an analysis of the various benchmark query properties here: (e.g. that is how I determine the relative cardinalities / what types of queries) I think Q16/Q17/Q18 are all "high cardinality aggregates with multi-column group keys"
I looked at some short queries and found one potential improvement: #11719
I also looked at Q38:
SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "IsLink" <> 0 AND "IsDownload" = 0 GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
$ cargo run --release --bin dfbench -- clickbench --iterations 100 --path benchmarks/data/hits_partitioned --query 38
More than 50% of the time is spent doing snappy decoding (which we aren't likely to be able to improve).
12% of the time is spent reading string data from parquet (maybe StringView will help).
I am pretty sure Q18 would be helped by #9403 -- maybe we'll find a way to do that shortly:
SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
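To make the shape of Q18 concrete, here is a minimal Python sketch of a grouped count with a multi-column key followed by a top-10 selection. It is not how DataFusion implements aggregation; the function name `top_groups` and the sample rows are hypothetical, and the minute is derived from epoch seconds assuming UTC.

```python
import heapq
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[int, int, str]        # (UserID, EventTime in epoch seconds, SearchPhrase)
GroupKey = Tuple[int, int, str]   # (UserID, minute, SearchPhrase)


def top_groups(rows: Iterable[Row], limit: int = 10) -> List[Tuple[GroupKey, int]]:
    """Count rows per (UserID, minute, SearchPhrase) and keep the largest groups."""
    counts: Counter = Counter()
    for user_id, event_time, phrase in rows:
        minute = (event_time // 60) % 60  # minute-of-hour, assuming UTC
        # One hash table entry per distinct key tuple: with many distinct
        # users and phrases, the table grows close to the number of rows.
        counts[(user_id, minute, phrase)] += 1
    # Only `limit` groups are returned, but every group must be counted first.
    return heapq.nlargest(limit, counts.items(), key=lambda kv: kv[1])


# Made-up rows for illustration only.
sample = [(1, 60, "a"), (1, 90, "a"), (1, 130, "a"), (2, 60, "b")]
print(top_groups(sample, limit=1))
```

The sketch shows why such queries are called high cardinality: the final result is tiny, yet the intermediate state holds one entry for every distinct key combination.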
Hi @alamb, the ClickBench PR has been merged and the DataFusion version 40 results are now visible on the main ClickBench page https://benchmark.clickhouse.com/
Thank you so much @pmcgleenon 🙏 -- I am pretty excited to complete our in-progress work (like StringView and high cardinality aggregates) and run these again with a newer version of DataFusion
Is your feature request related to a problem or challenge?
Like #9404
DataFusion 40 has been released https://crates.io/crates/datafusion/40.0.0
It would be great to update ClickBench https://benchmark.clickhouse.com/ with runs from the latest version. It looks like we are still reporting numbers for DataFusion 36
Describe the solution you'd like
Perhaps we can follow the model of ClickHouse/ClickBench#178 (thanks @pmcgleenon ) or ClickHouse/ClickBench#145 (thanks @kmitchener )
Describe alternatives you've considered
No response
Additional context
No response