Update ClickBench benchmarks with DataFusion 40 #11567

Closed
alamb opened this issue Jul 20, 2024 · 18 comments

@alamb
Contributor

alamb commented Jul 20, 2024

Is your feature request related to a problem or challenge?

Like #9404

DataFusion 40 has been released https://crates.io/crates/datafusion/40.0.0

It would be great to update ClickBench https://benchmark.clickhouse.com/ with runs from the latest version. It looks like we are still reporting numbers for DataFusion 36

Describe the solution you'd like

Perhaps we can follow the model of ClickHouse/ClickBench#178 (thanks @pmcgleenon ) or ClickHouse/ClickBench#145 (thanks @kmitchener )

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jul 20, 2024
@alamb
Contributor Author

alamb commented Jul 20, 2024

FWIW I think the benchmarks will improve dramatically once we have completed the StringView work that @XiangpengHao is leading #10918

@xinlifoobar
Contributor

take

@xinlifoobar
Contributor

Sorry @alamb, I just found out that I cannot create an AWS account at this time. Is it fine to use an Azure VM, e.g., F16sv2, instead? If not, please unassign me...

@alamb
Contributor Author

alamb commented Jul 25, 2024

I am not sure -- maybe @pmcgleenon could comment? I don't know if this is equivalent to the AWS machines

@pmcgleenon
Contributor

From what I understand, the previous datafusion ClickBench runs have been on this AWS EC2 instance:

  • c6a.4xlarge
  • Amazon Linux 2 AMI
  • Root 500GB gp2 SSD
  • no EBS optimized
  • no instance store

If we want to compare datafusion performance across different datafusion versions (and spot any improvements/degradations) then sticking to the same machine spec will allow us to do this.

If you check the ClickBench results, many other databases also publish their results for the c6a.4xlarge AWS EC2 instance, so we can compare DataFusion results with DuckDB, ClickHouse, QuestDB, and many more.

IMO we should stick with the same AWS instance for future runs.

@alamb @xinlifoobar I'm happy to help out here if required!

@pmcgleenon
Contributor

pmcgleenon commented Jul 25, 2024

By the way, this is what I found comparing AWS c6a.4xlarge and Azure Standard_F16s_v2. It looks like the CPU clock speed is different, and there are some differences in the storage performance numbers.

@alamb
Contributor Author

alamb commented Jul 25, 2024

@alamb @xinlifoobar I'm happy to help out here if required!

That would be amazing 🙏

@pmcgleenon
Contributor

@alamb @xinlifoobar Here are the results for df40 (attached is a file comparing 33, 34, 36 and 40)

Single
Screenshot 2024-07-28 at 14 26 24

Partitioned
Screenshot 2024-07-28 at 14 26 09

df40.zip

Are these results in line with your expectations?

If so I can create a PR on Clickbench to update the datafusion results

@alamb
Contributor Author

alamb commented Jul 29, 2024

Are these results in line with your expectations?

I would expect that we didn't see much performance difference between 40 and the other versions as we haven't done much on query performance recently.

It appears that the really low latency queries have gotten slower (perhaps due to increasing overhead in planning or runtime somewhere).

If these results are reproducible, I do think we should publish them to ClickBench

Thank you @pmcgleenon and @xinlifoobar
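
For anyone who wants to check the "low latency queries got slower" observation themselves, here is a minimal sketch of a per-query comparison between two result files. It assumes each file is a JSON document with a `result` list holding one list of run times (in seconds, `null` for a failed run) per query, and it takes the fastest run of each query. The function names, the 10% threshold, and the numbers in `OLD` and `NEW` are made up for illustration; they are not the measurements from this thread.

```python
import json


def best_times(result_json: str) -> list[float]:
    """Return the fastest non-null run per query from a ClickBench-style result document."""
    doc = json.loads(result_json)
    times = []
    for runs in doc["result"]:
        valid = [t for t in runs if t is not None]
        # A query with no successful run is recorded as NaN and ignored later.
        times.append(min(valid) if valid else float("nan"))
    return times


def slower_queries(old: list[float], new: list[float], threshold: float = 1.10):
    """Return (position, old_time, new_time, ratio) for queries that slowed down by more than `threshold`."""
    out = []
    for i, (o, n) in enumerate(zip(old, new)):
        # Comparisons involving NaN are False, so failed queries are skipped.
        if o > 0 and n / o > threshold:
            out.append((i, o, n, n / o))
    return out


# Made-up example data (NOT real benchmark numbers).
OLD = '{"system": "old", "result": [[0.50, 0.02, 0.02], [1.20, 1.00, 1.00], [3.0, 2.0, 2.0]]}'
NEW = '{"system": "new", "result": [[0.55, 0.03, 0.03], [1.25, 1.00, 1.02], [3.0, 2.0, null]]}'

regressions = slower_queries(best_times(OLD), best_times(NEW))
for position, o, n, ratio in regressions:
    print(f"query at position {position}: {o:.3f}s -> {n:.3f}s ({ratio:.2f}x)")
```

Comparing the fastest run rather than the first one keeps cold-cache effects out of the picture, which matters most for the very short queries where a few milliseconds of planning overhead is a large relative change.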

@pmcgleenon
Contributor

Thanks @alamb I'll create a PR on the Clickbench repo to update the results

@pmcgleenon
Contributor

I've opened this PR for the Datafusion 40 results ClickHouse/ClickBench#210

@alamb
Contributor Author

alamb commented Jul 29, 2024

It appears that the really low latency queries have gotten slower (perhaps due to increasing overhead in planning or runtime somewhere).

BTW I plan to spend some time tomorrow organizing / profiling the results to see if I can find some additional improvements to make.

@pmcgleenon
Contributor

That sounds great!

One benefit of the Clickbench results is that we can easily compare with other projects. Overall datafusion (partitioned) seems competitive with ClickHouse and DuckDB on similar hardware. In some cases the datafusion performance is better.
The results highlight a few scenarios where DataFusion is behind (e.g. Q18, Q29, Q39).
It will be exciting to see how the StringView work and any other improvements will change this picture in the future!

Screenshot 2024-07-29 at 23 08 02

@alamb
Contributor Author

alamb commented Jul 30, 2024

I expect StringView to help as well as @korowa 's #11627

In case anyone else is interested, I did an analysis of the various benchmark query properties here:
https://docs.google.com/spreadsheets/d/1NZuh_dEs9gX5uEp8AQ3DfvkNfC6bXFayQtFkjeKjNxQ/edit?gid=0#gid=0

(that is how I determined the relative cardinalities and what types of queries they are)

Screenshot 2024-07-30 at 5 55 37 AM

I think Q16/Q17/Q18 are all "high cardinality aggregates with multi-column group keys"
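
To make "high cardinality with multi-column group keys" concrete, here is a small sketch that counts how many distinct groups a GROUP BY would produce for a given key. The helper `group_cardinality` and the tiny in-memory rows are hypothetical; the column names are borrowed from the `hits` table only for flavor.

```python
from collections import Counter


def group_cardinality(rows, key_columns):
    """Return (distinct_groups, total_rows) for a GROUP BY over `key_columns`."""
    groups = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return len(groups), len(rows)


# Made-up rows, not taken from the real dataset.
rows = [
    {"UserID": 1, "m": 5, "SearchPhrase": "a"},
    {"UserID": 1, "m": 5, "SearchPhrase": "a"},
    {"UserID": 1, "m": 6, "SearchPhrase": "a"},
    {"UserID": 2, "m": 5, "SearchPhrase": "b"},
    {"UserID": 3, "m": 7, "SearchPhrase": ""},
]

single = group_cardinality(rows, ["UserID"])
multi = group_cardinality(rows, ["UserID", "m", "SearchPhrase"])
print(single, multi)
```

Adding columns to the key can only keep or increase the number of distinct groups. When that number approaches the row count, the aggregation does very little actual combining and most of the cost goes into hashing and storing the keys.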

@alamb
Contributor Author

alamb commented Jul 30, 2024

I looked at some short queries and found one potential improvement #11719

I also looked at Q38

SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "IsLink" <> 0 AND "IsDownload" = 0 GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;

$ cargo run --release --bin dfbench -- clickbench --iterations 100 --path benchmarks/data/hits_partitioned --query 38
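
If you want to profile several queries the same way, a small helper can generate the same command line for each one. This is only a sketch: `dfbench_command` is a hypothetical helper, it uses just the flags shown in the command above, and the list of query numbers is an arbitrary example. It prints the commands rather than running them.

```python
import shlex


def dfbench_command(query: int, iterations: int = 100,
                    path: str = "benchmarks/data/hits_partitioned") -> list[str]:
    """Build the argv for one dfbench ClickBench run, using only the flags shown in this thread."""
    return [
        "cargo", "run", "--release", "--bin", "dfbench", "--",
        "clickbench",
        "--iterations", str(iterations),
        "--path", path,
        "--query", str(query),
    ]


# Arbitrary example selection of queries to profile.
for q in (16, 17, 18, 38):
    print(shlex.join(dfbench_command(q)))
```

Keeping the command as a list means it can be handed to `subprocess.run` unchanged, without any shell quoting concerns.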

More than 50% of the time is spent doing snappy decoding (which we aren't likely to be able to improve)

Screenshot 2024-07-30 at 6 40 44 AM

12% of the time is reading string data from parquet (maybe stringview will help)
10% of the time is spent decoding parquet metadata
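
As a rough sanity check on how much those two smaller slices could buy, here is a sketch using Amdahl's law. The 12% and 10% fractions come from the profile above; the 2x speedup for string reading is purely an assumed figure for illustration, not a measured or promised StringView result.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of the runtime becomes `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


string_read = 0.12  # share of time reading string data from parquet (from the profile)
metadata = 0.10     # share of time decoding parquet metadata (from the profile)

# Assumption: string reading becomes 2x faster.
print(f"{overall_speedup(string_read, 2.0):.3f}x")

# Upper bound: both slices disappear entirely.
print(f"{overall_speedup(string_read + metadata, float('inf')):.3f}x")
```

Even in the limiting case where both slices cost nothing, the query as a whole gets less than 1.3x faster, because the decompression share is untouched.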

Screenshot 2024-07-30 at 6 44 17 AM

@alamb
Contributor Author

alamb commented Jul 30, 2024

I am pretty sure Q18 would be helped with #9403 -- maybe we'll find a way to do that shortly

SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

@pmcgleenon
Contributor

Hi @alamb

the Clickbench PR has been merged

the Datafusion version 40 results are now visible on the main ClickBench page https://benchmark.clickhouse.com/

Screenshot 2024-08-02 at 14 44 33

@alamb alamb closed this as completed Aug 2, 2024
@alamb
Contributor Author

alamb commented Aug 2, 2024

Thank you so much @pmcgleenon 🙏 -- I am pretty excited to complete our in-progress work (like StringView and high cardinality aggregates) and run these again with a newer version of DataFusion
