Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DuckDB: Add in-memory results #274

Merged
merged 1 commit into from
Nov 28, 2024

Conversation

szarnyasg
Copy link
Contributor

This PR add DuckDB v1.1.3 in-memory results for a run on a c6a.metal instance.

end = timeit.default_timer()
print(end - start)

with open('queries.sql', 'r') as file:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the ClickBench repository,

  • in duckdb-memory/, the load and query steps are combined (this file), whereas
  • in duckdb/, the load and query steps are split (load.py and query.py).

I guess it would be easier to figure out the difference between both variants if this is made consistent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rschu1ze Thanks for the feedback!

The reason the duckdb-memory implementation combines load and queries in a single script is that DuckDB runs in-process so if there is no on-disk persistence, we have to run the load and the queries in the same process to keep the data in memory.

There are many DuckDB implementations for ClickBench now, so unifying them would make sense. The likely approach will be using a simple Python script that uses systems calls (such as https://github.com/ClickHouse/ClickBench/pull/274/files#diff-1d8a1d4c9a1a7c7e98e5ac68a2ad1c33b9b7125f482922a040026bfbc2976cffR27) to enforce cache eviction. Would this approach work for the duckdb/ implementations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that approach would work and a unification would generally make sense. But let's do this separately (--> new PR).

end = timeit.default_timer()
print(end - start)

with open('queries.sql', 'r') as file:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that approach would work and a unification would generally make sense. But let's do this separately (--> new PR).

@rschu1ze
Copy link
Member

I was able to reproduce the measurements on an 'c6a.metal' instance - thanks.

@rschu1ze rschu1ze merged commit 25d51fa into ClickHouse:main Nov 28, 2024
@szarnyasg szarnyasg deleted the duckdb-v1.1.3-in-memory branch November 29, 2024 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants