Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix polars #270

Merged
merged 10 commits into from
Nov 29, 2024
Merged

Fix polars #270

merged 10 commits into from
Nov 29, 2024

Conversation

ritchie46
Copy link
Contributor

@ritchie46 ritchie46 commented Nov 25, 2024

Hey, I am from the Polars project and was surprised Polars was added to ClickBench. When I looked at the implementation I saw that Polars was incorrectly used (#268). and sometimes even python functions were called where we have a native API for the operations.

This ensures that

  • Polars' queries can use the Lazy API and thus the optimizer and the query engine.
  • The statistics are computed in single query (similar to SQL). That ensures we can parallelize
  • Use Polars expressions and not map_elements with python udfs

Resolves: #268

@CLAassistant
Copy link

CLAassistant commented Nov 25, 2024

CLA assistant check
All committers have signed the CLA.

),
(
"Q2",
"SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;",
lambda x: (x["AdvEngineID"].sum(), x.shape[0], x["ResolutionWidth"].mean()),
lambda x: (x["AdvEngineID"].sum(), x.height, x["ResolutionWidth"].mean()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are 2 queries, where in SQL it is in a single query.

),
(
"Q3",
"SELECT AVG(UserID) FROM hits;",
lambda x: x["UserID"].mean(),
lambda x: x["UserID"].mean(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not lazy suitable. Forces a single engine. It is meant for notebooks.

@@ -318,15 +318,13 @@
.group_by("CounterID") # GROUP BY CounterID
.agg(
[
pl.col("URL")
.map_elements(lambda y: len(y), return_dtype=pl.Int64)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was wrong. Didn't compute the AVG and it uses a python function, where we have native expressions for this.

@@ -352,18 +350,14 @@
.group_by("k")
.agg(
[
pl.col("Referer").map_elements(
lambda y: len(y), return_dtype=pl.Int64
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was wrong. Didn't compute the AVG and it uses a python function, where we have native expressions for this.

stop = timeit.default_timer()
load_time = stop - start

# 0: No., 1: SQL, 2: Pandas, 3: Polars
queries = queries = [
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know why this was.

@rschu1ze rschu1ze self-assigned this Nov 25, 2024
@rschu1ze

This comment was marked as resolved.

@ritchie46
Copy link
Contributor Author

ritchie46 commented Nov 26, 2024

@rschu1ze Yeah, that's a pandas requirement. I don't know why there is pandas in the Polars benchmark anyway. 😅

I went ahead and removed the pandas code here. We can load parquet with Polars directly without an extra pandas or pyarrow requirement. I verified I can run (on a sliced dataset) locally. The total dataset is too big for my machine.

If there is interest I can follow up with a PR where we query directly from parquet/csv or other sources.

@ritchie46 ritchie46 marked this pull request as draft November 26, 2024 07:12
@ritchie46 ritchie46 marked this pull request as ready for review November 26, 2024 07:27
import json

hits = pd.read_parquet("hits.parquet")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can load parquet with Polars. No need to go via pandas. I also think that load time should reflect loading from disk instead of loading from pandas. Loading from pandas is mostly zero-copy.

)
.head(25)
),
"SELECT REGEXP_REPLACE(Referer, '(?-u)^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(STRLEN(Referer)) AS l, COUNT(*) AS c, MIN(Referer) FROM hits WHERE Referer <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Polars uses to rust Regex which defaults to unicode supported regex, whereas most engines don't. This can have huge search cost differences:

pola-rs/polars#5613

@rluvaton
Copy link

rluvaton commented Nov 26, 2024

If there is interest I can follow up with a PR where we query directly from parquet/csv or other sources.

Yes! and if you can run the benchmark c6a.4xlarge, 500gb gp2 (like in datafusion) it would be great!

You will need to set up an AWS EC2 machine with the desired instance type (see here). During setup, go with the defaults.

Then clone the ClickBench repository and run polars/benchmark.sh resp. datafusion/benchmarks.sh.

Originally posted by @rschu1ze in #264 (comment)

@rschu1ze
Copy link
Member

rschu1ze commented Nov 26, 2024

Yes! and if you can run the benchmark c6a.4xlarge, 500gb gp2 (like in datafusion) it would be great!

This relates to #264. I actually tried to run polars/benchmark.sh on my corporate c6a.4xlarge machine (using the version in this PR, now that the setup is fixed) ... and it froze my machine so badly that I was unable to connect via SSH or ping it for 2.5 hours, after which I killed the instance forcefully. Perhaps someone else has more luck.

@ritchie46
Copy link
Contributor Author

and it froze my machine so badly that I was unable to connect via SSH or ping it for 2.5 hours,

Yes, this version loads all data in memory which doesn't fit that machine. This was already the case, and I didn't adapt that in the PR.

I can also follow up with a PR where we load data from parquet. That doesn't materialize all data and is much less memory intensive.

@rschu1ze
Copy link
Member

I can also follow up with a PR where we load data from parquet. That doesn't materialize all data and is much less memory intensive.

That would be nice. Actually, we should try to do changes to the benchmark scripts / queries and the results themselves in the same PR. Otherwise, one of them will be outdated. So I guess my preference is if we could update all polars variants within this PR (and if it helps, I can setup myself a c6a.metal box which is way more powerful than c6a.4xlarge and try to redo the measurements).

@rschu1ze

This comment was marked as resolved.

@ritchie46

This comment was marked as resolved.

@ritchie46
Copy link
Contributor Author

@rschu1ze I have added parquet as a source as well. Made it so that we create 2 output jsons, indicating the source used.

Note that running from parquet has drastically reduced memory usage. The DataFrame case is just very intensive as we are forced to load all data in memory.

@rschu1ze
Copy link
Member

@ritchie46 Thanks ... I'll re-run measurements in the coming days!

@ritchie46
Copy link
Contributor Author

Great, thanks!

@rschu1ze
Copy link
Member

I both dataframes and parquet versions on c6a.4xlarge again but the system hang. Didn't check deeper why, instead I repeated the measurements on a c6a.metal machine. Pushed the measurements into this PR.

@rschu1ze rschu1ze merged commit 472df79 into ClickHouse:main Nov 29, 2024
@alexey-milovidov
Copy link
Member

@rschu1ze, the server's freeze means OOM. Please run it with prlimit as here:
https://github.com/ClickHouse/ClickBench/pull/267/files#diff-6ad3e2e988a332b62ba8cc145121a402e3372f9414784d7c9a93e7b6967e8be6R11

Some queries will be shown as failed, but the results are still good to publish (and then they can be improved).

@rschu1ze
Copy link
Member

Done: #279

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Polars benchmark are very suboptimal.
5 participants