Fix polars #270

ritchie46 · 2024-11-25T12:49:57Z

Hey, I am from the Polars project and was surprised Polars was added to ClickBench. When I looked at the implementation I saw that Polars was incorrectly used (#268). and sometimes even python functions were called where we have a native API for the operations.

This ensures that

Polars' queries can use the Lazy API and thus the optimizer and the query engine.
The statistics are computed in single query (similar to SQL). That ensures we can parallelize
Use Polars expressions and not map_elements with python udfs

Resolves: #268

CLAassistant · 2024-11-25T12:50:03Z

All committers have signed the CLA.

ritchie46 · 2024-11-25T12:50:27Z

polars/query.py

    ),
    (
        "Q2",
        "SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;",
        lambda x: (x["AdvEngineID"].sum(), x.shape[0], x["ResolutionWidth"].mean()),
-        lambda x: (x["AdvEngineID"].sum(), x.height, x["ResolutionWidth"].mean()),


These are 2 queries, where in SQL it is in a single query.

ritchie46 · 2024-11-25T12:51:58Z

polars/query.py

    ),
    (
        "Q3",
        "SELECT AVG(UserID) FROM hits;",
        lambda x: x["UserID"].mean(),
-        lambda x: x["UserID"].mean(),


This is not lazy suitable. Forces a single engine. It is meant for notebooks.

ritchie46 · 2024-11-25T12:52:45Z

polars/query.py

@@ -318,15 +318,13 @@
        .group_by("CounterID")  # GROUP BY CounterID
        .agg(
            [
-                pl.col("URL")
-                .map_elements(lambda y: len(y), return_dtype=pl.Int64)


This was wrong. Didn't compute the AVG and it uses a python function, where we have native expressions for this.

ritchie46 · 2024-11-25T12:52:49Z

polars/query.py

@@ -352,18 +350,14 @@
            .group_by("k")
            .agg(
                [
-                    pl.col("Referer").map_elements(
-                        lambda y: len(y), return_dtype=pl.Int64


This was wrong. Didn't compute the AVG and it uses a python function, where we have native expressions for this.

ritchie46 · 2024-11-25T12:55:41Z

polars/query.py

 stop = timeit.default_timer()
 load_time = stop - start

 # 0: No., 1: SQL, 2: Pandas, 3: Polars
-queries = queries = [


Don't know why this was.

ritchie46 · 2024-11-26T04:57:58Z

@rschu1ze Yeah, that's a pandas requirement. I don't know why there is pandas in the Polars benchmark anyway. 😅

I went ahead and removed the pandas code here. We can load parquet with Polars directly without an extra pandas or pyarrow requirement. I verified I can run (on a sliced dataset) locally. The total dataset is too big for my machine.

If there is interest I can follow up with a PR where we query directly from parquet/csv or other sources.

ritchie46 · 2024-11-26T07:33:45Z

polars/query.py

 import json

-hits = pd.read_parquet("hits.parquet")


We can load parquet with Polars. No need to go via pandas. I also think that load time should reflect loading from disk instead of loading from pandas. Loading from pandas is mostly zero-copy.

ritchie46 · 2024-11-26T07:50:09Z

polars/query.py

-            )
-            .head(25)
-        ),
+        "SELECT REGEXP_REPLACE(Referer, '(?-u)^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(STRLEN(Referer)) AS l, COUNT(*) AS c, MIN(Referer) FROM hits WHERE Referer <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;",


Polars uses to rust Regex which defaults to unicode supported regex, whereas most engines don't. This can have huge search cost differences:

pola-rs/polars#5613

rluvaton · 2024-11-26T12:43:46Z

If there is interest I can follow up with a PR where we query directly from parquet/csv or other sources.

Yes! and if you can run the benchmark c6a.4xlarge, 500gb gp2 (like in datafusion) it would be great!

You will need to set up an AWS EC2 machine with the desired instance type (see here). During setup, go with the defaults.

Then clone the ClickBench repository and run polars/benchmark.sh resp. datafusion/benchmarks.sh.

Originally posted by @rschu1ze in #264 (comment)

rschu1ze · 2024-11-26T12:55:19Z

Yes! and if you can run the benchmark c6a.4xlarge, 500gb gp2 (like in datafusion) it would be great!

This relates to #264. I actually tried to run polars/benchmark.sh on my corporate c6a.4xlarge machine (using the version in this PR, now that the setup is fixed) ... and it froze my machine so badly that I was unable to connect via SSH or ping it for 2.5 hours, after which I killed the instance forcefully. Perhaps someone else has more luck.

ritchie46 · 2024-11-26T15:18:55Z

and it froze my machine so badly that I was unable to connect via SSH or ping it for 2.5 hours,

Yes, this version loads all data in memory which doesn't fit that machine. This was already the case, and I didn't adapt that in the PR.

I can also follow up with a PR where we load data from parquet. That doesn't materialize all data and is much less memory intensive.

rschu1ze · 2024-11-27T20:42:49Z

I can also follow up with a PR where we load data from parquet. That doesn't materialize all data and is much less memory intensive.

That would be nice. Actually, we should try to do changes to the benchmark scripts / queries and the results themselves in the same PR. Otherwise, one of them will be outdated. So I guess my preference is if we could update all polars variants within this PR (and if it helps, I can setup myself a c6a.metal box which is way more powerful than c6a.4xlarge and try to redo the measurements).

ritchie46 · 2024-11-28T14:28:32Z

@rschu1ze I have added parquet as a source as well. Made it so that we create 2 output jsons, indicating the source used.

Note that running from parquet has drastically reduced memory usage. The DataFrame case is just very intensive as we are forced to load all data in memory.

rschu1ze · 2024-11-28T14:53:59Z

@ritchie46 Thanks ... I'll re-run measurements in the coming days!

ritchie46 · 2024-11-28T15:00:21Z

Great, thanks!

rschu1ze · 2024-11-29T11:43:43Z

I both dataframes and parquet versions on c6a.4xlarge again but the system hang. Didn't check deeper why, instead I repeated the measurements on a c6a.metal machine. Pushed the measurements into this PR.

alexey-milovidov · 2024-12-15T04:34:05Z

@rschu1ze, the server's freeze means OOM. Please run it with prlimit as here:
https://github.com/ClickHouse/ClickBench/pull/267/files#diff-6ad3e2e988a332b62ba8cc145121a402e3372f9414784d7c9a93e7b6967e8be6R11

Some queries will be shown as failed, but the results are still good to publish (and then they can be improved).

rschu1ze · 2024-12-15T20:44:00Z

Done: #279

ritchie46 added 2 commits November 25, 2024 13:44

fix: Fix Polars queries

1b5630d

single query

77c1057

ritchie46 commented Nov 25, 2024

View reviewed changes

pass lazy

72d910b

ritchie46 commented Nov 25, 2024

View reviewed changes

rschu1ze self-assigned this Nov 25, 2024

This comment was marked as resolved.

Sign in to view

ritchie46 marked this pull request as draft November 26, 2024 07:12

remove pandas and fix all queries

bd16ac3

ritchie46 marked this pull request as ready for review November 26, 2024 07:27

always update to latest

5b262bf

ritchie46 commented Nov 26, 2024

View reviewed changes

fix all queries

d12bfb6

ritchie46 force-pushed the fix_polars branch from 28fbc0b to d12bfb6 Compare November 26, 2024 07:41

don't do unicode regex

f0d1891

ritchie46 commented Nov 26, 2024

View reviewed changes

undo slice

c461342

This comment was marked as resolved.

Sign in to view

add parquet source

5b51092

Add results for c6a.metal

1d74427

rschu1ze merged commit 472df79 into ClickHouse:main Nov 29, 2024

rschu1ze mentioned this pull request Nov 29, 2024

Polars vs DataFusion benchmark results #264

Closed

rschu1ze mentioned this pull request Dec 15, 2024

Add Polars (Parquet) on c6a.4xlarge #279

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix polars #270

Fix polars #270

ritchie46 commented Nov 25, 2024 •

edited by rschu1ze

Loading

CLAassistant commented Nov 25, 2024 •

edited

Loading

ritchie46 Nov 25, 2024

ritchie46 Nov 25, 2024

ritchie46 Nov 25, 2024

ritchie46 Nov 25, 2024

ritchie46 Nov 25, 2024

This comment was marked as resolved.

ritchie46 commented Nov 26, 2024 •

edited

Loading

ritchie46 Nov 26, 2024

ritchie46 Nov 26, 2024

rluvaton commented Nov 26, 2024 •

edited

Loading

rschu1ze commented Nov 26, 2024 •

edited

Loading

ritchie46 commented Nov 26, 2024

rschu1ze commented Nov 27, 2024

This comment was marked as resolved.

This comment was marked as resolved.

ritchie46 commented Nov 28, 2024

rschu1ze commented Nov 28, 2024

ritchie46 commented Nov 28, 2024

rschu1ze commented Nov 29, 2024

alexey-milovidov commented Dec 15, 2024

rschu1ze commented Dec 15, 2024

Fix polars #270

Fix polars #270

Conversation

ritchie46 commented Nov 25, 2024 • edited by rschu1ze Loading

CLAassistant commented Nov 25, 2024 • edited Loading

ritchie46 Nov 25, 2024

Choose a reason for hiding this comment

ritchie46 Nov 25, 2024

Choose a reason for hiding this comment

ritchie46 Nov 25, 2024

Choose a reason for hiding this comment

ritchie46 Nov 25, 2024

Choose a reason for hiding this comment

ritchie46 Nov 25, 2024

Choose a reason for hiding this comment

This comment was marked as resolved.

ritchie46 commented Nov 26, 2024 • edited Loading

ritchie46 Nov 26, 2024

Choose a reason for hiding this comment

ritchie46 Nov 26, 2024

Choose a reason for hiding this comment

rluvaton commented Nov 26, 2024 • edited Loading

rschu1ze commented Nov 26, 2024 • edited Loading

ritchie46 commented Nov 26, 2024

rschu1ze commented Nov 27, 2024

This comment was marked as resolved.

This comment was marked as resolved.

ritchie46 commented Nov 28, 2024

rschu1ze commented Nov 28, 2024

ritchie46 commented Nov 28, 2024

rschu1ze commented Nov 29, 2024

alexey-milovidov commented Dec 15, 2024

rschu1ze commented Dec 15, 2024

ritchie46 commented Nov 25, 2024 •

edited by rschu1ze

Loading

CLAassistant commented Nov 25, 2024 •

edited

Loading

ritchie46 commented Nov 26, 2024 •

edited

Loading

rluvaton commented Nov 26, 2024 •

edited

Loading

rschu1ze commented Nov 26, 2024 •

edited

Loading