perf: Improve performance of CASE .. WHEN expressions #703

Merged on Jul 24, 2024 (3 commits)

@andygrove commented on Jul 22, 2024

Which issue does this PR close?

N/A

Rationale for this change

Two performance optimizations for CASE expressions were recently merged into DataFusion.

These help with two microbenchmarks where we were previously slower than Spark, and they also help with TPC-DS.
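
For readers unfamiliar with the two microbenchmarks, here is a minimal sketch of the SQL semantics they presumably exercise. The exact benchmark queries are not shown in this PR, so the two shapes below are an assumption inferred only from the benchmark names (`case_when_column_or_null` and `case_when_scalar`). The function names are hypothetical, and this is plain Python over lists, not DataFusion or Comet code.

```python
from typing import List, Optional

def case_when_column_or_null(pred: List[Optional[bool]],
                             col: List[Optional[int]]) -> List[Optional[int]]:
    """Assumed shape: CASE WHEN pred THEN col END (implicit ELSE NULL)."""
    # In SQL, a NULL predicate is not true, so it falls through to NULL.
    return [v if p is True else None for p, v in zip(pred, col)]

def case_when_scalar(pred: List[Optional[bool]],
                     then_value: int,
                     else_value: int) -> List[int]:
    """Assumed shape: CASE WHEN pred THEN <literal> ELSE <literal> END."""
    # Both branches are constants, so the result depends only on the predicate.
    return [then_value if p is True else else_value for p in pred]

pred = [True, False, None]
column_or_null_result = case_when_column_or_null(pred, [10, 20, 30])
scalar_result = case_when_scalar(pred, 1, 0)
```

Both shapes can be computed from the predicate in a single pass without evaluating each branch row by row, which is the kind of special case a columnar engine can exploit.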

Before

TPCDS Micro Benchmarks:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
case_when_column_or_null                                953           1005          47        302.2           3.3       1.0X
case_when_column_or_null: Comet (Scan)                  988           1011          20        291.5           3.4       1.0X
case_when_column_or_null: Comet (Scan, Exec)           1214           1237          24        237.2           4.2       0.8X

case_when_scalar                                        202            220          18        356.6           2.8       1.0X
case_when_scalar: Comet (Scan)                         1852           1869          19         38.9          25.7       0.1X
case_when_scalar: Comet (Scan, Exec)                    378            401          15        190.4           5.3       0.5X

After

TPCDS Micro Benchmarks:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
case_when_column_or_null                                942            989          41        305.6           3.3       1.0X
case_when_column_or_null: Comet (Scan)                  988           1002          11        291.5           3.4       1.0X
case_when_column_or_null: Comet (Scan, Exec)            857            894          26        336.2           3.0       1.1X

case_when_scalar                                        202            214           8        356.9           2.8       1.0X
case_when_scalar: Comet (Scan)                         1849           1861          16         38.9          25.7       0.1X
case_when_scalar: Comet (Scan, Exec)                    201            223          15        357.5           2.8       1.0X
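
To make the before/after comparison concrete, the following sketch computes the speedup of the "Comet (Scan, Exec)" rows from the Best Time(ms) column of the two tables above. The dictionary layout and the `speedup` helper are illustrative only; the numbers are copied from the tables.

```python
# Best Time (ms) for the "Comet (Scan, Exec)" rows, copied from the tables above.
best_ms = {
    "case_when_column_or_null": {"before": 1214, "after": 857},
    "case_when_scalar": {"before": 378, "after": 201},
}

def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of old to new run time; values above 1.0 mean the new run is faster."""
    return before_ms / after_ms

speedups = {name: speedup(t["before"], t["after"]) for name, t in best_ms.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

On these numbers the native execution path is roughly 1.4x faster for `case_when_column_or_null` and roughly 1.9x faster for `case_when_scalar`, which matches the Relative column moving from 0.8X to 1.1X and from 0.5X to 1.0X against Spark.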

What changes are included in this PR?

Upgrade to a newer DataFusion revision (b6e55d7e9).

How are these changes tested?

Existing tests

@andygrove changed the title from "Upgrade to DataFusion rev b6e55d7e9 to pick up some CASE optimizations" to "perf: Improve performance of CASE .. WHEN expressions" on Jul 22, 2024
"agg_low_cardinality",
"agg_sum_decimals_no_grouping",
"agg_sum_integers_no_grouping",
// "add_many_decimals",
Contributor

Did you want to keep these cases commented out?

Member Author

No. Thanks for catching that.

andygrove commented Jul 23, 2024

TPC-DS queries 2, 43, 59, 62, and 99 all benefited from the CASE optimizations. This fragment of a chart shows the speedup of this PR compared to Comet 0.1.0-rc2.

[Image: Screenshot from 2024-07-22 19-09-50]

@andygrove merged commit d4a8d68 into apache:main on Jul 24, 2024
74 checks passed
@andygrove deleted the latest-df branch on July 24, 2024 15:03
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request on Sep 7, 2024
* latest df

* revert changes to test

(cherry picked from commit d4a8d68)