Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Optimize for CASE WHEN .. THEN column ELSE null END #672

Closed
wants to merge 23 commits into from

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Jul 16, 2024

Which issue does this PR close?

Part of #669

Rationale for this change

CaseExpr is an expensive expression and there are some cases where can use specialized expressions instead.

For the usage CASE WHEN predicate THEN column ELSE null END which is seen in numerous TPC-DS queries, we can just evaluate the column and then change the null mask based on the predicate, which is a much faster operation (more than 10x faster than using the standard case expression).

What changes are included in this PR?

How are these changes tested?

I manually tested with this query.

select
    sum(case when (ss_quantity=1) then ss_sales_price else null end) sun_sales,
    sum(case when (ss_quantity=2) then ss_sales_price else null end) mon_sales,
    sum(case when (ss_quantity=3) then ss_sales_price else null end) tue_sales,
    sum(case when (ss_quantity=4) then ss_sales_price else null end) wed_sales,
    sum(case when (ss_quantity=5) then ss_sales_price else null end) thu_sales,
    sum(case when (ss_quantity=6) then ss_sales_price else null end) fri_sales,
    sum(case when (ss_quantity=7) then ss_sales_price else null end) sat_sales
from store_sales;

Here are the timings in seconds for five iterations.

Spark

[2.5691187381744385, 1.3875913619995117, 1.2080309391021729, 1.4287035465240479, 1.2525527477264404]

Comet before this PR

[4.076025724411011, 2.6956324577331543, 2.5712761878967285, 2.5598058700561523, 2.5548157691955566]

Comet with this PR

[3.372154951095581, 1.8503100872039795, 1.7506794929504395, 1.722181797027588, 1.696714162826538]

@andygrove andygrove changed the title Expr or null feat: Optimize for CASE WHEN predicate THEN expr ELSE null END Jul 16, 2024
@andygrove andygrove closed this Jul 16, 2024
@andygrove andygrove reopened this Jul 16, 2024
@andygrove
Copy link
Member Author

I also see a ~5% performance improvement with TPC-DS q43 @ sf=100

@andygrove andygrove changed the title feat: Optimize for CASE WHEN predicate THEN expr ELSE null END feat: Optimize for CASE WHEN .. THEN column ELSE null END Jul 16, 2024
@andygrove andygrove marked this pull request as ready for review July 16, 2024 07:45
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 33.74%. Comparing base (de8c55e) to head (49e53d8).
Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #672      +/-   ##
============================================
+ Coverage     33.69%   33.74%   +0.04%     
+ Complexity      840      835       -5     
============================================
  Files           109      109              
  Lines         42527    42529       +2     
  Branches       9343     9343              
============================================
+ Hits          14331    14352      +21     
+ Misses        25245    25211      -34     
- Partials       2951     2966      +15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@andygrove
Copy link
Member Author

@parthchandra @kazuyukitanimura @huaxingao @viirya This is ready for review now

@@ -541,6 +542,24 @@ impl PhysicalPlanner {
Some(self.create_expr(case_when.else_expr.as_ref().unwrap(), input_schema)?)
}
};

// optimized path for CASE WHEN predicate THEN expr ELSE null END
if else_phy_expr.is_none() && when_then_pairs.len() == 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we have added this as an optimization in DF itself?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed apache/datafusion#11484 to track up streaming this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFusion PR: apache/datafusion#11534

@andygrove andygrove requested a review from comphead July 19, 2024 19:26
@andygrove
Copy link
Member Author

Let's just wait to get these optimizations from upstream DataFusion. I can test against the lastest DF code when benchmarking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants