-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize CASE expression for "expr or expr" usage. #13953
base: main
Are you sure you want to change the base?
Conversation
Thanks @aweltsch -- this looks quite nice. I am running the benchmarks on my test rig to verify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @aweltsch -- this looks good and is a very nice first contribution
I think the only think it is missing is some end to end tests (.slt)
The instructions for adding such tests is here
https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md
Perhaps you can extend
https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/case.slt
My benchmark run shows about a 25% performance improvement. Nice work 🚀
++ critcmp main GH-11638-inner-loop
group GH-11638-inner-loop main
----- ------------------- ----
case_when: CASE expr 1.01 24.4±0.35µs ? ?/sec 1.00 24.3±0.28µs ? ?/sec
case_when: column or null 1.01 1493.9±5.22ns ? ?/sec 1.00 1481.8±3.38ns ? ?/sec
case_when: expr or expr 1.00 31.6±0.13µs ? ?/sec 1.24 39.3±1.34µs ? ?/sec
case_when: scalar or scalar 1.02 8.5±0.02µs ? ?/sec 1.00 8.3±0.02µs ? ?/sec
DataFusionError::Context( | ||
"WHEN expression did not return a BooleanArray".to_string(), | ||
Box::new(e), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFusionError::Context( | |
"WHEN expression did not return a BooleanArray".to_string(), | |
Box::new(e), | |
) | |
internal_datafusion_err!("WHEN expression did not return a BooleanArray") |
nit: We can assume all type checks have been done before, then inside this function all cast failures should be unreachable, so we can use internal error instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @2010YOUY01 for the feedback. I was not aware that all of the type-checking is guaranteed at this point.
One of my main motivations to have this here was to keep it consistent with the rest of the code in the file to minimize any deviation from the previous behavior. I can apply this change for the newly added code, what should happen to the rest of the code? Do you think it would make sense to add a new issue to clean-up the other functions to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, we can keep the code consistent now, and do clean-up later if possible
let e = self.else_expr.as_ref().unwrap(); | ||
// keep `else_expr`'s data type and return type consistent | ||
let expr = try_cast(Arc::clone(e), &batch.schema(), return_type.clone()) | ||
.unwrap_or_else(|_| Arc::clone(e)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is similar, we can return an internal error directly, and avoid propagating the casting failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is also used in all of the other evaluation methods for the CaseExpr
I would also like to include this in the same clean-up issue. Would this be fine for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be great, thank you
Thanks for your feedback @alamb, I have added a new .slt test case in the file you mentioned. From my POV it should cover all relevant cases for the predicate (true, false, null) with proper expressions in the branches. |
I added a follow-up issue #13990 |
Which issue does this PR close?
Closes #11638.
Rationale for this change
The objective of this PR is to optimize the
case_when: expr or expr
benchmark. I measured a small but consistent improvement of around 10% on this benchmark.What changes are included in this PR?
I implemented an additional evaluation method to improve
CASE WHEN condition THEN expr ELSE expr
performance.The implementation is supposed to be very close to the existing implementation of more general cases.
Are these changes tested?
I added a basic test case for the new evaluation method.
Are there any user-facing changes?
No, the changes should not affect the semantics of the CASE expression.