Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize CASE expression for "expr or expr" usage. #13953

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

aweltsch
Copy link

Which issue does this PR close?

Closes #11638.

Rationale for this change

The objective of this PR is to optimize the case_when: expr or expr benchmark. I measured a small but consistent improvement of around 10% on this benchmark.

What changes are included in this PR?

I implemented an additional evaluation method to improve CASE WHEN condition THEN expr ELSE expr performance.
The implementation is supposed to be very close to the existing implementation of more general cases.

Are these changes tested?

I added a basic test case for the new evaluation method.

Are there any user-facing changes?

No, the changes should not affect the semantics of the CASE expression.

@github-actions github-actions bot added the physical-expr Physical Expressions label Dec 31, 2024
@alamb
Copy link
Contributor

alamb commented Jan 1, 2025

Thanks @aweltsch -- this looks quite nice. I am running the benchmarks on my test rig to verify

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @aweltsch -- this looks good and is a very nice first contribution

I think the only think it is missing is some end to end tests (.slt)

The instructions for adding such tests is here
https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md

Perhaps you can extend
https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/case.slt

My benchmark run shows about a 25% performance improvement. Nice work 🚀

++ critcmp main GH-11638-inner-loop
group                          GH-11638-inner-loop                    main
-----                          -------------------                    ----
case_when: CASE expr           1.01     24.4±0.35µs        ? ?/sec    1.00     24.3±0.28µs        ? ?/sec
case_when: column or null      1.01   1493.9±5.22ns        ? ?/sec    1.00   1481.8±3.38ns        ? ?/sec
case_when: expr or expr        1.00     31.6±0.13µs        ? ?/sec    1.24     39.3±1.34µs        ? ?/sec
case_when: scalar or scalar    1.02      8.5±0.02µs        ? ?/sec    1.00      8.3±0.02µs        ? ?/sec

Comment on lines +412 to +415
DataFusionError::Context(
"WHEN expression did not return a BooleanArray".to_string(),
Box::new(e),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DataFusionError::Context(
"WHEN expression did not return a BooleanArray".to_string(),
Box::new(e),
)
internal_datafusion_err!("WHEN expression did not return a BooleanArray")

nit: We can assume all type checks have been done before, then inside this function all cast failures should be unreachable, so we can use internal error instead

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @2010YOUY01 for the feedback. I was not aware that all of the type-checking is guaranteed at this point.
One of my main motivations to have this here was to keep it consistent with the rest of the code in the file to minimize any deviation from the previous behavior. I can apply this change for the newly added code, what should happen to the rest of the code? Do you think it would make sense to add a new issue to clean-up the other functions to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we can keep the code consistent now, and do clean-up later if possible

let e = self.else_expr.as_ref().unwrap();
// keep `else_expr`'s data type and return type consistent
let expr = try_cast(Arc::clone(e), &batch.schema(), return_type.clone())
.unwrap_or_else(|_| Arc::clone(e));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is similar, we can return an internal error directly, and avoid propagating the casting failure

Copy link
Author

@aweltsch aweltsch Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is also used in all of the other evaluation methods for the CaseExpr I would also like to include this in the same clean-up issue. Would this be fine for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be great, thank you

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 2, 2025
@aweltsch
Copy link
Author

aweltsch commented Jan 2, 2025

Thanks for your feedback @alamb, I have added a new .slt test case in the file you mentioned. From my POV it should cover all relevant cases for the predicate (true, false, null) with proper expressions in the branches.

@aweltsch
Copy link
Author

aweltsch commented Jan 2, 2025

I added a follow-up issue #13990
I hope it is worded clearly and accurately reflects the changes desired. @2010YOUY01 feel free to chime in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize CASE expression for "expr or expr" usage
3 participants