-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Optimize CASE expression for "column or null" use case #11534
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -32,10 +32,29 @@ use datafusion_common::cast::as_boolean_array; | |
use datafusion_common::{exec_err, internal_err, DataFusionError, Result, ScalarValue}; | ||
use datafusion_expr::ColumnarValue; | ||
|
||
use datafusion_physical_expr_common::expressions::column::Column; | ||
use itertools::Itertools; | ||
|
||
type WhenThen = (Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>); | ||
|
||
#[derive(Debug, Hash)] | ||
enum EvalMethod { | ||
/// CASE WHEN condition THEN result | ||
/// [WHEN ...] | ||
/// [ELSE result] | ||
/// END | ||
CaseNoExpression, | ||
/// CASE expression | ||
/// WHEN value THEN result | ||
/// [WHEN ...] | ||
/// [ELSE result] | ||
/// END | ||
CaseWithExpression, | ||
/// This is a specialization for a specific use case where we can take a fast path | ||
/// CASE WHEN condition THEN column [ELSE NULL] END | ||
CaseColumnOrNull, | ||
} | ||
|
||
/// The CASE expression is similar to a series of nested if/else and there are two forms that | ||
/// can be used. The first form consists of a series of boolean "when" expressions with | ||
/// corresponding "then" expressions, and an optional "else" expression. | ||
|
@@ -61,6 +80,8 @@ pub struct CaseExpr { | |
when_then_expr: Vec<WhenThen>, | ||
/// Optional "else" expression | ||
else_expr: Option<Arc<dyn PhysicalExpr>>, | ||
/// Evaluation method to use | ||
eval_method: EvalMethod, | ||
} | ||
|
||
impl std::fmt::Display for CaseExpr { | ||
|
@@ -89,10 +110,22 @@ impl CaseExpr { | |
if when_then_expr.is_empty() { | ||
exec_err!("There must be at least one WHEN clause") | ||
} else { | ||
let eval_method = if expr.is_some() { | ||
EvalMethod::CaseWithExpression | ||
} else if when_then_expr.len() == 1 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was thinking about this optimization and I think it would be valid for any CASE expression that had a NULL ELSE, not just column So in other words, I think you could remove the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Generally, it is only safe to use this approach for expressions that are infallible. One of the main reasons that regular CaseExpr is expensive is that we need to only evaluate the "true" expression on rows where the predicate has evaluated to true. I do think that this could be extended beyond just Column expressions though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I will add a comment in the code to explain this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added more comments to explain the rationale and what is safe or not |
||
&& when_then_expr[0].1.as_any().is::<Column>() | ||
&& else_expr.is_none() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. From SQL planning, it will be |
||
{ | ||
EvalMethod::CaseColumnOrNull | ||
} else { | ||
EvalMethod::CaseNoExpression | ||
}; | ||
|
||
Ok(Self { | ||
expr, | ||
when_then_expr, | ||
else_expr, | ||
eval_method, | ||
}) | ||
} | ||
} | ||
|
@@ -256,6 +289,36 @@ impl CaseExpr { | |
|
||
Ok(ColumnarValue::Array(current_value)) | ||
} | ||
|
||
/// This function evaluates the specialized case of: | ||
/// | ||
/// CASE WHEN condition THEN column | ||
/// [ELSE NULL] | ||
/// END | ||
fn case_column_or_null(&self, batch: &RecordBatch) -> Result<ColumnarValue> { | ||
let when_expr = &self.when_then_expr[0].0; | ||
let then_expr = &self.when_then_expr[0].1; | ||
if let ColumnarValue::Array(bit_mask) = when_expr.evaluate(batch)? { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If it is evaluated to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am planning on adding a specialization for scalar values as well to avoid converting scalars to arrays There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. actually, nm, your question is different. Yes, we could implement special handling for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, I forgot that in this optimization, the when expression could only be a |
||
let bit_mask = bit_mask | ||
.as_any() | ||
.downcast_ref::<BooleanArray>() | ||
.expect("predicate should evaluate to a boolean array"); | ||
// invert the bitmask | ||
let bit_mask = not(bit_mask)?; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As a further optimization I think the when predicate can be transformed to be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks! I will look at this suggestion as a follow on. Great idea. |
||
match then_expr.evaluate(batch)? { | ||
ColumnarValue::Array(array) => { | ||
Ok(ColumnarValue::Array(nullif(&array, &bit_mask)?)) | ||
} | ||
ColumnarValue::Scalar(_) => Err(DataFusionError::Execution( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this might be better as an internal error You could also use the macros like internal_err!("expression did not evaluate to an array") |
||
"expression did not evaluate to an array".to_string(), | ||
)), | ||
} | ||
} else { | ||
Err(DataFusionError::Execution( | ||
"predicate did not evaluate to an array".to_string(), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with @viirya I think it would be fairly straightforward here to handle Given you say in https://github.com/apache/datafusion/pull/11534/files#r1683081792
I think we could definitely do it in a follow on PR |
||
)) | ||
} | ||
} | ||
} | ||
|
||
impl PhysicalExpr for CaseExpr { | ||
|
@@ -303,14 +366,21 @@ impl PhysicalExpr for CaseExpr { | |
} | ||
|
||
fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> { | ||
if self.expr.is_some() { | ||
// this use case evaluates "expr" and then compares the values with the "when" | ||
// values | ||
self.case_when_with_expr(batch) | ||
} else { | ||
// The "when" conditions all evaluate to boolean in this use case and can be | ||
// arbitrary expressions | ||
self.case_when_no_expr(batch) | ||
match self.eval_method { | ||
EvalMethod::CaseWithExpression => { | ||
// this use case evaluates "expr" and then compares the values with the "when" | ||
// values | ||
self.case_when_with_expr(batch) | ||
} | ||
EvalMethod::CaseNoExpression => { | ||
// The "when" conditions all evaluate to boolean in this use case and can be | ||
// arbitrary expressions | ||
self.case_when_no_expr(batch) | ||
} | ||
EvalMethod::CaseColumnOrNull => { | ||
// Specialization for CASE WHEN expr THEN column [ELSE NULL] END | ||
self.case_column_or_null(batch) | ||
} | ||
} | ||
} | ||
|
||
|
@@ -409,7 +479,7 @@ pub fn case( | |
#[cfg(test)] | ||
mod tests { | ||
use super::*; | ||
use crate::expressions::{binary, cast, col, lit}; | ||
use crate::expressions::{binary, cast, col, lit, BinaryExpr}; | ||
|
||
use arrow::buffer::Buffer; | ||
use arrow::datatypes::DataType::Float64; | ||
|
@@ -419,6 +489,7 @@ mod tests { | |
use datafusion_common::tree_node::{Transformed, TransformedResult, TreeNode}; | ||
use datafusion_expr::type_coercion::binary::comparison_coercion; | ||
use datafusion_expr::Operator; | ||
use datafusion_physical_expr_common::expressions::Literal; | ||
|
||
#[test] | ||
fn case_with_expr() -> Result<()> { | ||
|
@@ -998,6 +1069,53 @@ mod tests { | |
Ok(()) | ||
} | ||
|
||
#[test] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it would be good to make sure we had a Or we could start adding a file just for CASE if we are about to spend a bunch of time optimizing it 🤔 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I willl add slt tests There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am planning on more CASE optimizations so will create a separate file There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added the slt test. This is my first time using slt and it is very cool |
||
fn test_column_or_null_specialization() -> Result<()> { | ||
// create input data | ||
let mut c1 = Int32Builder::new(); | ||
let mut c2 = StringBuilder::new(); | ||
for i in 0..1000 { | ||
c1.append_value(i); | ||
if i % 7 == 0 { | ||
c2.append_null(); | ||
} else { | ||
c2.append_value(&format!("string {i}")); | ||
} | ||
} | ||
let c1 = Arc::new(c1.finish()); | ||
let c2 = Arc::new(c2.finish()); | ||
let schema = Schema::new(vec![ | ||
Field::new("c1", DataType::Int32, true), | ||
Field::new("c2", DataType::Utf8, true), | ||
]); | ||
let batch = RecordBatch::try_new(Arc::new(schema), vec![c1, c2]).unwrap(); | ||
|
||
// CaseWhenExprOrNull should produce same results as CaseExpr | ||
let predicate = Arc::new(BinaryExpr::new( | ||
make_col("c1", 0), | ||
Operator::LtEq, | ||
make_lit_i32(250), | ||
)); | ||
let expr = CaseExpr::try_new(None, vec![(predicate, make_col("c2", 1))], None)?; | ||
assert!(matches!(expr.eval_method, EvalMethod::CaseColumnOrNull)); | ||
match expr.evaluate(&batch)? { | ||
ColumnarValue::Array(array) => { | ||
assert_eq!(1000, array.len()); | ||
assert_eq!(785, array.null_count()); | ||
} | ||
_ => unreachable!(), | ||
} | ||
Ok(()) | ||
} | ||
|
||
fn make_col(name: &str, index: usize) -> Arc<dyn PhysicalExpr> { | ||
Arc::new(Column::new(name, index)) | ||
} | ||
|
||
fn make_lit_i32(n: i32) -> Arc<dyn PhysicalExpr> { | ||
Arc::new(Literal::new(ScalarValue::Int32(Some(n)))) | ||
} | ||
|
||
fn generate_case_when_with_type_coercion( | ||
expr: Option<Arc<dyn PhysicalExpr>>, | ||
when_thens: Vec<WhenThen>, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I'm wondering why this special format of case when is not optimized to
if/else
during query optimization.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think DataFusion has an
if/else
expression. We have one in Comet that could be upstreamed.