-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Optimize CASE expression for "column or null" use case #11534
Conversation
/// END | ||
CaseWithExpression, | ||
/// This is a specialization for a specific use case where we can take a fast path | ||
/// CASE WHEN condition THEN column [ELSE NULL] END |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I'm wondering why this special format of case when is not optimized to if/else
during query optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think DataFusion has an if/else
expression. We have one in Comet that could be upstreamed.
EvalMethod::CaseWithExpression | ||
} else if when_then_expr.len() == 1 | ||
&& when_then_expr[0].1.as_any().is::<Column>() | ||
&& else_expr.is_none() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If else
value is a null literal, is else_expr
still none? Or is it a null literal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From SQL planning, it will be None
, but it makes sense to add code here to normalize this for other use cases (other query engines that delegate to DataFusion). I have added this.
fn case_column_or_null(&self, batch: &RecordBatch) -> Result<ColumnarValue> { | ||
let when_expr = &self.when_then_expr[0].0; | ||
let then_expr = &self.when_then_expr[0].1; | ||
if let ColumnarValue::Array(bit_mask) = when_expr.evaluate(batch)? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is evaluated to Scalar
, maybe we still can work on it? I.e., we can convert the scalar value to an boolean array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am planning on adding a specialization for scalar values as well to avoid converting scalars to arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, nm, your question is different. Yes, we could implement special handling for CASE WHEN [true|false] THEN
but I'm not sure that is a real-world use case that is worth optimizing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I forgot that in this optimization, the when expression could only be a Column
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andygrove -- I think the performance improvement is pretty compelling and this PR does what it says.
I think it would be great to continue to optimize the CASE evaluation in follow on PRs. I bet if you filed some tickets others in the community would likely pitch in to help
ColumnarValue::Array(array) => { | ||
Ok(ColumnarValue::Array(nullif(&array, &bit_mask)?)) | ||
} | ||
ColumnarValue::Scalar(_) => Err(DataFusionError::Execution( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be better as an internal error
You could also use the macros like
internal_err!("expression did not evaluate to an array")
if when_then_expr.is_empty() { | ||
exec_err!("There must be at least one WHEN clause") | ||
} else { | ||
let eval_method = if expr.is_some() { | ||
EvalMethod::WithExpression | ||
} else if when_then_expr.len() == 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this optimization and I think it would be valid for any CASE expression that had a NULL ELSE, not just column
So in other words, I think you could remove the when_then_expr[0].1.as_any().is::<Column>()
check and this would still work fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, it is only safe to use this approach for expressions that are infallible. One of the main reasons that regular CaseExpr is expensive is that we need to only evaluate the "true" expression on rows where the predicate has evaluated to true.
I do think that this could be extended beyond just Column expressions though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a comment in the code to explain this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added more comments to explain the rationale and what is safe or not
} | ||
} else { | ||
Err(DataFusionError::Execution( | ||
"predicate did not evaluate to an array".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @viirya I think it would be fairly straightforward here to handle ColumnarValue::Scalar
s well. I don't think we need to do it in this PR
Given you say in https://github.com/apache/datafusion/pull/11534/files#r1683081792
I am planning on adding a specialization for scalar values as well to avoid converting scalars to arrays
I think we could definitely do it in a follow on PR
@@ -998,6 +1080,53 @@ mod tests { | |||
Ok(()) | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to make sure we had a slt
level test to cover this as well,
Or we could start adding a file just for CASE if we are about to spend a bunch of time optimizing it 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I willl add slt tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am planning on more CASE optimizations so will create a separate file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the slt test. This is my first time using slt and it is very cool
.downcast_ref::<BooleanArray>() | ||
.expect("predicate should evaluate to a boolean array"); | ||
// invert the bitmask | ||
let bit_mask = not(bit_mask)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a further optimization I think the when predicate can be transformed to be not(condition)
so it's possible to use bit_mask
directly instead of not(bit_mask)
.
This makes it possible to optimize/simplify a condition like x=1
to x!=1
instead of not(x=1)
, saving the invert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I will look at this suggestion as a follow on. Great idea.
Which issue does this PR close?
Closes #11484
Rationale for this change
Some TPC-DS queries include expressions such as
case when (d_day_name='Sunday') then ss_sales_price else null end
. The current implementation ofCaseExpr
is very expensive for an expression which is basicallyIF(expr, column, null)
. For this case we can take a fast path of just changing the null bitmask on the column.What changes are included in this PR?
The existing
evaluate
method already had conditional logic for choosing between two execution methods based on context. This PR adds a third that specializes for the "column or null" use case.The benchmarks were also improved.
Are these changes tested?
New test added for this optimization.
Are there any user-facing changes?
No, just faster queries in some cases.