[GLUTEN-3719][VL] Introduce VeloxIntermediateData to adjust velox agg func intermediate data #3721
Thanks for opening a pull request! Could you open an issue for this pull request on Github Issues? https://github.com/oap-project/gluten/issues Then could you also rename commit message and pull request title in the following format?
Run Gluten Clickhouse CI |
expressionNodes.add(ExpressionBuilder.makeSelection(colIdx, 2))
expressionNodes.add(ExpressionBuilder.makeSelection(colIdx, 3))
expressionNodes.add(ExpressionBuilder.makeSelection(colIdx, 0))
case _ @VeloxIntermediateData.Type(veloxTypes: Seq[DataType]) =>
I'm confused here. Why can we match aggregateFunction against VeloxIntermediateData.Type?
It uses the VeloxIntermediateData.Type.unapply method to extract veloxTypes from aggFunc. This is roughly equivalent to calling VeloxIntermediateData.Type.unapply(aggFunc) and binding the extracted result to veloxTypes.
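A minimal sketch of the extractor mechanism described in this reply, assuming nothing about the real Gluten code: `AggFunc`, `Avg`, `Sum`, `IntermediateTypes`, and the type strings are hypothetical stand-ins. It is written without the `_ @` binder used in the PR, which only names (and discards) the matched value.

```scala
// Hypothetical stand-ins; these are NOT the real Spark or Gluten classes.
sealed trait AggFunc
case object Avg extends AggFunc
case object Sum extends AggFunc

object IntermediateTypes {
  // Defining `unapply` is what lets `IntermediateTypes(...)` appear as a pattern.
  def unapply(f: AggFunc): Option[Seq[String]] = f match {
    case Avg => Some(Seq("double", "bigint")) // illustrative layout only
    case _   => None                          // None means the pattern does not match
  }
}

object ExtractorDemo {
  def describe(f: AggFunc): String = f match {
    // The compiler calls IntermediateTypes.unapply(f); on Some(v), `types` is bound to v.
    case IntermediateTypes(types) => "special: " + types.mkString(", ")
    // On None, matching falls through to the next case.
    case _ => "default"
  }

  def main(args: Array[String]): Unit = {
    println(describe(Avg)) // special: double, bigint
    println(describe(Sum)) // default
  }
}
```

Because `unapply` returns an `Option`, the same object decides both whether a function needs special handling and which types it yields, so the match site needs no separate check.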
val (sparkOrders, sparkTypes) =
aggFunc.aggBufferAttributes.map(attr => (attr.name, attr.dataType)).unzip
val veloxOrders = VeloxIntermediateData.veloxIntermediateDataOrder(aggFunc)
val adjustedOrders = sparkOrders.map(veloxOrders.indexOf(_))
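The name-based reordering in the hunk above can be sketched in isolation. The column names below are assumptions chosen for illustration, not values taken from this PR, and the `require` guard is an addition of this sketch rather than something the diff shows.

```scala
object ReorderDemo {
  // Assumed buffer column names in Spark's order (illustrative only).
  val sparkOrders: Seq[String] = Seq("n", "xAvg", "yAvg", "ck", "xMk", "yMk")
  // Assumed field order of the Velox intermediate row (illustrative only).
  val veloxOrders: Seq[String] = Seq("ck", "n", "xMk", "yMk", "xAvg", "yAvg")

  // For each Spark column, look up the position of the same-named Velox field.
  def adjustedOrders(spark: Seq[String], velox: Seq[String]): Seq[Int] = {
    val indices = spark.map(velox.indexOf(_))
    // indexOf returns -1 for a name that is absent, so fail early on a mismatch
    // instead of selecting a non-existent column later.
    val missing = spark.filterNot(name => velox.contains(name))
    require(missing.isEmpty, "unmatched names: " + missing.mkString(", "))
    indices
  }

  def main(args: Array[String]): Unit =
    println(adjustedOrders(sparkOrders, veloxOrders).mkString("[", ", ", "]"))
    // prints [1, 4, 5, 0, 2, 3]
}
```

With these assumed names, the permutation is derived from the two name lists instead of being written out by hand. A name carrying an unexpected suffix would trip the guard rather than silently yield `-1`.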
Is it enough to decide the order based on name equality? E.g., if attr.name contains suffix of exprId, would it fail to match with the string in veloxOrders?
I think the column names in aggBufferAttributes are fixed; they will not contain an exprId suffix. You can check the implementation of each aggregate function. For example,
@@ -133,56 +123,29 @@ case class HashAggregateExecTransformer(
case _ =>
throw new UnsupportedOperationException(s"${expr.mode} not supported.")
}
val aggFunc = expr.aggregateFunction
expr.aggregateFunction match {
Use aggFunc defined in the previous line.
Thanks for providing these details. Will merge after internal CI passes.
Looks good!
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====
What changes were proposed in this pull request?
If the order or type of intermediate data differs between a Velox aggregation function and its Spark counterpart, special handling is required. This PR introduces a VeloxIntermediateData object to handle these cases. For non-special cases, we continue to use the aggBufferAttributes of the aggregation function without any special matching. In future PRs, the remaining methods for handling such aggregation functions will be moved into the VeloxIntermediateData object.

In the applyExtractStruct function, a lot of code was written to match the intermediate data output by Velox with the column data in Spark's agg buffer. These code segments involved many index-order adjustments, which made them hard to read and obscured why such an ordering was necessary. For example, it is difficult to understand the significance of [1, 4, 5, 0, 2, 3].

In this PR, all these code segments have been reworked.
How was this patch tested?
Existing CI