-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add LogicalPlanStats
to logical plan nodes
#13618
base: main
Are you sure you want to change the base?
Conversation
cc @alamb , @berkaysynnada , @findepi I haven't created any issue for this change yet, but will do if we think it is worth doing such a huge breaking change. |
I love the idea in principle (vide #12604) There are however some important questions
Those observations lead to conclusions on the design what should and shouldn't be part of a plan (logical or physical):
To me, the best way to address these various needs would be
|
Yeah, I see your point. I think this all comes down to if we want to / have the resources to implement a new IR plan structure and refactor the existing analyzer / optimizer rules. (And it can be also a breaking change to projects that have their own rules...) Or, we want to / can adjust the existing logical plan to incorporate the above ideas without much API breaking changes and so keep the existing optimizer and the exsisting logical to physical plan conversion. |
The LogicalPlan loose typing, overly rich Expr syntax and close ties to the syntax tree are an ongoing maintenance tax we continue to pay with every new change, every new feature, so I do hope we accumulate enough decidedness to commence to a new IR. I personally would be very interested in working on this, but this also requires buy-in from other project maintainers. |
This is quite cool 🤯
I don't know if we have the resources to do this. I know I don't have the bandwidth to help drive it forward but there are now quite a few other high bandwidth maintainers who might be able to do so. My personal focus for the next few months is likely on making DataFusion more stable for existing systems, which is likely not exacly aligned with making major changes. However I think internal refactoring is possible (we did it with function representation --> all to udfs) it just needs sustained coding and organizational effort |
The pre-computed bitset to know what is in each subtree is a very neat idea |
I am cleaning up the review queue and this PR has a pile of conflicts. I think it is a neat POC type approach, but I don't think we are planning to proceed as is -- though please correct me if I got that wrong |
Hi @alamb, sure we can close this PR. If I get the above comments right, there will be a new IR plan structure and the current optimizer rules will be rewritten to work on that. Once DataFusion reached that point we can revisit this PR to improve traversal speed of optimizer rules. However, the idea in this PR is not specific to optimizer rules. It can improve analyzer rules (I think we want to keep them using The problem with |
Maybe that would make more sense. I just feel like our iteration code has gotten pretty intense lately (and it is already getting pretty hard to understand). I understand the need for optimization, but I think we also need to balance it with maintainability / readability |
Atually I just realized after reading @jayzhan211's comment that |
I didn't notice this change, it looks quite similar to what I proposed. What is the current challenge faced to? It seems like HUGE breaking change? |
Yeah, adding a new field to |
I quickly scan the change. I think we don't require to modify pub struct Projection {
/// The list of expressions
pub expr: Vec<Expr>,
/// The incoming logical plan
pub input: Arc<LogicalPlan>,
/// The schema description of the output
pub schema: DFSchemaRef,
// Any kind of computed information place here
properties: LogicalPlanProperties
} I guess this reduces about half the change. And given the approach in the above, we can also change the plan one by one (I guess), so we can iterate the change in a smaller and easy review one. Some plans such as |
Yeah, most likely we don't need The property ( |
Which issue does this PR close?
This a proof of concept PR to improve performance of
TreeNode
traversals. The main purpose of this PR is to demonstrate a promising but API breaking optimization.Rationale for this change
TreeNode
traversal APIs are crucial parts of query plan analysis and optimzation. This PR explores the idea of storing some pre-calculated statistics/properties of nodes (or subtrees) inside the nodes during creation and automatically update the values during transfomations.This PR focuses on logical plan building blocks (
LogicalPlan
,Expr
) and only one particular optimization that stores a bitset pattern in each node to describe the subtree content, but additional attributes/properties can be added in follow-up PRs like:Expr::get_type()
(LargeOR
list overflows the stack #9375 (comment), Proposal: introduced typed expressions, separate AST and IR #12604)and the idea can be extended to physical trees as well.
What changes are included in this PR?
To store the pre-calculated statistics/properties
LogicalPlanStats
struct fields are added to eachLogicalPlan
andExpr
nodes:This might look redundant but most likely this is the least intrusive way to the existing code. Pattern matching against
LogicalPlan
andExpr
just need to add a new param, while new enum constructor methods can be defined as lowercase version of the enum items (if there is no conflict with existing methods). The constructor methods calculate theLogicalPlanStats
fields based on main content of the node.Please note that even this approach requites quite a lot of API breaking changes in the codebase, but all are mechanical. The following table summarizes the before/after code to construct the enum items. (I just used the
_
prefix to avoid conflicts, but we can come up with better names.)Expr::Alias
Expr::alias_qualified
Expr::Column
Expr::column
Expr::ScalarVariable
Expr::scalar_variable
Expr::Literal
Expr::literal
Expr::BinaryExpr
Expr::binary_expr
Expr::Like
Expr::_like
Expr::SimilarTo
Expr::similar_to
Expr::Not
Expr::_not
Expr::IsNotNull
Expr::_is_not_null
Expr::IsNull
Expr::_is_null
Expr::IsTrue
Expr::_is_true
Expr::IsFalse
Expr::_is_false
Expr::IsUnknown
Expr::_is_unknown
Expr::IsNotTrue
Expr::_is_not_true
Expr::IsNotFalse
Expr::_is_not_false
Expr::IsNotUnknown
Expr::_is_not_unknown
Expr::Negative
Expr::negative
Expr::Between
Expr::_between
Expr::Case
Expr::case
Expr::Cast
Expr::cast
Expr::TryCast
Expr::ScalarFunction
Expr::scalar_function
Expr::AggregateFunction
Expr::aggregate_function
Expr::WindowFunction
Expr::window_function
Expr::InList
Expr::_in_list
Expr::Exists
Expr::exists
Expr::InSubquery
Expr::in_subquery
Expr::ScalarSubquery
Expr::scalar_subquery
Expr::Wildcard
Expr::wildcard
Expr::GroupingSet
Expr::grouping_set
Expr::Placeholder
Expr::placeholder
Expr::OuterReferenceColumn
Expr::outer_reference_column
Expr::Unnest
Expr::unnest
LogicalPlan::Projection
LogicalPlan::projection
LogicalPlan::Filter
LogicalPlan::filter
LogicalPlan::Window
LogicalPlan::window
LogicalPlan::Aggregate
LogicalPlan::aggregate
LogicalPlan::Sort
LogicalPlan::sort
LogicalPlan::Join
LogicalPlan::join
LogicalPlan::Repartition
LogicalPlan::repartition
LogicalPlan::Union
LogicalPlan::union
LogicalPlan::TableScan
LogicalPlan::table_ccan
LogicalPlan::EmptyRelation
LogicalPlan::empty_relation
LogicalPlan::Subquery
LogicalPlan::subquery
LogicalPlan::SubqueryAlias
LogicalPlan::subquery_alias
LogicalPlan::Limit
LogicalPlan::limit
LogicalPlan::Statement
LogicalPlan::statement
LogicalPlan::Values
LogicalPlan::values
LogicalPlan::Explain
LogicalPlan::explain
LogicalPlan::Analyze
LogicalPlan::analyze
LogicalPlan::Extension
LogicalPlan::extension
LogicalPlan::Distinct
LogicalPlan::distinct
LogicalPlan::Dml
LogicalPlan::dml
LogicalPlan::Ddl
LogicalPlan::ddl
LogicalPlan::Copy
LogicalPlan::copy
LogicalPlan::DescribeTable
LogicalPlan::describe_table
LogicalPlan::Unnest
LogicalPlan::unnest
LogicalPlan::RecursiveQuery
LogicalPlan::recursive_query
For the above mentioned pattern based optimization the PR defines a
LogicalPlanPattern
enum that contains all possible node kinds of a logical plan:A bitset of
LogicalPlanPattern
enum is added to theLogicalPlanStats
struct to reflect the content of the node's subtree. The implementation could use any kind of bitset, but https://docs.rs/enumset/latest/enumset/ looks like a good candidate.For example here are a few
Expr
item constructors:While maintaining the bitset during tree transformations comes with some costs, with the bitset we can speed up
LogicalPlan
andExpr
traversals significantly. For example if we have a traversal that does something withExpr::BinaryExpr
nodes only:then we can check the presence of
Expr::BinaryExpr
in a subtree and simply skip traversing subtrees without theLogicalPlanPattern::ExprBinaryExpr
pattern:I modified some of the traversal functions in this PR to demonstrate that the optimization brings significant performance improvement to
sql_planner
:Most likely further improvements could be achievable by using the new patterns in more traversals, but let's leave it to follow-up PRs.
To sum up this PR contains 3 initial commits:
Expr::Wildcard
to incorporate its fields to a namedWildcard
struct. This makesExpr::Wildcard
similar to otherExpr
items.LogicalPlanStats
container to allLogicalPlan
andExpr
enum items.LogicalPlanStats
is just an empty struct in this commit and all the changes are mechanical.LogicalPlanStats
and adjusts some of the logical plan traversals to utilize it.Are these changes tested?
Yes, with existing tests.
Are there any user-facing changes?
No.