Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve documentation on TreeNode #10035

Merged
merged 11 commits into from
Apr 22, 2024
130 changes: 79 additions & 51 deletions datafusion/common/src/tree_node.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,8 +15,7 @@
// specific language governing permissions and limitations
// under the License.

//! This module provides common traits for visiting or rewriting tree
//! data structures easily.
//! [`TreeNode`] for visiting and rewriting expression and plan trees

use std::sync::Arc;

Expand All @@ -43,17 +42,32 @@ macro_rules! handle_transform_recursion_up {
}};
}

/// Defines a visitable and rewriteable tree node. This trait is implemented
/// for plans ([`ExecutionPlan`] and [`LogicalPlan`]) as well as expression
/// trees ([`PhysicalExpr`], [`Expr`]) in DataFusion.
/// API for visiting (read only) and rewriting tree data structures .
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just tried to explain how this API works better.

///
/// Note: this trait is implemented for plans ([`ExecutionPlan`] and
/// [`LogicalPlan`]) as well as expression trees ([`PhysicalExpr`], [`Expr`]).
///
/// # Common APIs:
/// * [`Self::apply`] and [`Self::visit`] for inspecting (without modification) borrowed nodes
/// * [`Self::transform`] and [`Self::rewrite`] for rewriting owned nodes
///
/// # Terminology
/// The following terms are used in this trait
///
/// * `f_down`: Invoked before any children of the current node are visited.
/// * `f_up`: Invoked after all children of the current node are visited.
/// * `f`: closure that is applied to the current node.
/// * `map_*`: applies a transformation to rewrite owned nodes
/// * `apply_*`: invokes a function on borrowed nodes
/// * `transform_`: applies a transformation to rewrite owned nodes
Comment on lines +83 to +85
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im probably missing it but does it make sense to add more details to distinguish map_* and transform_ here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't understand the difference between map and transform here. Is the former creating a new node and the latter mutating an existing node?

Copy link
Contributor

@wiedld wiedld Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a shot at consolidating the terminology. It's probably wrong, but hopefully it will help the conversation. 🙈 (edited)

API action traversal ref actor
apply traverse, no rewrite top-down, pre-order &self FnMut
transform traverse & rewrite suffix dependent self suffix dependent
- - *_up suffix = bottom-up, post-order - -
- - *_down suffix = top-down, pre-order - -
- - *_up_down suffix = does both, see docs - -
- - - - Fn
- - - - *_mut suffix = FnMut
visit visit, no rewrite depth-first † &self dyn TreeNodeVisitor
rewrite visit & rewrite depth-first † &self dyn TreeNodeRewrite
apply_children inspect children, no rewrite not traversal per se‡ &self Fn
map_children inspect children & rewrite in place not traversal per se‡ self FnMut

† The specific implementations of the visitors (TreeNodeVisitor or TreeNodeRewrite) can impact the traversal patterns.
‡ I think the implementations decide how to handle any travsersal (a.k.a. does the mapping of children inspect it's own children). But I'm not really clear on this one.

Caveat (edited): I didn't attempt to generalize the map_* terminology (vs transform) as there are several other use cases across intersecting code paths (and traits), which I haven't dug into yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above table is not quite right and misses a few things. But, let me comment on this ticket tomorrow when I'm back home and have better github access.

Copy link
Contributor

@peter-toth peter-toth Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous efforts (especially #8891 but others too: #9876, #9913, #9965, #9999) consolidated the behaviour of existing APIs (consistent TreeNodeRecursion behaviour and common Transformed return types of transforming APIs) but didn't try to normalize naming of the APIs in favour of causing the least incompatibility with previous DataFusion versions. So this means that the current API names evolved organically and thus they might not be intuitive in all cases and are a bit inconsistent here and there...

Basically, there are 2 kinds of TreeNode APIs:

  1. "Inspecting" APIs to taverse over a tree and inspect &TreeNodes.
  2. "Transforming" APIs to traverse over a tree while consuming TreeNodes and produce possibly changed TreeNodes
    (i.e. there is no in-place mutation API, that uses &mut TreeNode, but we are improving the consume and produce APIs to cleate less new objects unnecessarly e.g.: Avoid LogicalPlan::clone() in LogicalPlan::map_children when possible #9999)

There are only 2 inspecting TreeNode APIs:

  • TreeNode::apply():
    Top-down (pre-order) traverses the nodes and applies the povided f closure on the nodes.
    • As TreeNode trees are hierarchical collections of nodes, something like foreach() would better describe the operation.
    • Because the API name doesn't explicitely defines the traversal direction IMO its docs shouldn't say anything about its implementation just the fact that it should be used for order agnostic inspection of nodes.
    • Its transforming counterpart should be TreeNode::transform(). But TreeNode::transform() is an alias for TreeNode::transform_up() so it does a bottom-up (post-order) traversal.
  • TreeNode::visit():
    Requires a TreeNodeVisitor that incorporates 2 methods to do a combined traversal. TreeNodeVisitor::f_down() is called in top-down order, TreeNodeVisitor::f_up() is called in bottom-up order on the nodes in one combined traversal.
    • As there is no apply_up() exists, visit() can be used to do a bottom-up inspecting traversal (TreeNodeVisitor::f_down() can remain its default implementation, that does nothing).
    • Its transforming counterpart is TreeNode::rewrite().

There are 7 transforming TreeNode APIs:

  • TreeNode::transform():
    An alias for TreeNode::transform_up() so it does bottom-up traversal on the nodes. Consumes them with the provided f closure and produces a possibly transformed node.
    • Should be the transforming counterpart of apply() but this runs bottom-up for some reason...
    • IMO its docs shouldn't mention traversal direction and probably should be used for order agnostic transformation of nodes.
  • TreeNode::transform_down() and TreeNode::transform_down_mut():
    These 2 are top-down traverse the nodes, consume them with the provided f closure and produce a possibly transformed node.
    • I have no idea why we have the ..._mut() version, most likely for just convenience, but IMO one transform_down() with FnMut should be fair enough.
  • TreeNode::transform_up() and TreeNode::transform_up_mut():
    These 2 are the bottom-up versions of the above.
    • I have no idea why we have the ..._mut() version.
    • Don't have an inspecting counterpart
  • TreeNode::transform_down_up():
    This API accepts 2 closures f_down and f_up. f_down is called in top-down order and f_up is called in bottom-up order on the nodes in one combined traversal.
    • Doesn't have an inspecting counterpart
  • TreeNode::rewrite():
    Requires a TreeNodeRewriter that incorporates 2 methods to do a combined traversal. TreeNodeRewriter::f_down() is called in top-down order, TreeNodeRewriter::f_up() is called in bottom-up order on the nodes in one combined traversal.
    • This API is the transforming counterpart of TreeNode::visit()
    • If there is no mutable shared data between TreeNodeRewriter::f_down() and TreeNodeRewriter::f_up() then it is easier to use TreeNode::transform_down_up() that accepts f_down and f_up closures directly and no need to define a TreeNodeRewriter instance.

Additional TreeNode APIs:

  • TreeNode::apply_children() and TreeNode::map_children():
    Building blocks for the above 2 + 7 TreeNode APIs. apply_children() inspects, map_children() transforms the children of a TreeNode node with the provided f closure.
    • TreeNode implementations (e.g. LogicalPlan, Expr, Arc<dyn ExecutionPlan>, Arc<dyn PhysicalExpr>, ...) have to implement apply_children() for inspecting and map_children() for transforming APIs.
    • These 2 methods are public, but they should almost never be called explicitely by the TreeNode API users. If, for some reason, the above 2 + 7 APIs don't cover the recursion needed then these 2 methods can be used to define a custom algorithms.
    • The apply_children() name is aligned with apply(), but have no idea why we call the other map_children(), transform_children() would probably be a better fit...

Besides the above general TreeNode APIs we have some LogicalPlan APIs:

  • LogicalPlan::..._with_subqueries():
    All the above 2 + 7 TreeNode APIs have a LogicalPlan::..._with_subqueries() version that do the same as the original TreeNode API but also recurse into subqueries (same type inner LogicalPlan trees that are defined in the expressions of the original LogicalPlan tree nodes).
    • We could generalize this concept into TreeNode but LogicalPlan seems to be only TreeNode type that makes use of these APIs.

Additional LogicalPlan APIs:

  • LogicalPlan::apply_expressions(), LogicalPlan::map_expressions(), LogicalPlan::apply_subqueries() and LogicalPlan::map_subqueries():
    Building blocks for LogicalPlan::..._with_subqueries() APIs. apply_expressions() and apply_subqueries() inspect, map_expressions() and map_subqueries() transform the expressions/subqueries of a LogicalPlan node with the provided f closure.
    • These 4 are similar to TreeNode::apply_children() and TreeNode::map_children() and so they are rarely called directly but can be used to define a custom algorithms.

Here is the summary of the current situation:

TreeNode APIs:

traversal order inspecting transforming
order agnostic transform()
top-down apply() transform_down(), transform_down_mut()
bottom-up transform_up(), transform_up_mut()
combined with separete f_down and f_up closures transform_down_up()
combined with incorporated f_down() and f_up() methods into an object visit() + TreeNodeVisitor rewrite() + TreeNodeRewriter

Additional TreeNode APIs: apply_children(), map_children().

LogicalPlan APIs:

traversal order inspecting transforming
order agnostic transform_with_subqueries()
top-down apply_with_subqueries() transform_down_with_subqueries(), transform_down_mut_with_subqueries()
bottom-up transform_up_with_subqueries(), transform_up_mut_with_subqueries()
combined with separete f_down and f_up closures transform_down_up_with_subqueries()
combined with incorporated f_down() and f_up() methods into an object visit_with_subqueries() + TreeNodeVisitor rewrite_with_subqueries() + TreeNodeRewriter

Additional LogicalPlan APIs: apply_expressions(), map_expressions(), apply_subqueries(), map_subqueries()

Copy link
Contributor

@peter-toth peter-toth Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, if we wanted to consolidate the TreeNode API terminonlogy I would suggest the following:

traversal order inspecting transforming
order agnostic for_each() (new API, an alias to for_each_up(), but its docs shouldn't specify any order) transform() (remains an alias to transform_up() but its docs shouldn't specify any order)
top-down for_each_down() (new API), apply() (deprecated, from now just an alias to for_each_down()) transform_down() (changes to f: &mut F where F: FnMut(Self) -> Result<Transformed<Self>>, which is a breaking change but requires a trivial fix from the users), transform_down_mut() (deprecated, from now just an alias to transform_down())
bottom-up for_each_up() (new API) transform_up() (changes to f: &mut F where F: FnMut(Self) -> Result<Transformed<Self>>, which is a breaking change but requires a trivial fix from the users), transform_up_mut() (deprecated, from now just an alias to transform_up())
combined with separete f_down and f_up closures transform_down_up()
combined with incorporated f_down() and f_up() methods into an object visit() + TreeNodeVisitor rewrite() + TreeNodeRewriter

+ Maybe rename apply_children() to for_each_children() and map_children() to transform_children(). Although renaming these would be a breaking change if something that is built on DataFusion defines a new kind of TreeNode, but fixing them would also be trivial.

+ LogicalPlan::..._with_subqueries() APIs should be alligned with the above TreeNode ones, but as they are fairly new (#9913) they haven't landed in any release yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this information Peter. Amazing and mind-blowing

I think there are two threads to pursue here. First what currently exists and second what we can do to improve the situation.

I will attempt to reformat this information about how things currently work into this pr. Then I will see if there is anything we can pull out into follow on tasks (eg remove the _mut variants)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two threads to pursue here. First what currently exists and second what we can do to improve the situation.

I will attempt to reformat this information about how things currently work into this pr. Then I will see if there is anything we can pull out into follow on tasks (eg remove the _mut variants)

Yes, I totally agree with you. It is more important to document what we already have. I just wanted to give you some details about the remaining inconsistency in API naming. If we pursue to fix it, I'm happy to open a new PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, update here:

  1. I incorporated the information (and table!) from @wiedld and @peter-toth in this PR's documentation
  2. Added the suggestion for consolidated terminology to Epic: Unified TreeNode rewrite API #8913 (comment)
  3. Filed Deprecate TreeNode transform_xx_mut methods #10097 to clean up the transform APIs

///
/// <!-- Since these are in the datafusion-common crate, can't use intra doc links) -->
/// [`ExecutionPlan`]: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html
/// [`PhysicalExpr`]: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.PhysicalExpr.html
/// [`LogicalPlan`]: https://docs.rs/datafusion-expr/latest/datafusion_expr/logical_plan/enum.LogicalPlan.html
/// [`Expr`]: https://docs.rs/datafusion-expr/latest/datafusion_expr/expr/enum.Expr.html
pub trait TreeNode: Sized {
/// Visit the tree node using the given [`TreeNodeVisitor`], performing a
/// Visit the tree node with a [`TreeNodeVisitor`], performing a
/// depth-first walk of the node and its children.
///
/// See also:
Expand Down Expand Up @@ -93,8 +107,8 @@ pub trait TreeNode: Sized {
.visit_parent(|| visitor.f_up(self))
}

/// Implements the [visitor pattern](https://en.wikipedia.org/wiki/Visitor_pattern) for
/// recursively transforming [`TreeNode`]s.
/// Rewrite the tree node with a [`TreeNodeRewriter`], performing a
/// depth-first walk of the node and its children.
///
/// See also:
/// * [`Self::visit`] for inspecting (without modification) `TreeNode`s
Expand Down Expand Up @@ -129,79 +143,87 @@ pub trait TreeNode: Sized {
})
}

/// Applies `f` to the node and its children. `f` is applied in a pre-order
/// way, and it is controlled by [`TreeNodeRecursion`], which means result
/// of the `f` on a node can cause an early return.
/// Applies `f` to the node then each of its children, recursively (a
/// top-down, pre-order traversal).
///
/// The `f` closure can be used to collect some information from tree nodes
/// or run a check on the tree.
/// The return [`TreeNodeRecursion`] controls the recursion and can cause
/// an early return.
fn apply<F: FnMut(&Self) -> Result<TreeNodeRecursion>>(
&self,
f: &mut F,
) -> Result<TreeNodeRecursion> {
f(self)?.visit_children(|| self.apply_children(|c| c.apply(f)))
}

/// Convenience utility for writing optimizer rules: Recursively apply the
/// given function `f` to the tree in a bottom-up (post-order) fashion. When
/// `f` does not apply to a given node, it is left unchanged.
/// Recursively rewrite the node's children and then the node using `f`
/// (a bottom-up post-order traversal).
///
/// A synonym of [`Self::transform_up`]. `f` is applied to the nodes
/// children first, and then to the node itself. See [`Self::transform_down`]
/// top-down (pre-order) traversal.
///
/// Use [`TreeNodeRewriter`] for an API that supports transformations both
/// down and up.
fn transform<F: Fn(Self) -> Result<Transformed<Self>>>(
self,
f: &F,
) -> Result<Transformed<Self>> {
self.transform_up(f)
}

/// Convenience utility for writing optimizer rules: Recursively apply the
/// given function `f` to a node and then to its children (pre-order traversal).
/// When `f` does not apply to a given node, it is left unchanged.
/// Recursively rewrite the tree using `f` in a top-down (pre-order)
/// fashion.
///
/// `f` is applied to the nodes first, and then its children.
///
/// Use [`TreeNodeRewriter`] for an API that supports transformations both
/// down and up the tree.
fn transform_down<F: Fn(Self) -> Result<Transformed<Self>>>(
self,
f: &F,
) -> Result<Transformed<Self>> {
handle_transform_recursion_down!(f(self), |c| c.transform_down(f))
}

/// Convenience utility for writing optimizer rules: Recursively apply the
/// given mutable function `f` to a node and then to its children (pre-order
/// traversal). When `f` does not apply to a given node, it is left unchanged.
/// Same as [`Self::transform_down`] but with a mutable closure.
fn transform_down_mut<F: FnMut(Self) -> Result<Transformed<Self>>>(
self,
f: &mut F,
) -> Result<Transformed<Self>> {
handle_transform_recursion_down!(f(self), |c| c.transform_down_mut(f))
}

/// Convenience utility for writing optimizer rules: Recursively apply the
/// given function `f` to all children of a node, and then to the node itself
/// (post-order traversal). When `f` does not apply to a given node, it is
/// left unchanged.
/// Recursively rewrite the node using `f` in a bottom-up (post-order)
/// fashion.
///
/// `f` is applied to children first, and then to the node itself.
///
/// Use [`TreeNodeRewriter`] for an API that supports transformations both
/// down and up the tree.
fn transform_up<F: Fn(Self) -> Result<Transformed<Self>>>(
self,
f: &F,
) -> Result<Transformed<Self>> {
handle_transform_recursion_up!(self, |c| c.transform_up(f), f)
}

/// Convenience utility for writing optimizer rules: Recursively apply the
/// given mutable function `f` to all children of a node, and then to the
/// node itself (post-order traversal). When `f` does not apply to a given
/// node, it is left unchanged.
/// Same as [`Self::transform_up`] but with a mutable closure.

fn transform_up_mut<F: FnMut(Self) -> Result<Transformed<Self>>>(
self,
f: &mut F,
) -> Result<Transformed<Self>> {
handle_transform_recursion_up!(self, |c| c.transform_up_mut(f), f)
}

/// Transforms the tree using `f_down` while traversing the tree top-down
/// Transforms the node using `f_down` while traversing the tree top-down
/// (pre-order), and using `f_up` while traversing the tree bottom-up
/// (post-order).
///
/// Use this method if you want to start the `f_up` process right where `f_down` jumps.
/// This can make the whole process faster by reducing the number of `f_up` steps.
/// If you don't need this, it's just like using `transform_down_mut` followed by
/// `transform_up_mut` on the same tree.
/// `transform_up_mut` on the same node.
///
/// Consider the following tree structure:
/// ```text
Expand Down Expand Up @@ -298,7 +320,7 @@ pub trait TreeNode: Sized {
)
}

/// Returns true if `f` returns true for node in the tree.
/// Returns true if `f` returns true for any node in the tree.
///
/// Stops recursion as soon as a matching node is found
fn exists<F: FnMut(&Self) -> bool>(&self, mut f: F) -> bool {
Expand All @@ -315,16 +337,21 @@ pub trait TreeNode: Sized {
found
}

/// Apply the closure `F` to the node's children.
/// Apply `f` to inspect node's children (but not the node itself)
///
/// See [`Self::apply`] and [`Self::visit`] for APIs that include self.
///
/// See `mutate_children` for rewriting in place
/// See [`Self::map_children`] for rewriting in place
fn apply_children<F: FnMut(&Self) -> Result<TreeNodeRecursion>>(
&self,
f: F,
) -> Result<TreeNodeRecursion>;

/// Apply transform `F` to potentially rewrite the node's children. Note
/// that the transform `F` might have a direction (pre-order or post-order).
/// Apply `f` to rewrite the node's children (but not the node itself).
///
/// See [`Self::transform`] and [`Self::rewrite`] for APIs that include self
///
/// See [`Self::apply_children`] for inspecting in place
fn map_children<F: FnMut(Self) -> Result<Transformed<Self>>>(
self,
f: F,
Expand Down Expand Up @@ -436,26 +463,26 @@ impl TreeNodeRecursion {
}
}

/// This struct is used by tree transformation APIs such as
/// Result of tree walk / transformation APIs
///
/// API users control the transformation by returning:
/// - The resulting (possibly transformed) node,
/// - `transformed`: flag indicating whether any change was made to the node
/// - `tnr`: [`TreeNodeRecursion`] specifying how to proceed with the recursion.
///
/// At the end of the transformation, the return value will contain:
/// - The final (possibly transformed) tree,
/// - `transformed`: flag indicating whether any change was made to the node
/// - `tnr`: [`TreeNodeRecursion`] specifying how the recursion ended.
///
/// Example APIs:
/// - [`TreeNode`],
/// - [`TreeNode::rewrite`],
/// - [`TreeNode::transform_down`],
/// - [`TreeNode::transform_down_mut`],
/// - [`TreeNode::transform_up`],
/// - [`TreeNode::transform_up_mut`],
/// - [`TreeNode::transform_down_up`]
///
/// to control the transformation and return the transformed result.
///
/// Specifically, API users can provide transformation closures or [`TreeNodeRewriter`]
/// implementations to control the transformation by returning:
/// - The resulting (possibly transformed) node,
/// - A flag indicating whether any change was made to the node, and
/// - A flag specifying how to proceed with the recursion.
///
/// At the end of the transformation, the return value will contain:
/// - The final (possibly transformed) tree,
/// - A flag indicating whether any change was made to the tree, and
/// - A flag specifying how the recursion ended.
#[derive(PartialEq, Debug)]
pub struct Transformed<T> {
pub data: T,
Expand Down Expand Up @@ -637,6 +664,7 @@ impl<I: Iterator> TreeNodeIterator for I {

/// Transformation helper to process a heterogeneous sequence of tree node containing
/// expressions.
///
/// This macro is very similar to [TreeNodeIterator::map_until_stop_and_collect] to
/// process nodes that are siblings, but it accepts an initial transformation (`F0`) and
/// a sequence of pairs. Each pair is made of an expression (`EXPR`) and its
Expand Down
Loading