Support User Defined Table Function #8306
@alamb PTAL, a draft implementation of a user-defined table function :)
Thank you @Veeupup -- I am very excited to check this out. I plan to do so tomorrow if possible (though it is a holiday in the US so it may not be until Friday or the weekend)
Ok, I couldn't help myself and took a quick look now @Veeupup -- this is really cool, thank you. Also thank you for putting up a draft PR for early feedback.
I left some comments, let me know what you think.
Thanks again! 🏆
}

/// Get the function implementation and generate a table
pub fn create_table_provider(
I was thinking we would allow users to provide their own TableProvider directly (rather than a PartitionStream), as this permits them to implement predicate / projection / limit pushdown, among other things.
I like how you have made a simple version in the example -- maybe you can just move the creation of StreamingTable::try_new() into datafusion-examples/examples/simple_udtf.rs 🤔
I am also thinking about how to support TableFunctions that take streams as input: #7926 (comment). Any ideas on this?
I was thinking we would allow users to provide their own TableProvider directly (rather than a PartitionStream) as this permits them to implement predicate / projection / limit pushdown, among other things
Sure! I agree with you; just implementing their own TableProvider seems more reasonable. I have changed the function signature.
I am also thinking about how to support TableFunctions that take streams as input:
Maybe users can implement PartitionStream and use an inner StreamingTable to take streams as input? See the example.
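To make the pushdown point concrete, here is a minimal std-only sketch. Everything in it (`PushdownProvider`, `PowersTable`, the `scan` signature) is a hypothetical stand-in rather than DataFusion's `TableProvider`. It only shows why a provider that is handed the projection and limit can do less work than a plain stream of fully materialized batches.

```rust
// Toy stand-ins for illustration only; none of these are DataFusion types.

/// A hypothetical, much-reduced provider interface: the engine passes an
/// optional projection (column indices) and an optional row limit.
trait PushdownProvider {
    fn scan(&self, projection: Option<&[usize]>, limit: Option<usize>) -> Vec<Vec<i64>>;
}

/// A table that generates `rows` rows of (n, n^2, n^3) on demand.
struct PowersTable {
    rows: usize,
}

impl PushdownProvider for PowersTable {
    fn scan(&self, projection: Option<&[usize]>, limit: Option<usize>) -> Vec<Vec<i64>> {
        // Limit pushdown: never generate more rows than the query asked for.
        let n_rows = limit.map_or(self.rows, |l| l.min(self.rows));
        (0..n_rows as i64)
            .map(|n| -> Vec<i64> {
                let full = [n, n * n, n * n * n];
                match projection {
                    // Projection pushdown: return only the requested columns.
                    // (Indices are assumed valid, i.e. less than 3.)
                    Some(cols) => cols.iter().map(|&c| full[c]).collect(),
                    None => full.to_vec(),
                }
            })
            .collect()
    }
}
```

Calling `PowersTable { rows: 5 }.scan(Some(&[0usize, 2][..]), Some(3))` produces three rows with two columns each, instead of five rows with three columns that the engine would then have to trim.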
/// Get the function implementation and generate a table
pub fn create_table_provider(
    &self,
    args: &[String],
Instead of taking strings, what would you think about taking Expr (which would allow for more sophisticated things like passing in now(), as suggested by @yukkit in #7926 (comment))?
Love this idea too! I'll try to figure out if we can find a way to evaluate Expr correctly!
@alamb PTAL, I have made Exprs the provider inputs and it looks just fine!
But we still have some problems to solve; you can check the TODO in the latest commit.
- How can we evaluate Expr more simply?
  - For now, I left it in fn scan to construct Expr -> LogicalPlan -> PhysicalPlan and evaluate it then, but I think we could make it simpler or evaluate the Exprs elsewhere.
how to evaluate expr simpler?
You can use ExprSimplifier::simplify to do this.
Here is an example: https://github.com/apache/arrow-datafusion/blob/06bbe1298fa8aa042b6a6462e55b2890969d884a/datafusion-examples/examples/expr_api.rs#L56
I added a commit to Veeupup#1 showing how to make it work: Veeupup@2df01db
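For readers unfamiliar with the idea, the sketch below shows what simplifying an argument before reading it amounts to. It does not use DataFusion: `ToyExpr`, `simplify`, and `literal_arg` are hypothetical names invented for this illustration. The real `ExprSimplifier` covers far more, for example function calls such as now(). The point is only that constant sub-expressions fold down to a literal, which the table function can then read directly instead of building a plan to evaluate them.

```rust
// Toy expression type for illustration only; this is not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum ToyExpr {
    Literal(i64),
    Column(String),
    Add(Box<ToyExpr>, Box<ToyExpr>),
    Mul(Box<ToyExpr>, Box<ToyExpr>),
}

/// Fold constant sub-expressions bottom-up. Anything that depends on a
/// column is left in place, because it has no value at planning time.
fn simplify(expr: ToyExpr) -> ToyExpr {
    use ToyExpr::*;
    match expr {
        Add(l, r) => match (simplify(*l), simplify(*r)) {
            (Literal(a), Literal(b)) => Literal(a + b),
            (l, r) => Add(Box::new(l), Box::new(r)),
        },
        Mul(l, r) => match (simplify(*l), simplify(*r)) {
            (Literal(a), Literal(b)) => Literal(a * b),
            (l, r) => Mul(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}

/// What a table function might do with an argument: simplify it, accept a
/// literal, and reject anything that is still not a constant.
fn literal_arg(expr: ToyExpr) -> Result<i64, String> {
    match simplify(expr) {
        ToyExpr::Literal(v) => Ok(v),
        other => Err(format!("argument is not a constant: {:?}", other)),
    }
}
```

With this shape, an argument written as `1 + 2 * 3` reaches the function as the literal 7, while a bare column reference is rejected with an error rather than silently misread.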
Love this idea too! I'll try to figure out if we can find a way to evaluate Expr correctly!
BTW, since I agree evaluating expressions is nontrivial, I added some additional documentation and examples of how to do so here: #8377
Happy Thanksgiving!! Enjoy your holiday! 😺 I'll check out your constructive comments later today when I'm free :)
26ddf6a to 050ab9d
Signed-off-by: veeupup <[email protected]>
050ab9d to 32ecce4
cc @yukkit hi, this PR is almost ready for review, comments are welcome!
Thank you @Veeupup -- I think this is looking very close. (Another) really nice PR 🦾
What I think is needed prior to merging this PR:
- Gather any more feedback on the API
- Add tests (in addition to the example) in https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/user_defined (the example outputs are not validated)
I also added some suggestions on how to make the code simpler / better but I don't think they are strictly necessary
}

/// Get the function implementation and generate a table
pub fn create_table_provider(&self, args: &[Expr]) -> Result<Arc<dyn TableProvider>> {
👍
/// A trait for table function implementations
pub trait TableFunctionImpl: Sync + Send {
    /// Create a table provider
    fn call(&self, args: &[Expr]) -> Result<Arc<dyn TableProvider>>;
This API is nice and is as specified in #7926. I think it will work for using a table function as a relation in the query (aka like a table with parameters).
The one thing I don't think this API supports is TableFunctions that take other arguments (aka that are fed the result of a table / can use the value of correlated subqueries), as mentioned by @yukkit and @Jesse-Bakker in #7926 (comment).
I can think of two options:
1. Leave this API as is, and add a follow-on / new API somehow to support that use case
2. Try to extend this API somehow to support table inputs
I personally prefer 1 as I think it offers several additional use cases, even though it doesn't cover "take a table input".
Any other thoughts?
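As a rough picture of the "table with parameters" case that this API covers, here is a std-only sketch. The trait mirrors the shape of the `call` signature quoted above. Everything else is a hypothetical stand-in and not DataFusion code: `ToyArg`, `ToyProvider`, `RangeFunc`, `Registry`, and the use of `String` as the error type.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Toy stand-ins for illustration only; the real `Expr` and `TableProvider`
// are DataFusion types and are far richer than this.
#[derive(Debug, Clone, PartialEq)]
enum ToyArg {
    Int(i64),
    Str(String),
}

trait ToyProvider: Send + Sync {
    fn rows(&self) -> Vec<i64>;
}

/// Same shape as the trait quoted above: arguments in, provider out.
trait ToyTableFunctionImpl: Send + Sync {
    fn call(&self, args: &[ToyArg]) -> Result<Arc<dyn ToyProvider>, String>;
}

struct RangeTable {
    start: i64,
    end: i64,
}

impl ToyProvider for RangeTable {
    fn rows(&self) -> Vec<i64> {
        (self.start..self.end).collect()
    }
}

/// Hypothetical `range(start, end)` table function.
struct RangeFunc;

impl ToyTableFunctionImpl for RangeFunc {
    fn call(&self, args: &[ToyArg]) -> Result<Arc<dyn ToyProvider>, String> {
        match args {
            [ToyArg::Int(start), ToyArg::Int(end)] => Ok(Arc::new(RangeTable {
                start: *start,
                end: *end,
            })),
            _ => Err("range expects two integer arguments".to_string()),
        }
    }
}

/// Hypothetical registry: resolves a function name used in a FROM clause
/// to the provider returned by the registered implementation.
#[derive(Default)]
struct Registry {
    funcs: HashMap<String, Arc<dyn ToyTableFunctionImpl>>,
}

impl Registry {
    fn register(&mut self, name: &str, f: Arc<dyn ToyTableFunctionImpl>) {
        self.funcs.insert(name.to_string(), f);
    }

    fn resolve(&self, name: &str, args: &[ToyArg]) -> Result<Arc<dyn ToyProvider>, String> {
        let f = self
            .funcs
            .get(name)
            .ok_or_else(|| format!("unknown table function: {}", name))?;
        f.call(args)
    }
}
```

Note that `call` only ever sees the arguments written at the call site. Nothing in this shape lets the function receive rows from another relation, which is the gap discussed above.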
One specific use case for table-valued arguments to table-valued functions is windowing TVFs, like those in Apache Flink.
An example that cannot be expressed by taking Expr arguments (maybe if Expr::Row() is added?):
    SELECT window_start, window_end, SUM(price)
    FROM TABLE(
        TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
    GROUP BY window_start, window_end;
That can also be emulated, however, using something like:
    SELECT window_start, window_end, SUM(price)
    FROM Bid,
        TUMBLE(Bid.bidtime, INTERVAL '10' MINUTES)
    GROUP BY window_start, window_end;
which doesn't need table-valued arguments (but does need to resolve Expr::Column(name=bidtime); I'm not sure if the current API can do that).
Anyway, the current API is nice, and definitely very useful 👍
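For anyone not familiar with tumbling windows, this is roughly what the TUMBLE queries above compute, written as a std-only sketch. The function names and the `(bidtime_seconds, price)` row layout are assumptions made for illustration. This is neither Flink's nor DataFusion's implementation.

```rust
use std::collections::BTreeMap;

/// Start and exclusive end of the tumbling window of width `size` that
/// contains `ts`. `rem_euclid` keeps windows aligned for negative
/// timestamps too. `size` is assumed to be positive.
fn tumble(ts: i64, size: i64) -> (i64, i64) {
    let start = ts - ts.rem_euclid(size);
    (start, start + size)
}

/// Rough equivalent of `SUM(price) ... GROUP BY window_start, window_end`
/// over hypothetical `(bidtime_seconds, price)` rows.
fn sum_by_window(bids: &[(i64, i64)], size: i64) -> BTreeMap<(i64, i64), i64> {
    let mut totals = BTreeMap::new();
    for &(bidtime, price) in bids {
        *totals.entry(tumble(bidtime, size)).or_insert(0) += price;
    }
    totals
}
```

The per-row `tumble` call is the part that needs a value from each input row, which is why it cannot be expressed with constant-only arguments.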
datafusion/sql/src/relation/mod.rs
Outdated
FunctionArg::Unnamed(FunctionArgExpr::Expr(expr)) => {
    let expr = self.sql_expr_to_logical_expr(
        expr,
        // TODO(veeupup): for now, maybe it's little diffcult to resolve tables' schema before create provider
When the function is used like a base relation (aka a TableFactor / item in the FROM clause) I don't think there is any schema to use to resolve the input arguments (aka this code is correct and probably shouldn't be a todo).
I don't really know how we would represent the correlated subquery case in #7926 (comment) 🤔 I think it would need some sort of analysis pass after the initial plan is built.
    }
}

// Option2: (use StreamingTable to make it simpler)
I actually think adding a second option is more confusing in this example, so I would recommend removing this part (perhaps you can just add a note that it is possible).
Signed-off-by: veeupup <[email protected]>
Signed-off-by: veeupup <[email protected]>
datafusion/core/tests/user_defined/user_defined_table_functions.rs
Outdated
Simply table function example, add some comments
we should support table function inputs with Where should we track it? Still track it in #7926, or start a new issue? I can make some trials later.
I took the liberty of merging this branch up from main. If the tests pass I'll plan to merge it. While waiting I will likely file follow-on tickets.
I filed #8383 to track follow-on work. Is there any other specific improvement that we discussed that isn't covered?
🚀
* Support User Defined Table Function
  Signed-off-by: veeupup <[email protected]>
* fix comments
  Signed-off-by: veeupup <[email protected]>
* add udtf test
  Signed-off-by: veeupup <[email protected]>
* add file header
* Simply table function example, add some comments
* Simplfy exprs
* make clippy happy
* Update datafusion/core/tests/user_defined/user_defined_table_functions.rs
---------
Signed-off-by: veeupup <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes #7926
Rationale for this change
This is a draft PR to show how to implement a user-defined table function. A lot remains to be done, but maybe it's a good start.
Such as:
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?