-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract catalog API to separate crate, change TableProvider::scan
to take a trait rather than SessionState
#11516
Conversation
This tries to implement the idea expressed in #10782 (comment) |
d5f5a77
to
5bbf5d7
Compare
8ad5111
to
0430691
Compare
0430691
to
6c399c2
Compare
Eventually all green. @jayzhan211 @alamb @comphead you may want to take a look |
I think |
@jayzhan211 admittedly i made some choices about what to put into CatalogSession (or however we want to call this) and what to keep outside requiring a downcast to SessionState. Of course, these choices remain pretty arbitrary. there may be things that CatalogSession should have but doesn't have yet, and also things that it got, but should be left out. |
I think this is similar to minimum dependencies what I'm thinking. The only difference is that I propose IMO struct is more sutiable for storing data, unlike trait that mainly used for sharing same behavior Given the dependency of TableProvider I found, we might need to find out the smaller context for FileFormatFactory and QueryPlanner. trait TableProvider {
async fn scan(
&self,
context: &TableProviderContext,
projection: Option<&Vec<usize>>,
filters: &[Expr],
limit: Option<usize>,
) -> Result<Arc<dyn ExecutionPlan>>;
}
struct TableProviderContext {
FileFormatFactory
SessionConfig
ExectuionProps
RuntimeEnv
QueryPlanner
}
pub trait QueryPlanner {
/// Given a `LogicalPlan`, create an [`ExecutionPlan`] suitable for execution
async fn create_physical_plan(
&self,
logical_plan: &LogicalPlan,
context: &QueryPlannerContext,
) -> Result<Arc<dyn ExecutionPlan>>;
}
pub trait FileFormatFactory: Sync + Send + GetExt {
/// Initialize a [FileFormat] and configure based on session and command level options
fn create(
&self,
context: &FileFormatContext,
format_options: &HashMap<String, String>,
) -> Result<Arc<dyn FileFormat>>;
/// Initialize a [FileFormat] with all options set to default values
fn default(&self) -> Arc<dyn FileFormat>;
} But before that, we need to ensure those contexts won't end up the as the same thing in the long term |
A publicly constructible struct is indeed different than a trait. Not sure we want that though, as it would limit evolvability. IMO a bigger question is what functionality does this new being provide. cc @alamb it would be good to hear your opinion, especially that you created the #10782 issue, which this PR is in response to. |
Given the current QueryPlanner is much complex, there is
|
At runtime there are circular dependencies, unless we further limit amount of functionality available to the |
I doubt that |
I think there is clear dependency something close to // queryplanner crate
trait QueryPlanner {}
impl QueryPlanner for DefaultPlanner {}
// fileformat crate
trait FileFormat {}
impl FileFormat for DefaultFile {
// FileFormatContext that has all the higher level concept things above FileFormat
fn scan(&self, FileFormatContext) -> Arc<QueryPlanner> { // able to downcast to planner }
}
// Table crate
trait TableProvider {}
impl TableProvider for DefaultTable {
// arguments are all higher level trait without circular dependency
fn get_file(&self, plan) -> Arc<QueryPlanner> { // able to downcast to actual planner }
fn get_planner(&self, fileformat) -> Arc<FileFormat> { // able to downcast to actual format}
// TableProviderContext that has all the higher level concept things above TableProvider
fn scan(&self, TableProviderContext)
} Here is one example that I think we should fixed. We find the specific table from LogicalPlan::Dml(DmlStatement {
table_name,
op: WriteOp::InsertInto,
..
}) => {
let name = table_name.table();
let schema = session_state.schema_for_ref(table_name.clone())?;
if let Some(provider) = schema.table(name).await? {
let input_exec = children.one()?;
provider
.insert_into(session_state, input_exec, false)
.await?
} else {
return exec_err!("Table '{table_name}' does not exist");
}
} Instead, I think we should provider
Not sure if this idea works over all the places 🤔 |
I plan to check this PR out carefully tomorrow |
TableProvider::scan
to take a trait rather than SessionState
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @findepi and @jayzhan211
I think this API is quite clever and solves a long standing problem (circular dependency between Catalog
-> TableProvider
-> SessionState
). Like many elegant solutions, once you see it the design is obvious but it was not before seeing.
I also think this API makes the usecase described in #11193 from @cisaacson (providing the same SessionState information via more APIs) straightforward.
I think using a trait
(rather than a struct
) to break the dependency follows the same pattern used elsewhere in DataFusion (e.g. ScalarUdfImpl
)
Next Steps
First I would like to hear what @jayzhan211 thinks
If he thinks we should proceed, since this PR proposes to change a very widely used API in DataFusion I think some extra communication is warranted before we merge:
- mark this PR as
api-change
and change the title to higlight the change toTableProvider
- communicate this proposal over the mailing list / slack / discord to maximize the chance anyone with feedback to share has a chance to
- Make sure it is left open for several days for feedback
- Test this change with some downstream users (e.g. I can test this with InfluxDB 3.0 and see if the migration to the new API goes smoothly
datafusion/catalog/src/table.rs
Outdated
/// to return as a final result. | ||
async fn scan( | ||
&self, | ||
state: &dyn CatalogSession, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one line I think is the biggest potential change in this PR as it will require changing all existing TableProviders which is one of the oldest and most widely used APIs in DataFusion
If we are ok with this change, I think the rest of this PR is wonderful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked in the Spice AI codebase, and this will be a relatively minor refactoring for us. I am ok with this change 👍
Yes I agree with this
I also think this sounds like a reasonable plan - and I think it becomes easier as we split the logic out and defined the API via traits |
BTW I agree with most of @jayzhan211 's ideas above, but in my mind they are follow ons / orthogonal to moving the catalog traits out of the core (as in we can do both!) |
I am feeling even better about this PR. I propose we plan to merge it in 2 days time (Friday) unless there are objections or other comments to address. Thank you @findepi @Omega359 @cisaacson @jayzhan211 and @phillipleblanc for your help so far |
I slightly prefer However, I don't feel super strongly and would be interested in opinions from others if we want to have a |
re: SessionProvider i can change, this is not a big deal. @alamb @jayzhan211 @cisaacson @phillipleblanc @Omega359 let's just make sure we have agreement to avoid unnecessary back and forth. |
Oh no, I broke the doc test -- sorry @findepi |
64e6c20
to
9210b48
Compare
rebased on current main, no other changes |
I think the failure is due to new rust version (if you run Not related to the changes in this PR |
it's now fixed and the build is green, but there is a new conflict. will rebase |
This moves `CatalogProvider`, `TableProvider`, `SchemaProvider` to a new `datafusion-catalog` crate. The circular dependency between core `SessionState` and implementations is broken up by introducing `CatalogSession` dyn trait. Implementations of `TableProvider` that reside under core current have access to `CatalogSession` by downcasting. This is supposed to be an intermediate step.
a4d282a
to
dbb8905
Compare
( squashed and rebased, no other changes ) |
Let's do it. go go go 🚀 Thanks again everryone for your comments and help. I think this PR finally breaks open the path to separate out the last monolithic knot |
🎉 thanks for all the review feedback and the merge! |
Are we tracking this TODO?
|
@berkaysynnada yes, #11600 |
I suggest adding a comment in the code with the link to the ticket to help others find this ticket |
good idea! #11784 |
Catlog API was extracted to a separate crate. Ref: apache/datafusion#11516
Looks like this PR didn't get tagged with |
Sorry -- thanks for the heads up @phillipleblanc |
Catlog API was extracted to a separate crate. Ref: apache/datafusion#11516
* update datafusion deps to point to githuc.com/apache/datafusion Datafusion 41 is not yet released on crates.io. * update TableProvider::scan Ref: apache/datafusion#11516 * use SessionStateBuilder The old constructor is deprecated. Ref: apache/datafusion#11403 * update AggregateFunction Upstream Changes: - The field name was switched from `func_name` to func. - AggregateFunctionDefinition was removed Ref: apache/datafusion#11803 * update imports in catalog Catlog API was extracted to a separate crate. Ref: apache/datafusion#11516 * use appropriate path for approx_distinct Ref: apache/datafusion#11644 * migrate AggregateExt to ExprFunctionExt Also removed `sqlparser` dependency since it's re-exported upstream. Ref: apache/datafusion#11550 * update regr_count tests for new return type Ref: apache/datafusion#11731 * migrate from function-array to functions-nested The package was renamed upstream. Ref: apache/datafusion#11602 * cargo fmt * lock datafusion deps to 41 * remove todo from cargo.toml All the datafusion dependencies are re-exported, but I still need to figure out *why*.
This moves
CatalogProvider
,TableProvider
,SchemaProvider
to a newdatafusion-catalog
crate. The circular dependency between coreSessionState
and implementations is broken up by introducingCatalogSession
dyn trait. Implementations ofTableProvider
that reside under core current have access toCatalogSession
by downcasting. This is supposed to be an intermediate step.Part of #10782
Relates to #11420