-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feat] Support cache ListFiles result cache in session level #7620
Conversation
opt: &ListingOptions, | ||
) -> ListingTable { | ||
let schema = opt.infer_schema(state1, table_path).await.unwrap(); | ||
let schema = opt | ||
.infer_schema(&SessionState::default(), table_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here infer_schema will call list_files as well, so create a new Session state here
@@ -94,9 +94,69 @@ impl CacheAccessor<Path, Arc<Statistics>> for DefaultFileStatisticsCache { | |||
} | |||
} | |||
|
|||
/// Collected files metadata for listing files. | |||
/// Cache will not invalided until user call remove or clear. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here keep simple, user can use their own notify system to decide call remove
or clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me @Ted-Jiang -- thank you. The only thing I think needs to be addressed prior to merge is the addition of SessionState::default
@@ -1405,6 +1405,15 @@ pub fn default_session_builder(config: SessionConfig) -> SessionState { | |||
SessionState::with_config_rt(config, Arc::new(RuntimeEnv::default())) | |||
} | |||
|
|||
impl Default for SessionState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am worried that providing a Default for SessionState might allow subtle bugs, specifically accidentally running queries or other operations without properly threading through the configuration or runtime. I believe that is why there is no SessionState::new()
or Default
🤔 this context would be nice to add as a comment to SessonState
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it there are many options to set in SessionState
, i just want to write less code when construct SessionState see there are already default in SessionConfig
and RuntimeEnv
😂 , it make sense let user point what conf they want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should remove the Default
impl and instead have a function for tests (test_session_state()
or similar) to avoid misuse of this API.
Here is a PR to add some comments about the rationale for not having a Default
impl: #7654
I was thinking a little about the cache APIs here -- at the moment these are all related to This is not a suggested change for this PR, just something I was thinking about. |
Thanks for your kindly review @alamb ! Yes the cache method here are all in As this is cache , it need share the cache result in some place, i think there is only btw i see other system use global value to store global cache, is there a way datafusion use this method |
The way IOx does this is to setup a I think you can do something similar with the caches in this PR |
Are there any advantages to caching file listing in DataFusion instead of inside the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Ted-Jiang -- I think we should remove the Default
impl for SessionState
from this PR
@@ -1405,6 +1405,15 @@ pub fn default_session_builder(config: SessionConfig) -> SessionState { | |||
SessionState::with_config_rt(config, Arc::new(RuntimeEnv::default())) | |||
} | |||
|
|||
impl Default for SessionState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should remove the Default
impl and instead have a function for tests (test_session_state()
or similar) to avoid misuse of this API.
Here is a PR to add some comments about the rationale for not having a Default
impl: #7654
This is an excellent point @suremarc -- what do you think @Ted-Jiang of making a Caching |
Converting to draft as it is not waiting on a review -- we are now working through comments |
Arc::new(memory) | ||
( | ||
Arc::new(memory), | ||
SessionState::with_config_rt( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the default of SeesionState
@alamb @suremarc Sorry for the delay.
As i have check the some remote store client like |
@alamb could you take a look plz |
I agree it would be more difficult, having implemented some caching of my own recently. A full implementation would require dealing with invalidation/consistency, particularly for update operations, as well as supporting nontrivial APIs like On that note, I'd like to point out that this cache implementation won't avoid hitting object storage if your table has partition columns, because in that case DataFusion will instead call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Ted-Jiang and @suremarc
Do you have any performance measurements you can shar @Ted-Jiang about how much this feature increases performance for your usecase?
Co-authored-by: Andrew Lamb <[email protected]>
We try to fix the situation when calling remote storage list file statistics sometimes are not stable:
we saw half time cost in create physical plan 🤣 (btw we pass logic plan LIST_TABLE to datafusion in out front end (written in JAVA)) after enable cache we fix this problem the query time keep stable. btw we have fix the consistency in our java side, because out sys build kind of materialized view, if we we change the source data the path as the cache key also change.
@suremarc thanks for point this out 👍 i will figure out this later. |
Thanks again @Ted-Jiang and @suremarc |
Which issue does this PR close?
Closes #7618.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?