Find a way to communicate the ordering of a file back with the existi… #13933
Hi @alamb @Dandandan @TheBuilderJR, I submitted a first version of the PR that automatically detects the sort order of a sorted parquet file and uses that info to optimize the plan. It's a very basic PR; we can open follow-up issues to improve it. I have more questions and will try to create follow-up issues after this PR, for example:
This PR also seems to resolve the issue:
I haven't reviewed this PR carefully yet, but we already have mechanisms to propagate source ordering. Why do we need to add this information to |
Hi @ozankabak, thank you for the review. We can propagate source ordering with:
But when we write a parquet file with an order and the table has no order info (because we did not add WITH ORDER when creating it), we can't propagate the order to the table, so we need to write the order to the parquet metadata. Then, when we scan a table without the sort option set, we can read the metadata and use it for optimization. As for using Statistics to store the info, do you mean we should load the order from the metadata into somewhere else?
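The fallback described here can be sketched in plain Rust. This is only an illustration under stated assumptions: `SortTerms` and `resolve_ordering` are hypothetical names, not DataFusion APIs, and both the precedence rule (a declared table order wins) and the "all files must agree" rule are assumptions rather than behavior confirmed in this thread.

```rust
/// An ordering as ORDER BY-style terms, e.g. ["a ASC", "b DESC"].
/// Hypothetical type used only for this sketch.
pub type SortTerms = Vec<String>;

/// Hypothetical resolution rule (an assumption, not DataFusion's logic):
/// 1. An ordering declared on the table takes precedence.
/// 2. Otherwise, use the ordering found in file metadata, but only if
///    every file reports the same ordering.
/// 3. Otherwise, no ordering is known.
pub fn resolve_ordering(
    declared: Option<SortTerms>,
    per_file: &[Option<SortTerms>],
) -> Option<SortTerms> {
    if declared.is_some() {
        return declared;
    }
    // With no files there is nothing to infer from.
    let (first, rest) = per_file.split_first()?;
    // The first file must carry an ordering at all.
    let first = first.as_ref()?;
    if rest.iter().all(|o| o.as_ref() == Some(first)) {
        Some(first.clone())
    } else {
        None
    }
}
```

Note that agreeing metadata across files says nothing about how the files are ordered relative to each other, so a real implementation would need more than this check before treating a multi-file scan as globally sorted.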
I believe extending the Statistics with sort information is dangerous, as it deviates from the single-responsibility principle and creates the burden of maintaining order information in two places (Statistics and equivalences). I wonder if we can utilize the

We are currently working on an extensive refactor of the Statistics framework, so in both its current state and the new version, storing order information in it does not seem to be the right way. I also don't think this should be a fork-specific implementation, as it's a common need, but we need to find a smoother, more seamless approach.
Thank you @berkaysynnada for the review and the good suggestions.
/// Sort order within a RowGroup of a leaf column
#[derive(Clone, Debug, Eq, Hash, Ord, PartialEq, PartialOrd)]
pub struct SortingColumn {
    /// The ordinal position of the column (in this row group) *
    pub column_idx: i32,
    /// If true, indicates this column is sorted in descending order. *
    pub descending: bool,
    /// If true, nulls will come before non-null values, otherwise,
    /// nulls go at the end.
    pub nulls_first: bool,
}
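As a rough sketch of how these fields could be turned into an ordering description, the snippet below renders a list of sorting columns as ORDER BY-style terms. It redeclares a local copy of the struct so that it compiles on its own, and `describe_ordering` is a hypothetical helper, not part of the parquet or DataFusion crates. It also assumes a flat schema, where the leaf column ordinal matches the position in `column_names`.

```rust
/// Local mirror of the struct quoted above, so this sketch is self-contained.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Hypothetical helper: render sorting columns as ORDER BY-style terms.
/// Assumes a flat schema, so `column_idx` indexes directly into `column_names`.
/// Returns None if any index is negative or out of range.
pub fn describe_ordering(
    cols: &[SortingColumn],
    column_names: &[&str],
) -> Option<Vec<String>> {
    cols.iter()
        .map(|c| {
            if c.column_idx < 0 {
                return None;
            }
            let name = column_names.get(c.column_idx as usize)?;
            let dir = if c.descending { "DESC" } else { "ASC" };
            let nulls = if c.nulls_first { "NULLS FIRST" } else { "NULLS LAST" };
            Some(format!("{} {} {}", name, dir, nulls))
        })
        .collect()
}
```

Because the metadata is recorded per row group, a reader would presumably have to check every row group in a file before relying on the result.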
Hi @zhuqi-lucas -- sorry if we caused confusion here. I agree with @berkaysynnada and @ozankabak that ordering information is already represented in plans using
The idea is to store the output sort order in the parquet file. So this would look something like:
Does that make sense?
Thank you @alamb for clarifying. That makes sense; let me try to change the PR based on the conclusion of our discussion.
Hi @alamb, the PR is ready for review now. I addressed the comments and also added unit tests and slt test cases. Thanks!
…ng listing table implementation
Which issue does this PR close?
Closes issue_13891
Rationale for this change
This is the follow-up for:
#13874 (review)
We added support for sorting (order by) to DataFrameWriteOptions, but when a user queries a table whose files are already ordered, we can't get that ordering info from the table.
We need to find a way to communicate the ordering of a file back to the existing listing table implementation.
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
Yes, users automatically get the sort optimization for already-sorted columns without adding any config.
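To illustrate the kind of optimization this enables, here is a minimal sketch of the general idea that a sort can be skipped when the requested ordering is a leading prefix of an ordering the data already has. `SortKey` and `sort_is_redundant` are hypothetical names for this sketch. DataFusion's real ordering analysis (equivalence properties) is more general and is not reproduced here.

```rust
/// One sort key: column name plus direction and null placement.
/// Hypothetical type used only for this sketch.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SortKey {
    pub column: String,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Returns true when `required` is a leading prefix of `existing`,
/// in which case data ordered by `existing` already satisfies `required`
/// and an extra sort would be redundant.
pub fn sort_is_redundant(existing: &[SortKey], required: &[SortKey]) -> bool {
    existing.starts_with(required)
}
```

For example, data known to be sorted by (a, b) already satisfies a request for ordering by a alone, but not a request for ordering by b alone or by a in the opposite direction.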