Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find a way to communicate the ordering of a file back with the existi… #13933

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

zhuqi-lucas
Copy link
Contributor

…ng listing table implementation

Which issue does this PR close?

close issue_13891

Rationale for this change

This is the follow-up for:
#13874 (review)

We add support (order by / sort) for DataFrameWriteOptions, but when a user try to query the table which the file already ordered, we can't get info from the table.

We need to find a way to communicate the ordering of a file back with the existing listing table implementation.

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, user can automatically optimize the sort column without add config.

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate common Related to common crate proto Related to proto crate labels Dec 28, 2024
@zhuqi-lucas zhuqi-lucas marked this pull request as draft December 28, 2024 08:52
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 28, 2024
@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review December 29, 2024 08:14
@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Dec 29, 2024

Hi @alamb @Dandandan @TheBuilderJR, i submitted the first version PR for the automatically detecting sorted parquet file order and using the info to optimize for the plan. It's a very basic PR, we can add more follow-up issues to improve it.

I have more questions and will try to create follow-up issues after this PR, for example:

  1. Do we need to add a new function something like:
    df.is_sorted( "col1")
    This may get from the metadata?
  2. Now the sql copy command does not support write with sort order, we may can add copy to support write parquet with order, and the info also can write to metadata?
  3. Validation for the complex or invalid sort by cases, such as sort by ((a+1)*5 + b)?
  4. We only using the first file sort metadata to expose to statistic, we need to consider if we need to handle for multi file cases, each one file may has different sort order?
  5. Do we need to add more docs and use case for this new feature?

@zhuqi-lucas
Copy link
Contributor Author

This PR seems also can resolve the issue:
#4177

@ozankabak
Copy link
Contributor

I haven't reviewed this PR carefully yet, but we already have mechanisms to propagate source ordering. Why do we need to add this information to Statistics?

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Dec 29, 2024

Hi @ozankabak , thank you for review, we can propagate source ordering with:

  1. Create external table with order
  2. Create a parquet read with list options to setting the order when reading a parquet table

But when we write a parquet file with order and the table without order info (when creating, we don't add with order), we can't propagate the order to the table, we need to write the order to parquet metadata. And when we scan table without sort option setting, we can get the metadata for optimization.

For the Statistics to store the info, do you mean we can just load the metadata order to somewhere else?

@berkaysynnada
Copy link
Contributor

I believe extending the Statistics with sort information is dangerous, as it deviates from the single-responsibility principle and creates the burden of maintaining order information in two places (Statistics and equivalences).

I wonder if we can utilize the sorting_columns() API of WriterProperties and WriterPropertiesBuilder instead. Also I guess giving this info should be on TableParquetOptions rather than DataFrameWriteOptions. So sort_by in DataFrameWriteOptions would be redundant.

We are currently working on an extensive refactor of the Statistics framework, so both in its current state and the new version, storing order information in it does not seem the right way. We need to address this issue in a more seamless way. I also don’t think this should be a fork-specific implementation, as it’s a common need, but we need to find a smoother approach.

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Dec 30, 2024

Thank you @berkaysynnada for review and good suggestions.
I agree with you, this is not a good way to add the info to statistics, i will try to do following improvement for this PR:

  1. Try to make the maintaining order information unified to equivalences.
  2. About the sorting_columns(), it's a row group level info, not file level FileMetaData. I am not sure if it's a better way.
    But it looks like a unified column for sort info, we may also can benefit from other engine written sort column?
    I will also try to use it, and for our first version, we only support one order for the parquet file, so we can just get any of the row group to get the order info.
/// Sort order within a RowGroup of a leaf column
#[derive(Clone, Debug, Eq, Hash, Ord, PartialEq, PartialOrd)]
pub struct SortingColumn {
  /// The ordinal position of the column (in this row group) *
  pub column_idx: i32,
  /// If true, indicates this column is sorted in descending order. *
  pub descending: bool,
  /// If true, nulls will come before non-null values, otherwise,
  /// nulls go at the end.
  pub nulls_first: bool,
}
  1. Try to make the TableParquetOptions has the only sort_by order info.

@alamb
Copy link
Contributor

alamb commented Dec 30, 2024

Hi @zhuqi-lucas -- sorry if we caused confusion here. I agree with @berkaysynnada and @ozankabak that ordering information is already represented in plans using EquivalenceProperties

The idea is to store the output sorted order in the parquet file

So this would look something like:

  1. COPY (SELECT ... ORDER BY x, y) to 'foo.parquet'
  2. foo.parquet has some new metadata like DATAFUSION_ORDER_BY = x ASC, y ASC`
  3. On the next SELECT ... FROM 'foo.parquet' the ParquetExec would look for the DATAFUSION_ORDER_BY metadata in the file and if present update the EquivalenceProperties to reflect an ordering x, y

Does that make sense?

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Dec 31, 2024

Thank you @alamb for clarify, it makes sense, let me try to change the PR based our discussion conclusion.

@github-actions github-actions bot removed physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate labels Dec 31, 2024
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 2, 2025
@zhuqi-lucas
Copy link
Contributor Author

Hi @zhuqi-lucas -- sorry if we caused confusion here. I agree with @berkaysynnada and @ozankabak that ordering information is already represented in plans using EquivalenceProperties

The idea is to store the output sorted order in the parquet file

So this would look something like:

  1. COPY (SELECT ... ORDER BY x, y) to 'foo.parquet'
  2. foo.parquet has some new metadata like DATAFUSION_ORDER_BY = x ASC, y ASC`
  3. On the next SELECT ... FROM 'foo.parquet' the ParquetExec would look for the DATAFUSION_ORDER_BY metadata in the file and if present update the EquivalenceProperties to reflect an ordering x, y

Does that make sense?

Hi @alamb , the PR is ready for review now, i addressed the comments, also added unit testing and slt testing cases.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Find a way to communicate the ordering of a file back with the existing listing table implementation
4 participants