-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support (order by / sort) for DataFrameWriteOptions #13874
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @zhuqi-lucas -- this PR is really nice in my mind -- it is well commented and well tested 🏅
|
||
assert_batches_eq!( | ||
&[ | ||
"+---+---+", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good test 👍
.sort(options.sort_by)? | ||
.build()? | ||
}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't LogicalPlanBuilder::copy_to
have the ordering info as well?
If we write the info to a new table, do we know avoid sorting the table again? We should expose the ordering info to the newly created table for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think copy_to inserts the data from whatever the input plan is (if there is a redundant sort the optimizer will remove it). Is that what you are asking?
It is an interesting question of how to communicate "this table is sorted" information for newly written files. Is this what you are talking about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I mean that after this PR we seem we don't communicate "we are sorted" yet, which is I think would support usecases such as in the issue from @TheBuilderJR (write data sorted on a column, making following read query with order by
on the same column fast).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb @Dandandan for review, i am also interested how we can communicate "this table is sorted" information for newly written files. I will also investigate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the ordering information is normally handled by a higher level "catalog" rather than the parquet format itself.
This is the only thing I know of in Parquet, but I don't think it can describe the ordering
Maybe this is something that https://iceberg.apache.org/ could represent 🤔
It is also conceivable that DataFusion itself could write custom metadata in paquet and other formats that support that custom metadata with the ordering, but that seems like we would just be reinventing Iceberg and similar table formats
bfa1b6f
to
468b2aa
Compare
Updated the code to also include the write_table testing case which is the same with insert into sql. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion this PR is a step forward as at least now it is easier to configure the writer desires a particular sorted output
However, as @Dandandan points out, while this might make it easier to find ordering during writing DataFusion can already do this via the DataFrame API (which is actually what this PR does under the covers) but there is no way to communicate the ordering of a file back with the existing listing table implementation
I recommend we file a follow on issue to discuss that usecase if people want to pursue it
Thank you @alamb @Dandandan for review, it makes sense we continue investigating the solution for communicating the ordering of a file back with the existing listing table implementation. I created a follow-up issue for this: |
Thank you @zhuqi-lucas |
Which issue does this PR close?
Closes #13873
Rationale for this change
DataFrameWriteOptions is missing an order by / sort by like available in SQL.
For sql we have the option to sort, for example:
You can use the WITH ORDER clause of the CREATE EXTERNAL TABLE if your data is already ordered
https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table
But for writing my parquet or other format files, we don't support it
What changes are included in this PR?
Add the sort support.
Are these changes tested?
yes
Added unit testing.
Are there any user-facing changes?
Yes, we support new order option for DataFrameWriteOptions