Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate constant columns from sort requirements #9612

Closed
suremarc opened this issue Mar 14, 2024 · 4 comments · Fixed by #9649
Closed

Eliminate constant columns from sort requirements #9612

suremarc opened this issue Mar 14, 2024 · 4 comments · Fixed by #9649
Assignees
Labels
enhancement New feature or request

Comments

@suremarc
Copy link
Contributor

suremarc commented Mar 14, 2024

Is your feature request related to a problem or challenge?

I have a table sorted by (date, ticker, time) and would like to query data ordered by (date, time) assuming ticker is known to be constant.

It seems that the existing EquivalenceProperties code is built to allow this, but the constant properties are getting lost inside of FilterExec. Furthermore, it seems that FilterExec would propagate the necessary constant property if not for the following check failing:

https://github.com/apache/arrow-datafusion/blob/3c26e597aeacde0a5e6a51f30394d3d31c6acd96/datafusion/physical-plan/src/filter.rs#L127-L135

which ultimately boils down to this piece of code only allowing numeric types:

https://github.com/apache/arrow-datafusion/blob/3c26e597aeacde0a5e6a51f30394d3d31c6acd96/datafusion/physical-expr/src/intervals/utils.rs#L96-L111

I noticed that if we just add DataType::Utf8 to the list, it seems to work for the example below (in the next section), so my uneducated mind wonders if we can just add more datatypes to it? 🤔

Describe the solution you'd like

Given the following table:

CREATE EXTERNAL TABLE data (
    "date"   VARCHAR, 
    "ticker" VARCHAR, 
    "time"   VARCHAR,
) STORED AS PARQUET
WITH ORDER ("date", "ticker", "time")
LOCATION 'example/';

INSERT INTO data VALUES 
('2023', 'A', '0400'), 
('2023', 'A', '0401'), 
('2023', 'B', '0400');

I would like for the following query to be executed without a sort:

SELECT * FROM data 
WHERE ticker = 'A' 
ORDER BY "date", "time";
-- +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-- | plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                         -- |
-- +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-- | logical_plan  | Sort: data.date ASC NULLS LAST, data.time ASC NULLS LAST                                                                                                                                                                                                                                                                                                                                     -- |
-- |               |   Filter: data.ticker = Utf8("A")                                                                                                                                                                                                                                                                                                                                                            -- |
-- |               |     TableScan: data projection=[date, ticker, time], partial_filters=[data.ticker = Utf8("A")]                                                                                                                                                                                                                                                                                               -- |
-- | physical_plan | SortPreservingMergeExec: [date@0 ASC NULLS LAST,time@2 ASC NULLS LAST]                                                                                                                                                                                                                                                                                                                       -- |
-- |               |   SortExec: expr=[date@0 ASC NULLS LAST,time@2 ASC NULLS LAST]                                                                                                                                                                                                                                                                                                                               -- |
-- |               |     CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                                                              -- |
-- |               |       FilterExec: ticker@1 = A                                                                                                                                                                                                                                                                                                                                                               -- |
-- |               |         RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1                                                                                                                                                                                                                                                                                                                -- |
-- |               |           ParquetExec: file_groups={1 group: [[Users/matthewcramerus/arrow-datafusion/datafusion-cli/example/bGDu3tzuQaHhgpGZ_0.parquet]]}, projection=[date, ticker, time], output_ordering=[date@0 ASC NULLS LAST, ticker@1 ASC NULLS LAST, time@2 ASC NULLS LAST], predicate=ticker@1 = A, pruning_predicate=ticker_min@0 <= A AND A <= ticker_max@1, required_guarantees=[ticker in (A)] |
-- |               |                                                                                                                                                                                                                                                                                                                                                                                              -- |
-- +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Describe alternatives you've considered

I initially made a patch to FilterExec that uses the LiteralGuarantee API to extract constant columns, but then I realized I was doing what the AnalyzsisContext does, just worse and not in a way that respects statistics propagated up from child plans.

Additional context

No response

@suremarc suremarc added the enhancement New feature or request label Mar 14, 2024
@Lordworms
Copy link
Contributor

It seems like another equivalence-related issue I have done before, I'll look into this one.

@Lordworms
Copy link
Contributor

take

@Lordworms
Copy link
Contributor

Seems like combined with your PR, the results children[0].data would always be false, I'll try do figure it out then I think the issue would be worked out
image

@Lordworms
Copy link
Contributor

Current plan is to substitute equivalence in FilterExec and when Calculate equivalence and the filer expression has a EqExpr and left parts are Sorted from its child also the right part is a constant, we add that expression to the constant vec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants