Fix bug that COUNT(DISTINCT) on StringView panics #11768

Merged
merged 2 commits into from
Aug 1, 2024

XiangpengHao

Which issue does this PR close?

Closes #11767

Rationale for this change

It is a typo

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@alamb

alamb commented Aug 1, 2024

Thank you @XiangpengHao -- I have some tests written for this that I will push to this branch

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 1, 2024

@alamb alamb left a comment

Thank you for the very quick turnaround @XiangpengHao

    }
    DataType::LargeUtf8 => {
        Box::new(BytesDistinctCountAccumulator::<i64>::new(OutputType::Utf8))
    }
    DataType::Binary => Box::new(BytesDistinctCountAccumulator::<i32>::new(
        OutputType::Binary,
    )),
    DataType::BinaryView => Box::new(BytesViewDistinctCountAccumulator::new(

💯
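
The excerpt above picks a distinct-count accumulator by matching on the column's data type. As a rough, self-contained sketch of that dispatch pattern (not DataFusion's actual code): `ColumnType`, `AccumulatorKind`, and `select_accumulator` are hypothetical names invented for illustration, and the `Utf8View` arm is an assumption, since that arm is not visible in the excerpt.

```rust
/// Hypothetical stand-in for the column data types named in the excerpt.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Utf8,
    LargeUtf8,
    Binary,
    BinaryView,
    Utf8View,
}

/// Hypothetical description of which accumulator family gets chosen.
#[derive(Debug, PartialEq)]
enum AccumulatorKind {
    /// Offset-based byte accumulator, parameterized by offset width in bits
    /// (32 for i32 offsets, 64 for i64 offsets).
    Offsets { offset_bits: u8 },
    /// View-based byte accumulator.
    View,
}

/// Select an accumulator family from the column type. Because the match is
/// exhaustive, adding a new `ColumnType` variant fails to compile until it
/// gets an arm, but the compiler cannot tell whether an arm chose the
/// *right* family, which is why tests per type are still needed.
fn select_accumulator(t: ColumnType) -> AccumulatorKind {
    match t {
        ColumnType::Utf8 | ColumnType::Binary => {
            AccumulatorKind::Offsets { offset_bits: 32 }
        }
        ColumnType::LargeUtf8 => AccumulatorKind::Offsets { offset_bits: 64 },
        // Assumption: both view types go to the view-based accumulator.
        ColumnType::Utf8View | ColumnType::BinaryView => AccumulatorKind::View,
    }
}
```

A per-type test such as `assert_eq!(select_accumulator(ColumnType::Utf8View), AccumulatorKind::View)` is the kind of check that would catch a typo in one arm.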

@XiangpengHao

@alamb, do you have any thoughts on why ClickBench query 5 did not panic previously?

SELECT COUNT(DISTINCT "SearchPhrase") FROM hits;

I find that we only take this path when counting two distinct columns, for example:

SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry") FROM hits;

This makes me suspect that Q5 is being cast to Utf8, which would explain why it is slower with StringView. I'm looking into it; let me know if you have any ideas.

@alamb

alamb commented Aug 1, 2024

@alamb, do you have any thoughts on why ClickBench query 5 did not panic previously?

Yes, I think it is because

SELECT COUNT(DISTINCT "SearchPhrase") FROM hits;

is rewritten to a GROUP BY without DISTINCT here: https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/single_distinct_to_groupby.rs

andrewlamb@Andrews-MacBook-Pro-2:~/Downloads/benchmarking$ datafusion-cli -c 'EXPLAIN SELECT COUNT(DISTINCT "SearchPhrase") FROM "hits.parquet"'
DataFusion CLI v40.0.0
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: count(alias1) AS count(DISTINCT hits.parquet.SearchPhrase)                                                                                                                                                                                                                                                                                                                                                                                                                 |
|               |   Aggregate: groupBy=[[]], aggr=[[count(alias1)]]                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|               |     Aggregate: groupBy=[[hits.parquet.SearchPhrase AS alias1]], aggr=[[]]                                                                                                                                                                                                                                                                                                                                                                                                              |
|               |       TableScan: hits.parquet projection=[SearchPhrase]                                                                                                                                                                                                                                                                                                                                                                                                                                |
| physical_plan | ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT hits.parquet.SearchPhrase)]                                                                                                                                                                                                                                                                                                                                                                                                    |
|               |   AggregateExec: mode=Final, gby=[], aggr=[count(alias1)]                                                                                                                                                                                                                                                                                                                                                                                                                              |
|               |     CoalescePartitionsExec                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|               |       AggregateExec: mode=Partial, gby=[], aggr=[count(alias1)]                                                                                                                                                                                                                                                                                                                                                                                                                        |
|               |         AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[]                                                                                                                                                                                                                                                                                                                                                                                                        |
|               |           CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|               |             RepartitionExec: partitioning=Hash([alias1@0], 16), input_partitions=16                                                                                                                                                                                                                                                                                                                                                                                                    |
|               |               AggregateExec: mode=Partial, gby=[SearchPhrase@0 as alias1], aggr=[]                                                                                                                                                                                                                                                                                                                                                                                                     |
|               |                 ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Downloads/benchmarking/hits.parquet:0..923748528], [Users/andrewlamb/Downloads/benchmarking/hits.parquet:923748528..1847497056], [Users/andrewlamb/Downloads/benchmarking/hits.parquet:1847497056..2771245584], [Users/andrewlamb/Downloads/benchmarking/hits.parquet:2771245584..3694994112], [Users/andrewlamb/Downloads/benchmarking/hits.parquet:3694994112..4618742640], ...]}, projection=[SearchPhrase] |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.046 seconds.

This was largely needed before we had implemented native distinct accumulators. But I wonder if we should re-evaluate now that we have fast string accumulators 🤔
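
To make the rewrite in the plan above concrete, here is a minimal sketch in plain Rust (standard library only, not DataFusion code) of why the two forms give the same answer: a direct distinct count keeps one set of seen values, while the rewritten form first groups by the value and then counts the groups. The function names are hypothetical, and NULL handling is not modeled.

```rust
use std::collections::HashSet;

/// Direct approach: a distinct accumulator keeps a set of seen values
/// and reports the size of that set at the end.
fn count_distinct_direct(values: &[&str]) -> usize {
    let mut seen: HashSet<&str> = HashSet::new();
    for v in values {
        seen.insert(*v);
    }
    seen.len()
}

/// Rewritten approach: an inner aggregate groups by the value with no
/// aggregate expressions, then an outer aggregate counts the group rows.
fn count_distinct_via_group_by(values: &[&str]) -> usize {
    // Inner step: GROUP BY value, producing one row per distinct value.
    let groups: Vec<&str> = {
        let mut seen = HashSet::new();
        values.iter().copied().filter(|v| seen.insert(*v)).collect()
    };
    // Outer step: a plain COUNT over the grouped rows.
    groups.len()
}
```

In this sketch the single-distinct query never touches a distinct accumulator at all, which matches the observation that only the query with two COUNT(DISTINCT) expressions reached the panicking code path.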

@XiangpengHao

Is rewritten to a group by without distinct here

I see, that makes sense

@XiangpengHao

This was largely needed before we had implemented native distinct accumulators. But I wonder if we should re-evaluate now that we have fast string accumulators 🤔

I ran a simple test and found that this optimization rule indeed improves performance

@Dandandan Dandandan merged commit f044bc8 into apache:main Aug 1, 2024
24 checks passed
Successfully merging this pull request may close these issues.

COUNT(DISTINCT) on StringView panics: unreachable code: Utf8/Binary should use ArrowBytesSet