-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug that COUNT(DISTINCT)
on StringView panics
#11768
Conversation
Thank you @XiangpengHao -- I ahve some tests written for this that I will push to this branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the very quick turnaround @XiangpengHao
} | ||
DataType::LargeUtf8 => { | ||
Box::new(BytesDistinctCountAccumulator::<i64>::new(OutputType::Utf8)) | ||
} | ||
DataType::Binary => Box::new(BytesDistinctCountAccumulator::<i32>::new( | ||
OutputType::Binary, | ||
)), | ||
DataType::BinaryView => Box::new(BytesViewDistinctCountAccumulator::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
Do you @alamb have any thoughts on why clickbench query 5 won't previously panic? SELECT COUNT(DISTINCT "SearchPhrase") FROM hits; I find only when we count two disctincts will go to this path, for example: SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry") FROM hits; This make me suspect that Q5 is being casted to Utf8, which explains why it is slower with string view... looking at it, let me know if you have any ideas.. |
Yes, I think it is because
Is rewritten to a group by without distinct here: https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/single_distinct_to_groupby.rs
This was largely needed before we had implemented native distinct accumulators. But I wonder if we should re-evaluate now that we have fast string accumulators 🤔 |
I see, that makes sense |
I ran a simple test and found that this optimization rule indeed improves performance |
Which issue does this PR close?
Closes #11767
Rationale for this change
It is a typo
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?