Group by on a high-cardinality column in DataFusion is 10 times slower than on a low-cardinality column #1246
Comments
If I recall correctly, datafusion doesn't do fine optimization about |
I tried digging into the code of Trino and Doris. They both have a streaming aggregate node, but I can't understand how it works.
low cardinality:
high cardinality:
|
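For readers unfamiliar with the streaming aggregate node mentioned above, here is a minimal sketch of the general idea. It assumes the input is already sorted by the group key, and it is not the actual Trino or Doris implementation; the function name `streaming_count` is hypothetical.

```rust
/// Streaming COUNT(*) ... GROUP BY over input that is ALREADY SORTED by key.
/// Assumption: equal keys are adjacent. Only the current group is kept in
/// memory, and a finished group is emitted as soon as the key changes.
fn streaming_count<K: PartialEq + Clone>(sorted_keys: &[K]) -> Vec<(K, u64)> {
    let mut out = Vec::new();
    let mut current: Option<(K, u64)> = None;

    for key in sorted_keys {
        // Same key as the running group: just bump its counter.
        if let Some((k, n)) = current.as_mut() {
            if *k == *key {
                *n += 1;
                continue;
            }
        }
        // Key changed (or first row): emit the finished group, start a new one.
        if let Some(done) = current.take() {
            out.push(done);
        }
        current = Some((key.clone(), 1));
    }

    // Emit the last group, if any.
    if let Some(done) = current {
        out.push(done);
    }
    out
}
```

Because no hash table is built, the memory used does not grow with the number of distinct keys, which is why this approach can help on high-cardinality columns when the data is sorted (or cheap to sort).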
I'll take a look at Doris on the weekend. Until then, we can wait for someone else to answer your questions. Thanks for your comparison @jiangzhx |
Accidentally closed |
I think there is a lot of overhead creating and managing group keys via |
Related PR: #6657 |
see #4973 (comment) for proposal |
This should be closed by #6904 |
Describe the bug
Group by on a high-cardinality column in DataFusion is 10 times slower than on a low-cardinality column.
I also tested other OLAP engines; there the high-cardinality query is only 2 times slower or less:
Trino, an OLAP engine written in Java
Doris, an OLAP engine written in C++
To Reproduce
Steps to reproduce the behavior:
A Parquet table with 60,000,000 rows; data generated by ssb-dbgen
group by LO_ORDERPRIORITY
group by S_ADDRESS
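To make the difference between the two queries concrete, here is a minimal sketch of a hash-based COUNT grouped by a string key. It is not DataFusion's implementation; the helper names `hash_count` and `synthetic_keys` and the generated key values are hypothetical stand-ins for the real columns.

```rust
use std::collections::HashMap;

/// Hash-based COUNT(*) ... GROUP BY: the table holds one entry per distinct key,
/// so a high-cardinality column means many more entries to create and manage.
fn hash_count(keys: impl Iterator<Item = String>) -> HashMap<String, u64> {
    let mut groups: HashMap<String, u64> = HashMap::new();
    for key in keys {
        *groups.entry(key).or_insert(0) += 1;
    }
    groups
}

/// Synthetic data (an assumption for illustration): `rows` keys that cycle
/// through `distinct` different string values.
fn synthetic_keys(rows: u64, distinct: u64) -> impl Iterator<Item = String> {
    (0..rows).map(move |i| format!("key-{}", i % distinct))
}
```

Calling `hash_count(synthetic_keys(rows, distinct))` with a small and a large `distinct` value and timing both gives a rough feel for how much of the slowdown comes from the size of the hash table alone.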
Expected behavior
Performance should be about the same as the other engines.