-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Draft v2] Another Multi group by optimization #10976
Conversation
Great news! Although this approach is slightly slow on small string, but it is significant better if string is large. THIS PR
Main
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very cool @jayzhan211 -- exactly the right idea I think
SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "TraficSourceID" IN (-1, 6) AND "RefererHash" = 3594120000172545465 GROUP BY "URLHash", "EventDate"::INT::DATE ORDER BY PageViews DESC LIMIT 10 OFFSET 100; | ||
SELECT "WindowClientWidth", "WindowClientHeight", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "DontCountHits" = 0 AND "URLHash" = 2868770270353813622 GROUP BY "WindowClientWidth", "WindowClientHeight" ORDER BY PageViews DESC LIMIT 10 OFFSET 10000; | ||
SELECT DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) AS M, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-14' AND "EventDate"::INT::DATE <= '2013-07-15' AND "IsRefresh" = 0 AND "DontCountHits" = 0 GROUP BY DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) ORDER BY DATE_TRUNC('minute', M) LIMIT 10 OFFSET 1000; | ||
SELECT "UserID", concat("SearchPhrase", repeat('hello', 100)) as s, COUNT(*) FROM hits GROUP BY "UserID", s LIMIT 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a clever way to quickly iterate for benchmark tests 👍
); | ||
|
||
|
||
let u64_vec: Vec<u64> = groups.iter().map(|&x| x as u64).collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be faster to write into u64_vec
directly rather than write to groups and then copy over
|
||
|
||
// Index Array: [0, 1, 1, 0] | ||
// Data Array: ['a', 'c'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code copies all the strings again, which likely takes significant time
One way to make this faster is likey to special case the "has no nulls" path (so we can avoid if let Some(.)
check each row
The only other ways I can think to make this faster is to avoid copying the strings all together. The only way to do so I can figure are:
- Return a DictionaryArray as output
- Return a
StringViewArray
(similar to DictionayArray) [Epic] Implement support forStringView
in DataFusion #10918
🤔
Maybe we can try to hack in the ability to have the HashAggregateExec return Dictionary(Int32, String)
for these multi-column groups and see if we can show significant performance improvements
If so then we can figure out how to thread that ability through the engine.
BTW I hope/plan to use a plane trip on Sunday to prototype the "add a physical optimizer pass that sets types to |
I am starting to play with this PR again |
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #.
Related to #9403
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?