Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream #13995
base: main
Conversation
Thank you. I found the fix easy to follow 😄, and the change makes sense to me.
I have a suggestion to improve test coverage: since min/max have only one intermediate aggregate state (the partial min/max), we should also test aggregate functions that produce more than one intermediate state, like avg (a partial sum and a count). Duplicating the existing test and changing one of the aggregate functions to avg should be sufficient.
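For illustration, here is a minimal sketch of why avg exercises more spill columns than min/max. The field names and types below are assumptions for the sketch, not DataFusion's exact state-field naming:

use arrow::datatypes::{DataType, Field};

// min/max carry a single intermediate-state column per aggregate...
fn min_state_fields() -> Vec<Field> {
    vec![Field::new("min(value)", DataType::Float64, true)]
}

// ...while avg carries two (a running sum and a row count), so spilling it
// adds two columns beyond the input schema instead of one.
fn avg_state_fields() -> Vec<Field> {
    vec![
        Field::new("avg(value)[sum]", DataType::Float64, true),
        Field::new("avg(value)[count]", DataType::UInt64, true),
    ]
}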
let result =
    common::collect(single_aggregate.execute(0, Arc::clone(&task_ctx))?).await?;
I suggest adding an assertion here to make sure spilling actually happened for certain test cases. For example:
let metrics = single_aggregate.metrics().unwrap();
// Assert that at least one spill actually occurred during the aggregation.
assert!(metrics.spill_count().unwrap() > 0);
Which issue does this PR close?
Closes #13949.
Rationale for this change
When an aggregation operator spills intermediate (partial) state to disk, it needs a schema that includes both the group-by columns and the partial-aggregator columns (e.g., partial sums, counts, etc.). Previously, the code used the original input schema for spilling, which does not contain the additional columns representing aggregator state. As a result, reading the spilled data back caused a schema mismatch error.
This PR addresses that by introducing a partial aggregation schema that combines group columns and aggregator state columns, ensuring consistency when spilling and later reading the spilled data.
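As a rough sketch (the helper name and field names here are assumptions, not the PR's actual code), the spill schema can be thought of as the group-by columns followed by every aggregator's intermediate-state columns:

use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema, SchemaRef};

// Build the schema used for spilled partial-aggregate batches:
// group-by columns first, then each aggregator's state columns.
fn partial_aggregation_schema(group_fields: Vec<Field>, state_fields: Vec<Field>) -> SchemaRef {
    let mut fields = group_fields;
    fields.extend(state_fields);
    Arc::new(Schema::new(fields))
}

fn main() {
    let spill_schema = partial_aggregation_schema(
        vec![Field::new("group_key", DataType::Utf8, true)],
        vec![
            Field::new("avg(value)[sum]", DataType::Float64, true),
            Field::new("avg(value)[count]", DataType::UInt64, true),
        ],
    );
    // One group column plus two aggregator-state columns: a shape the
    // operator's original input schema does not have.
    assert_eq!(spill_schema.fields().len(), 3);
}

Reading the spilled batches back with this same schema keeps them consistent with what the partial aggregation stage wrote out.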
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
No