Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream #13995

Open
wants to merge 8 commits into
base: main

kosiew

@kosiew kosiew commented Jan 3, 2025

Which issue does this PR close?

Closes #13949.

Rationale for this change

When an aggregation operator spills intermediate (partial) state to disk, it needs a schema that includes both the group-by columns and the partial-aggregator state columns (e.g., partial sums and counts). Previously, the code spilled using the original input schema, which lacks the extra columns that hold aggregator state. As a result, reading back the spilled data failed with a column-count mismatch:

ArrowError(InvalidArgumentError(
  "number of columns(3) must match number of fields(2) in schema"
))

This PR addresses that by introducing a partial aggregation schema that combines group columns and aggregator state columns, ensuring consistency when spilling and later reading the spilled data.
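
To illustrate the mismatch described above, here is a minimal sketch that models the column-count check with stand-in types. `Schema`, `schema_of`, and `validate_batch` are hypothetical names invented for this example and are not the arrow-rs API; the column names are assumptions as well.

```rust
// Stand-in types: NOT the arrow-rs API, only a model of its column-count check.
#[derive(Debug)]
struct Schema {
    field_names: Vec<String>,
}

// Hypothetical helper that builds a schema from a list of field names.
fn schema_of(names: &[&str]) -> Schema {
    Schema {
        field_names: names.iter().map(|n| n.to_string()).collect(),
    }
}

// Models the validation performed when a batch of columns is paired with a schema.
fn validate_batch(schema: &Schema, num_columns: usize) -> Result<(), String> {
    if num_columns != schema.field_names.len() {
        return Err(format!(
            "number of columns({}) must match number of fields({}) in schema",
            num_columns,
            schema.field_names.len()
        ));
    }
    Ok(())
}

fn main() {
    // Assumed input: one group column and one value column.
    let input_schema = schema_of(&["a", "b"]);
    // Assumed spilled state: the group column plus two partial-state columns.
    let partial_schema = schema_of(&["a", "min(b)", "max(b)"]);
    let spilled_columns = 3;

    // Reading the spill with the input schema fails; the partial schema fits.
    println!("{:?}", validate_batch(&input_schema, spilled_columns));
    println!("{:?}", validate_batch(&partial_schema, spilled_columns));
}
```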

What changes are included in this PR?

  1. A new helper function, build_partial_agg_schema(), creates a partial schema by merging:
  • Group-by fields
  • Each aggregator’s internal “state fields”
  2. The aggregate operator is updated to use this partial schema when spilling or merging spilled data rather than the original (input) schema, which fixes the column mismatch error.
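
The merging step in item 1 could look roughly like the sketch below. Only the function name comes from this PR; the signature, the `Field` and `AggExpr` types, and the field names are assumptions made for illustration and do not reflect the actual DataFusion code.

```rust
// Stand-in for a schema field (hypothetical, not arrow's Field type).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

// Hypothetical stand-in for an aggregate expression; only its state fields matter here.
struct AggExpr {
    state_fields: Vec<Field>,
}

fn field(name: &str, data_type: &str) -> Field {
    Field {
        name: name.to_string(),
        data_type: data_type.to_string(),
    }
}

// Assumed signature: group-by fields first, then each aggregator's state fields in order.
fn build_partial_agg_schema(group_fields: &[Field], aggs: &[AggExpr]) -> Vec<Field> {
    let mut fields = group_fields.to_vec();
    for agg in aggs {
        fields.extend(agg.state_fields.iter().cloned());
    }
    fields
}

fn main() {
    let group = vec![field("a", "Int64")];
    let aggs = vec![
        // min keeps a single state column.
        AggExpr {
            state_fields: vec![field("min(b)", "Float64")],
        },
        // avg keeps two state columns: a partial sum and a count.
        AggExpr {
            state_fields: vec![
                field("avg(b)[sum]", "Float64"),
                field("avg(b)[count]", "UInt64"),
            ],
        },
    ];
    let schema = build_partial_agg_schema(&group, &aggs);
    for f in &schema {
        println!("{}: {}", f.name, f.data_type);
    }
}
```

Placing the group-by fields first keeps the spilled layout aligned with the order in which the operator emits group values followed by aggregator state, which is the ordering assumed in this sketch.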

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-expr Physical Expressions label Jan 3, 2025
@github-actions github-actions bot added the core Core DataFusion crate label Jan 3, 2025
@kosiew kosiew marked this pull request as ready for review January 3, 2025 06:43
@kosiew kosiew changed the title Refactor spill handling in GroupedHashAggregateStream to use partial … Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream Jan 3, 2025

@2010YOUY01 2010YOUY01 left a comment


Thank you. I found the fix easy to follow 😄, and the change makes sense to me.

I have a suggestion to improve test coverage:
Since min/max only has one intermediate aggregate state (partial min/max), we should also test aggregate functions that produce more than one intermediate state, like avg (partial sum and count).
Duplicating the existing test and modifying one of the aggregate functions to avg should be sufficient.
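
As a rough illustration of why avg exercises a different path than min/max, the sketch below models a partial avg state with two values (sum and count) that must both survive a spill and be merged afterwards. `AvgState` and its methods are hypothetical and are not DataFusion's accumulator API.

```rust
// Hypothetical partial state for avg: two values, so two spilled columns.
#[derive(Debug, Clone, Copy, PartialEq)]
struct AvgState {
    sum: f64,
    count: u64,
}

impl AvgState {
    fn new() -> Self {
        AvgState { sum: 0.0, count: 0 }
    }

    // Fold one input value into the partial state.
    fn update(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }

    // Combine another partial state (e.g. one read back from a spill file).
    fn merge(&mut self, other: &AvgState) {
        self.sum += other.sum;
        self.count += other.count;
    }

    // Produce the final average; None when no rows were seen.
    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}

fn main() {
    // First partial state, imagined as spilled to disk.
    let mut spilled = AvgState::new();
    for v in [1.0, 2.0, 3.0] {
        spilled.update(v);
    }

    // Second partial state, still in memory.
    let mut in_memory = AvgState::new();
    in_memory.update(5.0);

    // Merging needs both the sum and the count of the spilled state.
    in_memory.merge(&spilled);
    println!("{:?}", in_memory.evaluate());
}
```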


        let result =
            common::collect(single_aggregate.execute(0, Arc::clone(&task_ctx))?).await?;


@2010YOUY01 2010YOUY01 Jan 4, 2025


I suggest adding an assertion here to make sure spilling actually happened for certain test cases, for example:

        let metrics = single_aggregate.metrics();
        // ...and assert some metrics inside like 'spill count' is > 0

Linked issue (may be closed when this pull request is merged): Schema error when spilling with multiple aggregations