Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream #13995

kosiew · 2025-01-03T06:04:49Z

Which issue does this PR close?

Rationale for this change

When an aggregation operator spills intermediate (partial) state to disk, it needs a schema that includes both the group-by columns and partial-aggregator columns (e.g., partial sums, counts, etc.). Previously, the code used the original input schema for spilling, which does not match the additional columns representing aggregator states. As a result, reading back the spilled data caused a mismatch error:

ArrowError(InvalidArgumentError(
  "number of columns(3) must match number of fields(2) in schema"
))

This PR addresses that by introducing a partial aggregation schema that combines group columns and aggregator state columns, ensuring consistency when spilling and later reading the spilled data.
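To make the failure mode concrete, here is a minimal, standard-library-only sketch of the kind of validation that produces the error above. It does not use Arrow's real types: `Schema`, `Batch`, and `try_new_batch` are hypothetical stand-ins, and only the error text is taken from the report.

```rust
// Simplified stand-ins for a schema and a record batch.
// All names here are hypothetical and are not Arrow's API.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<String>,
}

#[derive(Debug)]
struct Batch {
    schema: Schema,
    columns: Vec<Vec<i64>>,
}

/// Mimics the check that rejects a batch whose column count differs
/// from the number of fields in the schema it is paired with.
fn try_new_batch(schema: &Schema, columns: Vec<Vec<i64>>) -> Result<Batch, String> {
    if columns.len() != schema.fields.len() {
        return Err(format!(
            "number of columns({}) must match number of fields({}) in schema",
            columns.len(),
            schema.fields.len()
        ));
    }
    Ok(Batch {
        schema: schema.clone(),
        columns,
    })
}
```

Pairing three spilled columns (one group column plus two state columns) with a two-field input schema fails in this model, while pairing them with a three-field partial schema succeeds.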

What changes are included in this PR?

A new helper function, build_partial_agg_schema(), creates a partial schema by merging:

Group-by fields
Each aggregator’s internal “state fields”

The aggregate operator is updated to use this partial schema when spilling or merging spilled data rather than the original (input) schema, which fixes the column mismatch error.
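The merge described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: only the function name `build_partial_agg_schema` comes from the description, while its signature here, the `Field` and `AggregateExpr` types, the `min_of`/`avg_of` helpers, and the state column names are all hypothetical.

```rust
/// A minimal stand-in for a schema field: a column name plus a type label.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: &'static str,
}

impl Field {
    fn new(name: impl Into<String>, data_type: &'static str) -> Self {
        Field { name: name.into(), data_type }
    }
}

/// Hypothetical aggregate expression: only the intermediate state
/// columns it emits in partial mode are modeled.
struct AggregateExpr {
    state_fields: Vec<Field>,
}

/// min has a single intermediate state column.
fn min_of(column: &str, data_type: &'static str) -> AggregateExpr {
    AggregateExpr {
        state_fields: vec![Field::new(format!("min({})[min]", column), data_type)],
    }
}

/// avg has two intermediate state columns: a running sum and a count.
fn avg_of(column: &str) -> AggregateExpr {
    AggregateExpr {
        state_fields: vec![
            Field::new(format!("avg({})[sum]", column), "Float64"),
            Field::new(format!("avg({})[count]", column), "UInt64"),
        ],
    }
}

/// Sketch of the merge: group-by fields first, then every aggregator's
/// state fields in order. The signature is an assumption.
fn build_partial_agg_schema(group_fields: &[Field], aggr_exprs: &[AggregateExpr]) -> Vec<Field> {
    let mut fields = group_fields.to_vec();
    for expr in aggr_exprs {
        fields.extend(expr.state_fields.iter().cloned());
    }
    fields
}
```

In this model, grouping by one column with one `min` and one `avg` yields four fields, which is the column count a spilled batch would need to match.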

Are these changes tested?

Yes

Are there any user-facing changes?

No

2010YOUY01

Thank you. I found the fix easy to follow 😄, and the change makes sense to me.

I have a suggestion to improve test coverage:
Since min/max only has one intermediate aggregate state (partial min/max), we should also test aggregate functions that produce more than one intermediate state, like avg (partial sum and count).
Duplicating the existing test and modifying one of the aggregate functions to avg should be sufficient.
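To illustrate why avg exercises the multi-column case, here is a small standard-library-only sketch of a partial avg state. `AvgState` and its methods are hypothetical and do not reflect DataFusion's accumulator API; the point is only that both the sum and the count must survive a spill for merged results to be correct.

```rust
/// Hypothetical partial state for avg: two values per group, so two
/// state columns have to be written to and read back from a spill file.
#[derive(Debug, Clone, Copy, PartialEq)]
struct AvgState {
    sum: f64,
    count: u64,
}

impl AvgState {
    fn new() -> Self {
        AvgState { sum: 0.0, count: 0 }
    }

    /// Fold one input value into the partial state.
    fn update(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }

    /// Combine another partial state (for example, one read back from a spill).
    fn merge(&mut self, other: AvgState) {
        self.sum += other.sum;
        self.count += other.count;
    }

    /// Produce the final average, or None when no rows were seen.
    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}
```

A partial min, by contrast, would need only a single value per group, which is why a min/max-only test exercises just one state column.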

2010YOUY01 · 2025-01-04T16:06:25Z

datafusion/core/src/dataframe/mod.rs

+
+        let result =
+            common::collect(single_aggregate.execute(0, Arc::clone(&task_ctx))?).await?;
+


I suggest adding an assertion here to make sure spilling actually happened in the test cases that are expected to spill. For example:

let metrics = single_aggregate.metrics(); // ...and assert some metrics inside like 'spill count' is > 0

Refactor spill handling in GroupedHashAggregateStream to use partial …

da2b11a

…aggregate schema

github-actions bot added the physical-expr Physical Expressions label Jan 3, 2025

kosiew added 2 commits January 3, 2025 14:12

Implement aggregate functions with spill handling in tests

01d2b60

Merge branch 'main' into fix-spill

e094adb

github-actions bot added the core Core DataFusion crate label Jan 3, 2025

Add tests for aggregate functions with and without spill handling

d066aff

kosiew marked this pull request as ready for review January 3, 2025 06:43

kosiew changed the title ~~Refactor spill handling in GroupedHashAggregateStream to use partial …~~ Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream Jan 3, 2025

kosiew added 4 commits January 3, 2025 16:32

Move test related imports into mod test

04d9123

Rename spill pool test functions for clarity and consistency

242f5ab

Refactor aggregate function imports to use fully qualified paths

270efd7

Remove outdated comments regarding input batch schema for spilling in…

38ade08

… GroupedHashAggregateStream

2010YOUY01 reviewed Jan 4, 2025
