Implement GroupsAccumulator for corr(x,y) aggregate function #13581

2010YOUY01 · 2024-11-27T15:29:53Z

Which issue does this PR close?

Rationale for this change

Implement GroupsAccumulator for the corr aggregate function to improve performance when group cardinality is high.

I reran the h2o benchmark:

Data Generation

falsa groupby --path-prefix=/Users/yongting/data/ --size MEDIUM --data-format PARQUET
https://github.com/mrpowers-io/falsa

Run benchmark in datafusion-cli

CREATE EXTERNAL TABLE IF NOT EXISTS h2o_100m (
    id1 VARCHAR NOT NULL,
    id2 VARCHAR NOT NULL,
    id3 VARCHAR NOT NULL,
    id4 INTEGER NOT NULL,
    id5 INTEGER NOT NULL,
    id6 INTEGER NOT NULL,
    v1 INTEGER NOT NULL,
    v2 INTEGER NOT NULL,
    v3 DOUBLE PRECISION NOT NULL
)
STORED AS parquet
LOCATION '/Users/yongting/data/G1_1e8_1e8_100_0.parquet';

select id2, id4, power(corr(v1, v2), 2) as r2 from h2o_100m group by id2, id4;

Result

Main: 12s
This PR: 4s
(On my MacBook with an M4 Pro)

Remaining tasks

Implement convert_to_states()

What changes are included in this PR?

Implement two utility functions, accumulate_multiple and accumulate_correlation_states, to accumulate states in the correlation function. (The existing util functions are for aggregate functions with a single input expression, e.g. avg(expr1), whereas corr(expr1, expr2) takes two.)
Implement GroupsAccumulator for corr()
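
To illustrate the idea behind the two items above, here is a minimal std-only sketch of grouped correlation state kept in flat per-group vectors. It is not the code in this PR: the struct CorrGroupStates, its methods, and the use of Option<f64> slices in place of Arrow arrays are all hypothetical, and the sum-based formula was picked for brevity. Returning None for groups with fewer than two rows or zero variance is also an assumption of the sketch, not a statement about DataFusion's behavior.

```rust
/// Hypothetical per-group running sums for Pearson correlation.
/// One slot per group in each vector, instead of one accumulator object per group.
#[derive(Default)]
struct CorrGroupStates {
    count: Vec<u64>,
    sum_x: Vec<f64>,
    sum_y: Vec<f64>,
    sum_xy: Vec<f64>,
    sum_xx: Vec<f64>,
    sum_yy: Vec<f64>,
}

impl CorrGroupStates {
    /// Grow every state vector so that indices 0..total_groups are valid.
    fn resize(&mut self, total_groups: usize) {
        self.count.resize(total_groups, 0);
        self.sum_x.resize(total_groups, 0.0);
        self.sum_y.resize(total_groups, 0.0);
        self.sum_xy.resize(total_groups, 0.0);
        self.sum_xx.resize(total_groups, 0.0);
        self.sum_yy.resize(total_groups, 0.0);
    }

    /// Fold one batch into the states. `group_indices[i]` is the group of row i.
    /// A row is skipped when either of its two inputs is null (None).
    fn update_batch(
        &mut self,
        group_indices: &[usize],
        xs: &[Option<f64>],
        ys: &[Option<f64>],
        total_groups: usize,
    ) {
        self.resize(total_groups);
        for ((&g, x), y) in group_indices.iter().zip(xs).zip(ys) {
            if let (Some(x), Some(y)) = (*x, *y) {
                self.count[g] += 1;
                self.sum_x[g] += x;
                self.sum_y[g] += y;
                self.sum_xy[g] += x * y;
                self.sum_xx[g] += x * x;
                self.sum_yy[g] += y * y;
            }
        }
    }

    /// Produce one correlation value per group.
    /// Assumption of this sketch: None for fewer than two rows or zero variance.
    fn evaluate(&self) -> Vec<Option<f64>> {
        (0..self.count.len())
            .map(|g| {
                if self.count[g] < 2 {
                    return None;
                }
                let n = self.count[g] as f64;
                let cov = self.sum_xy[g] - self.sum_x[g] * self.sum_y[g] / n;
                let var_x = self.sum_xx[g] - self.sum_x[g] * self.sum_x[g] / n;
                let var_y = self.sum_yy[g] - self.sum_y[g] * self.sum_y[g] / n;
                let denom = (var_x * var_y).sqrt();
                if denom == 0.0 {
                    None
                } else {
                    Some(cov / denom)
                }
            })
            .collect()
    }
}
```

Calling update_batch once per input batch and evaluate at the end mirrors the general shape of a grouped accumulator: the per-row work is a few indexed additions, with no per-group allocation or dynamic dispatch.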

Are these changes tested?

Unit tests for util functions
corr() is covered by existing tests

Are there any user-facing changes?

No

alamb · 2024-11-27T16:52:06Z

This looks amazing -- thank you @2010YOUY01

I plan to review it over the next day or two

It seems like maybe we should add the data generator for h2o benchmark to the bench.sh script 🤔

Dandandan · 2024-11-27T17:03:24Z

datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs

+            let nulls = arr
+                .nulls()
+                .expect("If null_count() > 0, nulls must be present");
+            match combined_nulls {


If you pass combined_nulls to NullBuffer::union, it will take care of handling the Option.
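
For readers unfamiliar with the suggestion, the following std-only sketch models the Option handling that a union of validity masks performs, so the explicit match on combined_nulls becomes unnecessary. It does not use the arrow crate: union_validity and combine_validity are hypothetical names, and a bool slice (true = valid) stands in for a NullBuffer. As far as I know, the real NullBuffer::union takes two Option<&NullBuffer> values and returns an Option<NullBuffer>, which is the shape imitated here.

```rust
/// Std-only stand-in for a validity-mask union (true = row is valid).
/// `None` means the input has no nulls, so it places no constraint on the result.
fn union_validity(lhs: Option<&[bool]>, rhs: Option<&[bool]>) -> Option<Vec<bool>> {
    match (lhs, rhs) {
        // Both masks present: a row is valid only if it is valid in both.
        (Some(l), Some(r)) => Some(l.iter().zip(r).map(|(a, b)| *a && *b).collect()),
        // Only one mask present: it carries over unchanged.
        (Some(n), None) | (None, Some(n)) => Some(n.to_vec()),
        // Neither input has nulls: the result has none either.
        (None, None) => None,
    }
}

/// Fold the masks of several input columns into one combined mask.
/// The caller never has to match on the running Option itself.
fn combine_validity(columns: &[Option<&[bool]>]) -> Option<Vec<bool>> {
    columns
        .iter()
        .fold(None, |acc: Option<Vec<bool>>, col| {
            union_validity(acc.as_deref(), *col)
        })
}
```

With this shape, the loop over input arrays can start from None and union in each array's nulls in turn, which is what the review comment is pointing at.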

Implement GroupsAccumulator for corr(x,y)

a834fda

github-actions bot added the functions label Nov 27, 2024

2010YOUY01 mentioned this pull request Nov 27, 2024

[EPIC] Improved aggregate function performance #13548

Open

2 tasks

Dandandan reviewed Nov 27, 2024
