Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement GroupsAccumulator for corr(x,y) aggregate function #13581

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

2010YOUY01
Copy link
Contributor

@2010YOUY01 2010YOUY01 commented Nov 27, 2024

Which issue does this PR close?

Closes #13549

Rationale for this change

Implement GroupsAccumulator for corr aggregation function, for better performance when group cardinality is high

I rerun the H2o benchmark:

Data Generation

falsa groupby --path-prefix=/Users/yongting/data/ --size MEDIUM --data-format PARQUET
https://github.com/mrpowers-io/falsa

Run benchmark in datafusion-cli

CREATE EXTERNAL TABLE IF NOT EXISTS h2o_100m (
    id1 VARCHAR NOT NULL,
    id2 VARCHAR NOT NULL,
    id3 VARCHAR NOT NULL,
    id4 INTEGER NOT NULL,
    id5 INTEGER NOT NULL,
    id6 INTEGER NOT NULL,
    v1 INTEGER NOT NULL,
    v2 INTEGER NOT NULL,
    v3 DOUBLE PRECISION NOT NULL
)
STORED AS parquet
LOCATION '/Users/yongting/data/G1_1e8_1e8_100_0.parquet';

select id2, id4, power(corr(v1, v2), 2) as r2 from h2o_100m group by id2, id4;

Result

Main: 12s
This PR: 4s
(On my MacBook with m4 pro)


Remaining tasks

  • Implement convert_to_states()

What changes are included in this PR?

  1. Implement two utility functions: accumulate_multiple and accumulate_correlation_states to accumulate states in correlation function. (existing util functions is for aggregate functions with 1 input expr avg(expr1) v.s. corr(expr1, expr2))
  2. Implement GroupsAccumulator for corr()

Are these changes tested?

Unit tests for util functions
corr() is covered by existing tests

Are there any user-facing changes?

No

@alamb
Copy link
Contributor

alamb commented Nov 27, 2024

This looks amazing -- thank you @2010YOUY01

I plan to review it over the next day or two

It seems like maybe we should add the data generator for h2o benchmark to the bench.sh script 🤔

let nulls = arr
.nulls()
.expect("If null_count() > 0, nulls must be present");
match combined_nulls {
Copy link
Contributor

@Dandandan Dandandan Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If passing combined_nulls to NullBuffer::union it will take care of handling Option

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance of corr function
3 participants