Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-1648][VL] Add max_by/min_by aggregate function support #2336

Merged
merged 2 commits into from
Nov 9, 2023

Conversation

Yohahaha
Copy link
Contributor

No description provided.

@github-actions
Copy link

Run Gluten Clickhouse CI

@Yohahaha Yohahaha changed the title [VL] Add max_by/min_by aggregate function support [GLUTEN-1648][VL] Add max_by/min_by aggregate function support Jul 14, 2023
@github-actions
Copy link

#1648

@Yohahaha
Copy link
Contributor Author

Yohahaha commented Jul 14, 2023

Spark's max_by has a little different with Presto/Velox in implementation, detailed discussion here facebookincubator/velox#4971 .

Seems Spark UT does not specify these corner cases, let's add these functions for now, and will be keeping modify above PR if it is necessary.

@github-actions
Copy link

Run Gluten Clickhouse CI

@Yohahaha
Copy link
Contributor Author

@rui-mo @PHILO-HE could you help take a look?

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Jul 14, 2023

Hi @Yohahaha, thanks for your patch! Recently, to relieve the burden of velox code rebase, we prefer directly contributing velox code to its upstream (exceptions are some fixes on the code only exists in oap/velox, etc.). Could you please try to get the velox patch merged to its upstream?
It's OK to submit this Gluten PR for early code review & Gluten CI verifications. I will add pending velox rebase tag. Thanks!

@PHILO-HE PHILO-HE added the pending velox rebase the patch will be merged after velox rebased with Meta main branch label Jul 14, 2023
@rui-mo
Copy link
Contributor

rui-mo commented Jul 14, 2023

Spark's max_by has a little different with Presto/Velox in implementation, detailed discussion here facebookincubator/velox#4971 .

Seems Spark UT does not specify these corner cases, let's add these functions for now, and will be keeping modify above PR if it is necessary.

Hi @Yohahaha, how large would the difference affect the computing? There could be some risks if the functions are offloaded to Velox but with incorrect result.

@Yohahaha
Copy link
Contributor Author

Hi @Yohahaha, thanks for your patch! Recently, to relieve the burden of velox code rebase, we prefer directly contributing velox code to its upstream (exceptions are some fixes on the code only exists in oap/velox, etc.). Could you please try to get the velox patch merged to its upstream? It's OK to submit this Gluten PR for early code review & Gluten CI verifications. I will add pending velox rebase tag. Thanks!

This PR does not include codes which need push to upstream, please check related oap/velox link.

@Yohahaha
Copy link
Contributor Author

Spark's max_by has a little different with Presto/Velox in implementation, detailed discussion here facebookincubator/velox#4971 .
Seems Spark UT does not specify these corner cases, let's add these functions for now, and will be keeping modify above PR if it is necessary.

Hi @Yohahaha, how large would the difference affect the computing? There could be some risks if the functions are offloaded to Velox but with incorrect result.

It's a minor change, different behaviors see below

SELECT max_by(x, y) FROM VALUES ('a', 10), ('b', 50), ('c', 50) AS tab(x, y)

Spark result is c
Presto result is b

Spark does not specify max_by/min_by as deterministic, so I pending update above PR.

@rui-mo
Copy link
Contributor

rui-mo commented Jul 14, 2023

SELECT max_by(x, y) FROM VALUES ('a', 10), ('b', 50), ('c', 50) AS tab(x, y)

Spark result is c
Presto result is b

@Yohahaha Thank you, I see. So the results can be different when there are equal values, right?

Spark does not specify max_by/min_by as deterministic, so I pending update above PR.

Could you explain this statement more a bit? Does that mean Spark's behavior is random?

@Yohahaha
Copy link
Contributor Author

Yohahaha commented Jul 14, 2023

So the results can be different when there are equal values, right?

Yes.

Could you explain this statement more a bit? Does that mean Spark's behavior is random?

If we have 2 equal values and both are largest of group, Spark will return second, however Presto return first. Since input data in agg is non-deterministic, Velox community think there is no need to distinguish that difference.

And I found Spark does not specify these corner cases as well or any comments for that.

@rui-mo
Copy link
Contributor

rui-mo commented Jul 17, 2023

@Yohahaha Could you help open an issue for this corner case in Gluten? Please also update this limitation in https://github.com/oap-project/gluten/blob/main/docs/velox-backend-limitations.md. Thanks.

@Yohahaha Yohahaha marked this pull request as draft July 17, 2023 03:10
@Yohahaha
Copy link
Contributor Author

Convert to draft and wait align behavior with Spark.

@PHILO-HE
Copy link
Contributor

Please note we have moved the code related to substrait from oap/velox into gluten. So for your proposed changes in oap/velox, please apply them in gluten/cpp/velox/substrait/SubstraitToVeloxPlanValidator.cc. Thanks!

@Yohahaha
Copy link
Contributor Author

Yohahaha commented Aug 2, 2023

Wait oap/velox rebase 8.2

@PHILO-HE
Copy link
Contributor

Hi @Yohahaha, as oap/velox is aligned with the upstream, this PR can be merged. Could you please do a rebase?

@Yohahaha
Copy link
Contributor Author

@rui-mo I see this change and may_by/min_by support also need change here, this is your expected?

@rui-mo
Copy link
Contributor

rui-mo commented Oct 20, 2023

@rui-mo I see this change and may_by/min_by support also need change here, this is your expected?

@Yohahaha This change is to allow companion function registration through config. What do you need for max_by and min_by?

@Yohahaha
Copy link
Contributor Author

Yohahaha commented Oct 20, 2023

@rui-mo I see this change and may_by/min_by support also need change here, this is your expected?

@Yohahaha This change is to allow companion function registration through config. What do you need for max_by and min_by?

Spark's min_by/max_by need overwrite presto's, but current register function does not provide these two bool args. I'm not sure whether oap-velox still accept new patches.

https://github.com/oap-project/velox/blob/b8fc63ec35f049a49e417ba4b0845c2b1e367748/velox/functions/sparksql/aggregates/Register.cpp#L33C33-L43

@rui-mo
Copy link
Contributor

rui-mo commented Oct 20, 2023

@Yohahaha This is my draft PR facebookincubator/velox#7110. I can add these two bool args to min_by and max_by if needed.

@Yohahaha
Copy link
Contributor Author

@Yohahaha This is my draft PR facebookincubator/velox#7110. I can add these two bool args to min_by and max_by if needed.

That's great, but I think we could pass specific prefix when register functions such as presto_ or spark_, and add prefix conversion logic in somewhere when convert substait plan to velox plan, it will bring benefits for use more accurate functions. What's your thoughts?

@Yohahaha
Copy link
Contributor Author

#3472

@Yohahaha Yohahaha closed this Oct 20, 2023
@Yohahaha Yohahaha reopened this Oct 20, 2023
@github-actions
Copy link

Run Gluten Clickhouse CI

@rui-mo
Copy link
Contributor

rui-mo commented Oct 25, 2023

We've added the overwrite config for max_by and mix_by at Register.cpp#L41. Maybe this PR can be firstly enabled with that. cc @PHILO-HE

@Yohahaha
Copy link
Contributor Author

We've added the overwrite config for max_by and mix_by at Register.cpp#L41. Maybe this PR can be firstly enabled with that. cc @PHILO-HE

Thanks, I will try later.

@Yohahaha
Copy link
Contributor Author

We've added the overwrite config for max_by and mix_by at Register.cpp#L41. Maybe this PR can be firstly enabled with that. cc @PHILO-HE

depends on oap-project/velox#426

@Yohahaha Yohahaha marked this pull request as ready for review October 27, 2023 09:26
@github-actions
Copy link

Run Gluten Clickhouse CI

@rui-mo
Copy link
Contributor

rui-mo commented Oct 30, 2023

@Yohahaha There are code style failures. Please refer to https://github.com/oap-project/gluten/blob/main/docs/developers/NewToGluten.md#javascala-code-style for code style adjustment.

@github-actions
Copy link

Run Gluten Clickhouse CI

Copy link

github-actions bot commented Nov 6, 2023

Run Gluten Clickhouse CI

1 similar comment
Copy link

github-actions bot commented Nov 6, 2023

Run Gluten Clickhouse CI

@Yohahaha
Copy link
Contributor Author

Yohahaha commented Nov 6, 2023

CI passed.

@Yohahaha
Copy link
Contributor Author

Yohahaha commented Nov 7, 2023

please help review again, thanks! @rui-mo @PHILO-HE

PHILO-HE
PHILO-HE previously approved these changes Nov 7, 2023
Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@rui-mo
Copy link
Contributor

rui-mo commented Nov 7, 2023

@Yohahaha Could you resolve the conflicts? Thanks.

Copy link

github-actions bot commented Nov 8, 2023

Run Gluten Clickhouse CI

Copy link

github-actions bot commented Nov 9, 2023

Run Gluten Clickhouse CI

@Yohahaha
Copy link
Contributor Author

Yohahaha commented Nov 9, 2023

could we merge?

@rui-mo rui-mo requested a review from zzcclp November 9, 2023 08:23
@@ -153,6 +153,8 @@ case class EncodeDecodeValidator() extends FunctionValidator {
object CHExpressionUtil {

final val CH_AGGREGATE_FUNC_BLACKLIST: Map[String, FunctionValidator] = Map(
MAX_BY -> DefaultValidator(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @zzcclp

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@PHILO-HE PHILO-HE merged commit fee6529 into apache:main Nov 9, 2023
17 checks passed
@GlutenPerfBot
Copy link
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_2336_time.csv log/native_master_11_08_2023_29b58997a_time.csv difference percentage
q1 34.21 34.11 -0.094 99.73%
q2 25.55 25.13 -0.417 98.37%
q3 36.75 38.62 1.867 105.08%
q4 38.28 36.07 -2.205 94.24%
q5 71.63 71.84 0.214 100.30%
q6 7.96 7.59 -0.372 95.33%
q7 84.50 85.82 1.317 101.56%
q8 87.28 86.97 -0.312 99.64%
q9 123.47 121.86 -1.610 98.70%
q10 52.42 54.80 2.381 104.54%
q11 20.13 20.59 0.456 102.26%
q12 25.23 24.63 -0.596 97.64%
q13 49.14 49.95 0.819 101.67%
q14 17.55 16.99 -0.560 96.81%
q15 30.29 30.97 0.677 102.23%
q16 15.88 16.51 0.630 103.97%
q17 101.25 102.40 1.151 101.14%
q18 149.37 148.28 -1.095 99.27%
q19 14.95 15.13 0.188 101.25%
q20 29.13 29.97 0.838 102.88%
q21 226.85 222.20 -4.653 97.95%
q22 13.44 13.55 0.109 100.81%
total 1255.27 1254.01 -1.266 99.90%

@Yohahaha Yohahaha deleted the max_by branch November 10, 2023 01:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pending velox rebase the patch will be merged after velox rebased with Meta main branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants