[GLUTEN-1648][VL] Add max_by/min_by aggregate function support #2336
Run Gluten Clickhouse CI |
Spark's max_by differs slightly from the Presto/Velox implementation; see the detailed discussion in facebookincubator/velox#4971. Spark's unit tests do not seem to cover these corner cases, so let's add these functions for now; I will keep updating the PR above if necessary. |
Run Gluten Clickhouse CI |
Hi @Yohahaha, thanks for your patch! Recently, to ease the burden of rebasing Velox code, we prefer contributing Velox code directly upstream (exceptions include fixes to code that exists only in oap/velox). Could you please try to get the Velox patch merged upstream? |
Hi @Yohahaha, how much would the difference affect the results? There is some risk if the functions are offloaded to Velox but return incorrect results. |
This PR does not include code that needs to be pushed upstream; please check the related oap/velox link. |
It's a minor change; see the different behaviors below.
Spark does not specify |
@Yohahaha Thank you, I see. So the results can be different when there are equal values, right?
Could you explain this statement a bit more? Does that mean Spark's behavior is random? |
Yes.
If a group has two equal values that are both the largest, Spark returns the second one, whereas Presto returns the first. Since the order of input data in an aggregation is non-deterministic, the Velox community thinks there is no need to distinguish that difference. I also found that Spark neither specifies these corner cases nor has any comments about them. |
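To make the tie-breaking difference concrete, here is a minimal Scala sketch of the behavior described in the comment above. It is not the Spark or Velox implementation: the object and method names are hypothetical, and the only thing it models is "first wins" versus "last wins" when two rows share the largest ordering value.

```scala
object MaxByTieBreak {
  // Each row is (x, y); max_by(x, y) returns the x paired with the largest y.
  // `replaces(candidate, best)` decides whether a later row displaces the current best.
  private def pick[A](rows: Seq[(A, Int)])(replaces: (Int, Int) => Boolean): Option[A] =
    rows.foldLeft(Option.empty[(A, Int)]) {
      case (Some(best), row) if !replaces(row._2, best._2) => Some(best)
      case (_, row) => Some(row)
    }.map(_._1)

  // Strict comparison: on a tie the earlier row is kept
  // (the Presto behavior as described in this thread).
  def maxByFirstWins[A](rows: Seq[(A, Int)]): Option[A] = pick(rows)(_ > _)

  // Non-strict comparison: on a tie the later row replaces the earlier one
  // (the Spark behavior as described in this thread).
  def maxByLastWins[A](rows: Seq[(A, Int)]): Option[A] = pick(rows)(_ >= _)

  def main(args: Array[String]): Unit = {
    val rows = Seq("a" -> 10, "b" -> 10, "c" -> 3)
    println(maxByFirstWins(rows)) // Some(a)
    println(maxByLastWins(rows))  // Some(b)
  }
}
```

Both variants agree whenever the largest value is unique, so the results only diverge on ties, and which tied row arrives first depends on input order.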
@Yohahaha Could you help open an issue for this corner case in Gluten? Please also update this limitation in https://github.com/oap-project/gluten/blob/main/docs/velox-backend-limitations.md. Thanks. |
Converting to draft until the behavior is aligned with Spark. |
Please note we have moved the code related to substrait from oap/velox into gluten. So for your proposed changes in oap/velox, please apply them in |
Waiting for oap/velox rebase 8.2. |
Hi @Yohahaha, as oap/velox is aligned with the upstream, this PR can be merged. Could you please do a rebase? |
Spark's min_by/max_by need to overwrite Presto's, but the current registration function does not provide these two bool args. I'm not sure whether oap-velox still accepts new patches. |
@Yohahaha This is my draft PR facebookincubator/velox#7110. I can add these two bool args to min_by and max_by if needed. |
That's great, but I think we could pass a specific prefix when registering functions such as |
Run Gluten Clickhouse CI |
We've added the overwrite config for max_by and min_by at Register.cpp#L41. Maybe this PR can first be enabled with that. cc @PHILO-HE |
Thanks, I will try later. |
depends on oap-project/velox#426 |
Run Gluten Clickhouse CI |
@Yohahaha There are code style failures. Please refer to https://github.com/oap-project/gluten/blob/main/docs/developers/NewToGluten.md#javascala-code-style for code style adjustment. |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
CI passed. |
Thanks!
@Yohahaha Could you resolve the conflicts? Thanks. |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Could we merge? |
@@ -153,6 +153,8 @@ case class EncodeDecodeValidator() extends FunctionValidator {
object CHExpressionUtil {

  final val CH_AGGREGATE_FUNC_BLACKLIST: Map[String, FunctionValidator] = Map(
    MAX_BY -> DefaultValidator(),
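For readers unfamiliar with this pattern, the following is a rough, self-contained sketch of how a name-to-validator blacklist could gate offloading. Every name here (`Validator`, `RejectAlways`, `aggregateBlacklist`, `canOffload`) is hypothetical and chosen for illustration; it is not Gluten's actual `FunctionValidator` API, and it assumes that a blacklisted function with a default validator is simply not offloaded to the backend.

```scala
object OffloadCheckSketch {
  // Hypothetical stand-in for a validator; not Gluten's FunctionValidator trait.
  trait Validator {
    def supports(functionName: String): Boolean
  }

  // Hypothetical default validator: rejects the function unconditionally.
  case object RejectAlways extends Validator {
    def supports(functionName: String): Boolean = false
  }

  // Hypothetical blacklist keyed by function name, mirroring the shape of the map in the diff.
  val aggregateBlacklist: Map[String, Validator] = Map(
    "max_by" -> RejectAlways
  )

  // A function absent from the blacklist is offloaded;
  // a listed function defers to its validator.
  def canOffload(functionName: String): Boolean =
    aggregateBlacklist.get(functionName).forall(_.supports(functionName))
}
```

Under these assumptions, `OffloadCheckSketch.canOffload("max_by")` is false while an unlisted name such as `"sum"` yields true, so a backend can opt out of a new function by adding a single map entry.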
cc @zzcclp
LGTM
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====