Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-8557][CH] Collapse nested function calls for And/Or for performance optimization #8558

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

KevinyhZou
Copy link
Contributor

@KevinyhZou KevinyhZou commented Jan 17, 2025

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #8557)

How was this patch tested?

test by ut

@KevinyhZou KevinyhZou marked this pull request as draft January 17, 2025 09:11
@github-actions github-actions bot added CORE works for Gluten Core CLICKHOUSE labels Jan 17, 2025
Copy link

#8557

Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls [GLUTEN-8557][CH]Optimize nested function calls for And, GetStructField Jan 17, 2025
@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls for And, GetStructField [GLUTEN-8557][CH]Optimize nested function calls for And/GetStructField Jan 17, 2025
Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls for And/GetStructField [GLUTEN-8557][CH]Optimize nested function calls for And/Or/GetStructField/GetJsonObject Jan 21, 2025
@KevinyhZou KevinyhZou closed this Jan 21, 2025
@KevinyhZou KevinyhZou reopened this Jan 21, 2025
@KevinyhZou KevinyhZou marked this pull request as ready for review January 21, 2025 11:06
Copy link

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Feb 4, 2025

Run Gluten Clickhouse CI on x86

5 similar comments
Copy link

github-actions bot commented Feb 4, 2025

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Feb 5, 2025

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou
Copy link
Contributor Author

性能测试

性能测试:测试数据量6千万行,分别执行三次,查看端到端执行时间

get_json_object: select count(1) from test_tbl1 where get_json_object(get_json_object(d, '$.a'), '$.b') = 'c';
优化前:53.194s, 52.18s, 51.603s
优化后:24.894s, 25.042s, 24.724s

and: select count(1) from test_tbl34 where id = 1 and d != 'axx' and d != 'cxx' and d != 'zxx';
优化前: 3.918s,4.106s,3.835s
优化后: 3.123s, 3.303s, 3.288s

or: select count(1) from test_tbl34 where id = 1 or d != 'axx' or d != 'cxx' or d != 'zxx';
优化前:2.611s,2.765s, 2.484s
优化后:2.3s, 2.409s, 2.283s;

get_struct_field: select count(1) from test_tbl31 where d.e.d.y = 'y123';
优化前:80.819s, 80.517s,83.3s
优化后:80.885s, 80.108s, 81.121s;

.internal()
.doc("Collapse nested functions as one for optimization.")
.stringConf
.createWithDefault("get_struct_field,get_json_object");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about and, or

@@ -97,7 +97,10 @@ object ExpressionConverter extends SQLConfHelper with Logging {
if (udf.udfName.isEmpty) {
throw new GlutenNotSupportException("UDF name is not found!")
}
val substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
var substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collapsedFunctionsMap和udf有什么关系?看起来逻辑上没必要耦合在一块,最好能解耦

import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.types.{DataType, DataTypes}

case class CollapseNestedExpressions(spark: SparkSession) extends Rule[SparkPlan] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add necessary comment

}

private def canBeOptimized(plan: SparkPlan): Boolean = plan match {
case p: ProjectExecTransformer =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can expression in generate operator be optimized ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems can not

@@ -462,8 +463,55 @@ class GetJsonObjectImpl

static size_t getNumberOfIndexArguments(const DB::ColumnsWithTypeAndName & arguments) { return arguments.size() - 1; }

bool insertResultToColumn(DB::IColumn & dest, typename JSONParser::Element & root, std::vector<std::shared_ptr<DB::GeneratorJSONPath<JSONParser>>> & generator_json_paths, size_t & json_path_pos) const
Copy link
Contributor

@taiyang-li taiyang-li Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgbo-ustc pls review changes related to get_json_object

@@ -59,9 +59,9 @@ class GetJSONObjectParser : public FunctionParser
DB::ActionsDAG & actions_dag) const override
{
const auto & args = substrait_func.arguments();
if (args.size() != 2)
if (args.size() < 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_json_object(get_json_object(d, '$.a'), '$.b') => optimize to get_json_object(d, '$.a', '$.b'), which may have more than 2 arguments.

@lgbo-ustc
Copy link
Contributor

lgbo-ustc commented Feb 12, 2025

解释下get_json_object的优化思路,不是太明白为何在这里要改成多个路径参数。
已经有 #8305 优化的情况下,这个改动带来哪些变化?

@lgbo-ustc
Copy link
Contributor

建议不同函数的优化拆分到不同的PR里面

mutable size_t total_normalized_rows = 0;

template<typename JSONParser, typename JSONStringSerializer>
void insertResultToColumn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be complex, I guess there should be a simpler implement with less branches

Copy link
Contributor

@lgbo-ustc lgbo-ustc Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should explain which case it is for each branch

@lgbo-ustc
Copy link
Contributor

性能测试

性能测试:测试数据量6千万行,分别执行三次,查看端到端执行时间

get_json_object: select count(1) from test_tbl1 where get_json_object(get_json_object(d, '$.a'), '$.b') = 'c';
优化前:53.194s, 52.18s, 51.603s
优化后:24.894s, 25.042s, 24.724s

and: select count(1) from test_tbl34 where id = 1 and d != 'axx' and d != 'cxx' and d != 'zxx';
优化前: 3.918s,4.106s,3.835s
优化后: 3.123s, 3.303s, 3.288s

or: select count(1) from test_tbl34 where id = 1 or d != 'axx' or d != 'cxx' or d != 'zxx';
优化前:2.611s,2.765s, 2.484s
优化后:2.3s, 2.409s, 2.283s;

get_struct_field: select count(1) from test_tbl31 where d.e.d.y = 'y123';
优化前:80.819s, 80.517s,83.3s
优化后:80.885s, 80.108s, 81.121s;

For get_struct_field, the coalesce is not necessary. get_struct_field is just to extract one of the nested column, the cost is tiny for a batch. I prefer to keep it not changed

@KevinyhZou
Copy link
Contributor Author

Yes, I will remove the changes of get_struct_field.

Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls for And/Or/GetStructField/GetJsonObject [GLUTEN-8557][CH]Optimize nested function calls for And/Or Feb 14, 2025
Copy link

Run Gluten Clickhouse CI on x86

2 similar comments
Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

@@ -142,6 +160,8 @@ object ExpressionConverter extends SQLConfHelper with Logging {
expr match {
case p: PythonUDF =>
return replacePythonUDFWithExpressionTransformer(p, attributeSeq, expressionsMap)
case s: ScalaUDF if CollapsedExpressionMappings.supported(s.udfName.getOrElse("")) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does any scala udf expression need to be collapsed?

Copy link
Contributor Author

@KevinyhZou KevinyhZou Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not support scala udf now, but it use scala udf to convert nested expressions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PHILO-HE can you help review code changes under gluten/substrait. In theory, it doesn't have impact on the behavior of velox backend.

Copy link
Contributor

@taiyang-li taiyang-li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

Run Gluten Clickhouse CI on x86

1 similar comment
Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

1 similar comment
Copy link

Run Gluten Clickhouse CI on x86

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work!

.internal()
.doc("Collapse nested functions as one for optimization.")
.stringConf
.createWithDefault("and,or");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KevinyhZou, do you find any unsuitable corner cases? E.g., wrong result, performance degradation. If no, can we always enable this optimization without introducing a config?

Copy link
Contributor Author

@KevinyhZou KevinyhZou Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not found unsuitable case for the ci testing and my own testing. But I'm afraid there maybe some unsuitable case in our online sqls, which the ci testing do not cover,So I think it‘s better to keep them

* ScalaUDF at first, and then pass the ScalaUDF to clickhouse backend. e.g. And(And(a=1,
* b=2),c=3) can be optimized to And(a=1, b=2, c=3),but And(a=1, b=2, c=3) can not be
* supported by spark `And` function, so we need to convert it to ScalaUDF, with name is
* `And`, and have 3 arguments, when pass the `ScalaUDF(#and(a=1,b=2,c=3))` to clickhouse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must we need a ScalaUDF? Can we make an optimized plan just compatible with the backend? Regardless of that it's not compatible with Spark.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the ScalaUDF, and re-implement this by use CHCollapseExpressionTransformer, help to review this pr ? thanks @PHILO-HE

@PHILO-HE PHILO-HE changed the title [GLUTEN-8557][CH]Optimize nested function calls for And/Or [GLUTEN-8557][CH] Collapse nested function calls for And/Or for performance optimization Feb 21, 2025
Copy link

github-actions bot commented Mar 3, 2025

Run Gluten Clickhouse CI on x86

2 similar comments
Copy link

github-actions bot commented Mar 3, 2025

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Mar 3, 2025

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Mar 4, 2025

Run Gluten Clickhouse CI on x86

2 similar comments
Copy link

github-actions bot commented Mar 6, 2025

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Mar 7, 2025

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Mar 7, 2025

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Mar 8, 2025

Run Gluten Clickhouse CI on x86

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLICKHOUSE CORE works for Gluten Core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CH]Optimize nested function calls
4 participants