[CH] Optimize aggregate state serialization performance #3279

Merged: 14 commits, Nov 16, 2023

@liuneng1994 commented Sep 26, 2023
Contributor

What changes were proposed in this pull request?

Optimize aggregate state serialization performance: convert fixed-size aggregate states to FixedString,
and convert variable-size aggregate states to String.
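
The fixed-size half of this idea can be illustrated with a minimal, self-contained sketch. It does not use the ClickHouse classes touched by the PR: `AvgState`, `packFixedStates` and `unpackFixedState` are hypothetical names, and a plain `std::vector<std::uint8_t>` stands in for the character buffer of a FixedString column. The point is only that a state whose size is known up front can be copied as raw bytes into fixed-width rows, with no per-value serialization step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed-size aggregate state (illustration only, not a ClickHouse type).
struct AvgState
{
    double sum;
    std::uint64_t count;
};

// Copy each state into one fixed-width row of a contiguous byte buffer,
// the shape a FixedString(sizeof(AvgState)) column would have.
std::vector<std::uint8_t> packFixedStates(const std::vector<AvgState> & states)
{
    std::vector<std::uint8_t> chars;
    chars.reserve(states.size() * sizeof(AvgState));
    for (const AvgState & state : states)
    {
        const auto * begin = reinterpret_cast<const std::uint8_t *>(&state);
        chars.insert(chars.end(), begin, begin + sizeof(AvgState));
    }
    return chars;
}

// Read row `row` back into a state object by copying its bytes.
AvgState unpackFixedState(const std::vector<std::uint8_t> & chars, std::size_t row)
{
    assert((row + 1) * sizeof(AvgState) <= chars.size());
    AvgState state;
    std::memcpy(&state, chars.data() + row * sizeof(AvgState), sizeof(AvgState));
    return state;
}
```

Variable-size states cannot be laid out this way because each row has a different length, which is why the description routes them through String instead.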

How was this patch tested?


@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename the commit message and pull request title to the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

@github-actions

Run Gluten Clickhouse CI

@baibaichen
Contributor

--conf spark.sql.autoBroadcastJoinThreshold=100MB
HDFS
TPCH 100

[image]

cpp-ch/local-engine/Builder/SerializedPlanBuilder.cpp (outdated)
return col;
}
const auto *aggregate_col = checkAndGetColumn<ColumnAggregateFunction>(*col.column);
size_t state_size = aggregate_col->getAggregateFunction()->sizeOfData();
Contributor

Is the state size of max(String) and collect_list(x) fixed?

Contributor Author

collect_list is excluded. max(String) currently uses sort aggregate and falls back.

}

size_t NativeWriter::write(const DB::Block & block)
{
Contributor

What is the purpose of IndexOfBlockForNativeFormat in the original NativeWriter?

Contributor Author

It holds some index data, which is useless for us.

@liuneng1994 force-pushed the optimize-agg-state-serialize branch from bc8c977 to bde57b9 on September 27, 2023, 12:18

{
/** If there are columns-constants - then we materialize them.
* (Since the data type does not know how to serialize / deserialize constants.)
*/
Contributor

Maybe we could design a protocol that writes less data for constant columns, to reduce the amount of I/O.

Contributor Author

This was copied over from ClickHouse (CH). No optimization for now.

bool isFixedSizeStateAggregateFunction(const String& name)
{
// TODO max(String) should exclude, but fallback now
static const std::set<String> function_set = {"min", "max", "sum", "count", "avg"};
Contributor

According to https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html, other functions, such as mean, could perhaps be added to the set as well.

Contributor Author

I will switch to a different approach here that supports both fixed-length and variable-length aggregate states.

Contributor Author

Let's use this for now and update it later when we run into such a case.

@liuneng1994 force-pushed the optimize-agg-state-serialize branch from 1f50efc to de46b01 on October 31, 2023, 01:58

@liuneng1994 force-pushed the optimize-agg-state-serialize branch from f3dd446 to c466268 on October 31, 2023, 03:20

@liuneng1994 force-pushed the optimize-agg-state-serialize branch from 83b85ba to de32d27 on November 3, 2023, 09:47

@liuneng1994 force-pushed the optimize-agg-state-serialize branch from de32d27 to e0e8249 on November 5, 2023, 14:40

@liuneng1994 changed the title from "[WIP] [CH] Optimize aggregate state serialization performance" to "[CH] Optimize aggregate state serialization performance" on Nov 6, 2023
@@ -438,8 +438,10 @@ class GlutenAdaptiveQueryExecSuite extends AdaptiveQueryExecSuite with GlutenSQL
test("gluten Exchange reuse") {
withSQLConf(
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "100",
SQLConf.SHUFFLE_PARTITIONS.key -> "5") {
// magic threshold, ch backend has two bhj when threshold is 100
Contributor

@PHILO-HE @rui-mo please help to check, thanks.

Contributor

LGTM.

size_t state_size = aggregate_col->getAggregateFunction()->sizeOfData();
auto res_type = std::make_shared<DataTypeFixedString>(state_size);
auto res_col = res_type->createColumn();
PaddedPODArray<UInt8> & column_chars_t = assert_cast<ColumnFixedString &>(*res_col).getChars();
Contributor

Using ColumnFixedString::reserve would be better.

Contributor Author

Same as below.

Comment on lines +66 to +69
for (const auto & item : aggregate_col->getData())
{
column_chars_t.insert_assume_reserved(item, item + state_size);
}
Contributor

For small state objects, could we try to use memcpy?

Too many function calls may cause a performance issue.

Contributor Author

This follows readColumnWithNumericData in arrowColumnToCHColumn. There should be no performance problem.
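
For reference, the two copy strategies compared in this thread can be sketched side by side. This is an illustration under stated assumptions, not the PR's code: `std::vector::insert` after a single `reserve` stands in for the `insert_assume_reserved` loop, and `packWithInsert` and `packWithMemcpy` are hypothetical names. Both variants produce the same bytes; which one is faster would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Variant 1: reserve once, then range-insert each state (mirrors the loop under review).
Bytes packWithInsert(const std::vector<const std::uint8_t *> & states, std::size_t state_size)
{
    Bytes out;
    out.reserve(states.size() * state_size);
    for (const std::uint8_t * state : states)
        out.insert(out.end(), state, state + state_size);
    return out;
}

// Variant 2: size the buffer once, then memcpy each state into its slot.
Bytes packWithMemcpy(const std::vector<const std::uint8_t *> & states, std::size_t state_size)
{
    Bytes out(states.size() * state_size);
    if (state_size == 0)
        return out; // nothing to copy; avoids calling memcpy on an empty buffer
    std::uint8_t * dst = out.data();
    for (const std::uint8_t * state : states)
    {
        std::memcpy(dst, state, state_size);
        dst += state_size;
    }
    return out;
}
```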

@liuneng1994 force-pushed the optimize-agg-state-serialize branch from e0e8249 to 7e54520 on November 14, 2023, 08:46

@lgbo-ustc
Contributor

LGTM


return isFixedSizeStateAggregateFunction(function->getName()) && isFixedSizeArguments(function->getArgumentTypes());
}

DB::ColumnWithTypeAndName convertAggregateStateToFixedString(DB::ColumnWithTypeAndName col)
Contributor

Try to avoid passing by value, so that there is no unnecessary copy overhead.

}
return DB::ColumnWithTypeAndName(std::move(res_col), type, col.name);
}
DB::Block convertAggregateStateInBlock(DB::Block block)
Contributor

Avoid passing by value.


bool isFixedSizeArguments(DataTypes data_types)
{
return data_types.front()->isValueRepresentedByNumber();
Contributor

Would isValueUnambiguouslyRepresentedInFixedSizeContiguousMemoryRegion be a more accurate interface to use here?

Contributor Author

Let's restrict it to numeric types for now. Other types have not been tested.

}
DB::Block convertAggregateStateInBlock(DB::Block block)
{
ColumnsWithTypeAndName columns;
Contributor

Add a reserve call.
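
A minimal sketch of what the reserve suggestion amounts to, assuming the converted block has as many columns as the input. `NamedColumn` and `convertColumns` are hypothetical stand-ins for `ColumnWithTypeAndName` and the function under review; a `std::vector` plays the role of `ColumnsWithTypeAndName`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a named column (illustration only).
struct NamedColumn
{
    std::string name;
};

// Build the output vector with a single up-front allocation,
// since the number of output columns is known to equal the input size.
std::vector<NamedColumn> convertColumns(const std::vector<NamedColumn> & input)
{
    std::vector<NamedColumn> output;
    output.reserve(input.size());
    for (const NamedColumn & column : input)
        output.push_back(NamedColumn{column.name + "_converted"});
    return output;
}
```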

return data_types.front()->isValueRepresentedByNumber();
}

bool isFixedSizeAggregateFunction(DB::AggregateFunctionPtr function)
Contributor

Avoid passing by value. Other similar places can be optimized in the same way.


DB::ColumnWithTypeAndName convertAggregateStateToFixedString(DB::ColumnWithTypeAndName col)
{
if (!WhichDataType(col.type).isAggregateFunction())
Contributor

This check can be removed. Instead, check below whether aggregate_col is nullptr.

@taiyang-li Nov 14, 2023
Contributor

Could a type such as AggregateFunction(sum, Nullable(Int64)) take the fixed string path?
At the moment it actually takes the variable-length string path.

Contributor Author

That is a bug. The Nullable wrapper needs to be removed first.
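
The Nullable issue raised here can be illustrated with a small model. Everything below is hypothetical: `Type`, `makeType`, `makeNullable`, `stripNullable` and `canUseFixedString` are simplified stand-ins and not the ClickHouse `IDataType` API. The sketch only shows why the argument check should look through the Nullable wrapper before asking whether the value is numeric.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>

// Hypothetical, simplified type model (NOT the ClickHouse IDataType API).
struct Type;
using TypePtr = std::shared_ptr<const Type>;

struct Type
{
    std::string name;
    bool is_number;
    TypePtr nested; // set only for Nullable(T)
};

TypePtr makeType(const std::string & name, bool is_number)
{
    return std::make_shared<const Type>(Type{name, is_number, nullptr});
}

TypePtr makeNullable(const TypePtr & nested)
{
    return std::make_shared<const Type>(Type{"Nullable(" + nested->name + ")", false, nested});
}

// Look through a Nullable wrapper, if there is one.
TypePtr stripNullable(const TypePtr & type)
{
    return type->nested ? type->nested : type;
}

// Same function names as the set shown in the diff above.
bool isFixedSizeStateFunctionName(const std::string & name)
{
    static const std::set<std::string> names = {"min", "max", "sum", "count", "avg"};
    return names.count(name) > 0;
}

// Without stripNullable, sum(Nullable(Int64)) would be rejected here
// and would take the variable-length path.
bool canUseFixedString(const std::string & function_name, const TypePtr & first_argument)
{
    return isFixedSizeStateFunctionName(function_name) && stripNullable(first_argument)->is_number;
}
```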

auto res_type = std::make_shared<DataTypeString>();
auto res_col = res_type->createColumn();
PaddedPODArray<UInt8> & column_chars = assert_cast<ColumnString &>(*res_col).getChars();
column_chars.reserve(aggregate_col->size() * 60);
Contributor

Where does the number 60 come from? Could it allocate more memory than is actually needed?

Contributor Author

That was test code. Removed.


}

-bool isFixedSizeAggregateFunction(DB::AggregateFunctionPtr function)
+bool isFixedSizeAggregateFunction(const DB::AggregateFunctionPtr& function)
@zhanglistar Nov 15, 2023
Contributor

For a shared_ptr there is no need to pass by reference. Just pass it by value.

Contributor Author

Let's leave it as is. It makes little difference.
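
The difference being debated is small but observable. In the sketch below, `Fn`, `FnPtr`, `useCountByValue` and `useCountByRef` are hypothetical names chosen for illustration. Passing a `std::shared_ptr` by value copies the pointer and bumps the reference count for the duration of the call, while passing by const reference does not.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an aggregate function object.
struct Fn
{
    int id;
};
using FnPtr = std::shared_ptr<const Fn>;

// By value: the parameter is a second owner while the call runs.
long useCountByValue(FnPtr function)
{
    return function.use_count();
}

// By const reference: no copy, so the count is unchanged.
long useCountByRef(const FnPtr & function)
{
    return function.use_count();
}
```

For a function that only reads through the pointer and is called once per column, either signature works, which is consistent with the author's reply.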

for (const auto & item : aggregate_col->getData())
{
aggregate_col->getAggregateFunction()->serialize(item, value_writer);
writeChar('\0', value_writer);
Contributor

Is this necessary?

Contributor Author

Yes, it is required. Strings need to end with \0.
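
A sketch of the layout this answer refers to, assuming each value in the String column's character buffer is followed by a zero byte and each offset points just past that byte. `StringColumnSketch`, `appendSerializedState` and `readRow` are hypothetical names, and plain `std::vector`s stand in for the column's internal arrays. Because row boundaries come from the offsets rather than from scanning for the terminator, serialized states that contain zero bytes still round-trip.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a String column: one byte buffer plus end offsets.
struct StringColumnSketch
{
    std::vector<std::uint8_t> chars;
    std::vector<std::uint64_t> offsets; // offsets[i] = end of row i, terminator included
};

// Append one serialized state, then the terminating zero byte, then record the offset.
void appendSerializedState(StringColumnSketch & column, const std::string & bytes)
{
    column.chars.insert(column.chars.end(), bytes.begin(), bytes.end());
    column.chars.push_back(0);
    column.offsets.push_back(column.chars.size());
}

// Recover row `row`; the length comes from the offsets, minus the terminator.
std::string readRow(const StringColumnSketch & column, std::size_t row)
{
    assert(row < column.offsets.size());
    const std::size_t begin = row == 0 ? 0 : static_cast<std::size_t>(column.offsets[row - 1]);
    const std::size_t end = static_cast<std::size_t>(column.offsets[row]) - 1;
    return std::string(reinterpret_cast<const char *>(column.chars.data()) + begin, end - begin);
}
```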

@baibaichen force-pushed the optimize-agg-state-serialize branch from cfa0a4f to 80ec518 on November 16, 2023, 01:36

@lgbo-ustc
Contributor

LGTM

@taiyang-li
Contributor

LGTM

@baibaichen left a comment
Contributor

cool, let's merge

@baibaichen merged commit aab6dbe into apache:main on Nov 16, 2023
17 checks passed