-
Notifications
You must be signed in to change notification settings - Fork 454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[VL] Fix ORC related failed UT #3437
Conversation
…#3418 What changes were proposed in this pull request? (Please fill in changes proposed in this fix) Fixes: apache#3417 How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) unit tests (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
This operator should be removed out of WholeStageTransform Signed-off-by: Yuan Zhou <[email protected]>
Thanks for opening a pull request! Could you open an issue for this pull request on Github Issues? https://github.com/oap-project/gluten/issues Then could you also rename commit message and pull request title in the following format?
See also: |
Run Gluten Clickhouse CI |
1 similar comment
Run Gluten Clickhouse CI |
Make the needed changes to integrate with upstream Velox, and start to depend on a new Velox branch `update`, which shall be rebased frequently to keep updated with Velox upstream. Changes mainly including: Change to use Velox provided Arrow 13.0. Integrate with Velox decimal type and date type refactor. Integrate with Velox memory module updates. Integrate with Velox parquet writer updates. Function integration and fixes. Solve Velox library linking issues, and fix Gluten CI issues. Lacks: ORC support Performance gap on TPC-H/DS GHA docker image update --------- Co-authored-by: JiaKe <[email protected]> Co-authored-by: Rong Ma <[email protected]> Co-authored-by: zhao, zhenhui <[email protected]> Co-authored-by: Hongze Zhang <[email protected]> Co-authored-by: Joey <[email protected]> Co-authored-by: PHILO-HE <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just merged upstream_velox branch into main. Could you change this PR to target on the main branch?
@@ -403,13 +403,11 @@ class VeloxTestSettings extends BackendTestSettings { | |||
enableSuite[GlutenOrcPartitionDiscoverySuite] | |||
.exclude("read partitioned table - normal case") | |||
.exclude("read partitioned table - with nulls") | |||
.disableByReason("Blocked by ORC Velox upstream not ready") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some ORC tests are disabled with the message ReaderFactory is not registered for format orc
like VeloxTestSettings, could you also enable them in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will try it latter
Run Gluten Clickhouse CI |
Please change this PR to target on the main branch, thanks. |
@@ -358,8 +356,6 @@ class VeloxTestSettings extends BackendTestSettings { | |||
.exclude("Return correct results when data columns overlap with partition " + | |||
"columns (nested data)") | |||
.exclude("SPARK-31116: Select nested schema with case insensitive mode") | |||
// ReaderFactory is not registered for format orc. | |||
.exclude("SPARK-15474 Write and read back non-empty schema with empty dataframe - orc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below tests are all about ORC issue, thanks.
https://github.com/oap-project/gluten/blob/main/gluten-ut/spark32/src/test/scala/io/glutenproject/utils/velox/VeloxTestSettings.scala#L361-L368
To avoid partial aggregation flushing on a low cardinality case.
The failed tests |
* Refine the document about parquet write
2306a87
to
8afa641
Compare
Run Gluten Clickhouse CI |
disable GlutenOrcV2AggregatePushDownSuite, Velox don’t support ORC v2 yet. |
8afa641
to
cbacfc5
Compare
Run Gluten Clickhouse CI |
cbacfc5
to
37eb835
Compare
Run Gluten Clickhouse CI |
cpp/velox/compute/VeloxBackend.cc
Outdated
@@ -239,6 +239,7 @@ void VeloxBackend::init(const std::unordered_map<std::string, std::string>& conf | |||
registerConnector(hiveConnector); | |||
velox::parquet::registerParquetReaderFactory(); | |||
velox::dwrf::registerDwrfReaderFactory(); | |||
velox::dwrf::registerOrcReaderFactory(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR #3445 is going to remove the registration for Parquet and DWRF reader. I'm wondering if the registration for ORC reader can also be done in Velox, then this change can also be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, Adjusted this in PR 417
After that merge, I can submit a new PR base on main branch
rebase velox to 2023-11-15 arrow version changed to 14.1.0
* Avoid unnecessary filter binding for subfield
…pache#3650) * move getLocalFilesNode logic to scan transformer
…pache#3753) By default, file handle cache is disabled in gluten for performance consideration.
What changes were proposed in this pull request? (Please fill in changes proposed in this fix) (Fixes: apache#3668) How was this patch tested? TEST BY UT 性能测试数据 3000W 行正常数据 测试SQL: select count(1) from $test_tbl where to_date($col) > '1990-01-01' PR改动前耗时: 2.983s, 2.686s, 2.804s PR改动后耗时: 2.94s,2.861s,2.842s; 3000W行数据 (其中2500W行是NULL,500W是正常数据) 测试SQL: select count(1) from $test_tbl where to_date($col) > '1990-01-01' PR改动前耗时:0.621s, 0.614s, 0.677s PR改动后耗时:0.631s,0.641s,0.692s; 3000W行数据 (其中2500W数据是不符合日期格式的随机字符串,500W行是正常数据) 测试SQL: select count(1) from $test_tbl where to_date($col) > '1990-01-01' PR改动前耗时:6.148s,6.018s,5.845s PR改动后耗时:3.188s,3.055s,3.08s 对比发现,正常数据测试情况下性能接近,在某些异常场景下性能有所提升
) * [GLUTEN-1632][CH]Daily Update Clickhouse Version (20231117) * fix build due to ClickHouse/ClickHouse#56664 --------- Co-authored-by: kyligence-git <[email protected]> Co-authored-by: Chang Chen <[email protected]>
remove redundant velox build Signed-off-by: Yuan Zhou <[email protected]>
Co-authored-by: Hongze Zhang <[email protected]>
* Support get native plan tree string * fix test * fix * address comments --------- Co-authored-by: Kent Yao <[email protected]>
apache#3781) When there are more than one input row in the final stage with the bloom_filter_agg, it will be core dump for the CH backend. The RC is: when merging values in the final stage, the input data maybe a non-init AggregateFunctionGroupBloomFilterData, it will use the wrong filter size and filter_hashes values to init the first AggregateFunctionGroupBloomFilterData, which leads to set the wrong filter size when merging values. Close apache#3779.
This PR add support for Google Cloud Storage using the Velox GCS Filesystem.
update pacakge.sh to cover spark34 Signed-off-by: Yuan Zhou <[email protected]>
…intermediate types (apache#3721)
@chenxu14 We have enabled spark 3.4 in Gluten. Can you help to also enable the orc unit test in Spark 3.4? Thanks. |
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Yuan Zhou <[email protected]>
Support inject native plan string to Spark explain. A new param `spark.gluten.sql.injectNativePlanStringToExplain` is introduced to enable this feature.
c1e8875
to
fc04d0c
Compare
Run Gluten Clickhouse CI |
move work to #3805 |
No description provided.