[CORE] Remove some backend-specific code from common module #3363

Merged: 18 commits, Oct 16, 2023

Member

@zhztheplayer zhztheplayer commented Oct 10, 2023

This patch removes most of the legacy backend-specific code from the common modules (gluten-core, gluten-data, shims).

Some backend-conditional logic will remain in test code; it will be removed later.

After this patch is merged, we should avoid re-adding backend-specific code to the common modules whenever an alternative exists.

@zhztheplayer zhztheplayer marked this pull request as draft October 10, 2023 03:32
@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@github-actions

Run Gluten Clickhouse CI

@zhztheplayer
Member Author

/Benchmark Velox

@zhztheplayer zhztheplayer changed the title from "WIP: [CORE] Remove some backend-specific code from common module" to "[CORE] Remove some backend-specific code from common module" Oct 12, 2023

@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

| query | log/native_3363_time.csv | log/native_master_10_11_2023_dff5d8304_time.csv | difference | percentage |
| --- | --- | --- | --- | --- |
| q1 | 43.12 | 43.12 | 0.003 | 100.01% |
| q2 | 24.61 | 24.64 | 0.033 | 100.13% |
| q3 | 37.68 | 38.32 | 0.639 | 101.70% |
| q4 | 41.58 | 41.56 | -0.025 | 99.94% |
| q5 | 70.92 | 68.89 | -2.037 | 97.13% |
| q6 | 5.39 | 6.14 | 0.748 | 113.88% |
| q7 | 85.27 | 86.32 | 1.055 | 101.24% |
| q8 | 80.86 | 79.49 | -1.367 | 98.31% |
| q9 | 115.55 | 119.77 | 4.226 | 103.66% |
| q10 | 47.09 | 46.79 | -0.297 | 99.37% |
| q11 | 19.93 | 20.43 | 0.492 | 102.47% |
| q12 | 27.71 | 25.94 | -1.765 | 93.63% |
| q13 | 50.00 | 49.84 | -0.162 | 99.68% |
| q14 | 18.72 | 16.00 | -2.718 | 85.48% |
| q15 | 31.27 | 28.28 | -2.993 | 90.43% |
| q16 | 16.09 | 16.28 | 0.190 | 101.18% |
| q17 | 121.05 | 120.47 | -0.583 | 99.52% |
| q18 | 165.94 | 161.82 | -4.120 | 97.52% |
| q19 | 12.99 | 12.90 | -0.087 | 99.33% |
| q20 | 29.74 | 27.22 | -2.517 | 91.54% |
| q21 | 236.37 | 237.65 | 1.274 | 100.54% |
| q22 | 15.71 | 15.57 | -0.139 | 99.11% |
| total | 1297.58 | 1287.43 | -10.153 | 99.22% |
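
For readers interpreting the report: the last two columns look like they are derived from the first two, with the difference being the master time minus the PR time and the percentage being the master time divided by the PR time. That reading is an inference from the numbers, not something the bot states, and small mismatches (for example 0.748 versus 0.75 for q6) suggest the bot works from unrounded timings. A minimal Scala sketch under that assumption, with `PerfRow` and `PerfReport` as hypothetical names:

```scala
// Illustrative sketch; PerfRow and PerfReport are hypothetical names.
// Assumption: difference = master - pr and percentage = master / pr * 100.
final case class PerfRow(query: String, prTime: Double, masterTime: Double) {
  def difference: Double = masterTime - prTime
  def percentage: Double = masterTime / prTime * 100.0
}

object PerfReport {
  // The total row sums the raw times; it does not average per-query percentages.
  def total(rows: Seq[PerfRow]): PerfRow =
    PerfRow("total", rows.map(_.prTime).sum, rows.map(_.masterTime).sum)
}
```

Under this reading, a percentage above 100% means the PR build ran faster than master for that query.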

@zhztheplayer zhztheplayer marked this pull request as ready for review October 12, 2023 08:53

@zhztheplayer
Member Author

@zzcclp The patch includes a couple of clean-ups in both backends. Would you like to take a look?

@zhztheplayer zhztheplayer marked this pull request as ready for review October 13, 2023 02:21

override def metricsApi(): MetricsApi = new MetricsHandler
override def name(): String = VeloxBackend.BACKEND_NAME
override def buildInfo(): GlutenPlugin.BackendBuildInfo =
GlutenPlugin.BackendBuildInfo("Velox", VELOX_BRANCH, VELOX_REVISION, VELOX_REVISION_TIME)
Contributor

Can we just use BACKEND_NAME to replace Velox for consistency?

Member Author

@zhztheplayer zhztheplayer Oct 13, 2023

To me, the name in BackendBuildInfo is just descriptive text, unlike BACKEND_NAME, which is used for library loading. I think it's OK to use "Velox", "VELOX", or anything else that lets the user know it's the Velox backend.

In another patch we could introduce a separate constant "Velox" for information printed to end users and rename "BACKEND_NAME" to "BACKEND_LIB_NAME" or similar, although that is a trivial change that would not improve things much.
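
A minimal sketch of the split described in this comment, separating the identifier used for library loading from the text shown to users. Only `BACKEND_NAME`, `BACKEND_LIB_NAME` and `BackendBuildInfo` are mentioned in the thread; `DISPLAY_NAME`, the simplified case class, and the string values are assumptions made for illustration:

```scala
// Hypothetical sketch, not Gluten's actual code.
// BackendBuildInfo is simplified here to a plain case class.
final case class BackendBuildInfo(
    name: String,
    branch: String,
    revision: String,
    revisionTime: String)

object VeloxBackendNames {
  // Assumed value: identifier used to locate and load the native library.
  val BACKEND_LIB_NAME: String = "velox"
  // Descriptive text for end users only; never used for lookups.
  val DISPLAY_NAME: String = "Velox"

  def buildInfo(branch: String, revision: String, revisionTime: String): BackendBuildInfo =
    BackendBuildInfo(DISPLAY_NAME, branch, revision, revisionTime)
}
```

Keeping the two constants apart means the display text can change without any risk of breaking library loading.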

Contributor

How about adding some comments for BackendBuildInfo to clarify it is only for declarative purpose, and should not be used in real code?

Member Author

> How about adding some comments for BackendBuildInfo to clarify it is only for declarative purpose, and should not be used in real code?

Agreed, but can we do that in another PR? It's a little off-topic here.

Member Author

@zhztheplayer zhztheplayer Oct 13, 2023

This PR is meant to be just code movement across modules; improving the quality of all the relevant code is far beyond its mission. ;)

Contributor

OK, makes sense.

@@ -64,7 +64,7 @@ object FallbackUtil extends Logging with AdaptiveSparkPlanHelper {
}
}

- def isFallback(plan: SparkPlan): Boolean = {
+ def hasFallbacks(plan: SparkPlan): Boolean = {
Contributor

Maybe hasFallback is more suitable.

Member Author

done

@@ -73,13 +73,9 @@ class BatchScanExecTransformer(
}

override def getInputFilePaths: Seq[String] = {
if (BackendsApiManager.isVeloxBackend) {
Seq.empty[String]
Contributor

Do we need to add Velox specific logic in backends-velox?

Member Author

No. Velox backend doesn't use the output of the method.

Contributor

It's needed. relation.location.inputFiles.toSeq is an expensive call. The result of getInputFilePaths is not used, but this check is still needed until the CH backend implements more general partition parsing.
#2405
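
One way to keep the cost concern raised here without naming a backend in the common module is to let the backend declare whether it needs the paths at all. The sketch below is an assumption about how that could look; `BackendSettings`, `needsInputFilePaths` and `ScanNode` are hypothetical names, and `listFiles` merely stands in for an expensive call such as `relation.location.inputFiles`:

```scala
// Hypothetical sketch, not Gluten's actual API.
trait BackendSettings {
  // True only for backends that consume the input file paths.
  def needsInputFilePaths: Boolean
}

final class ScanNode(settings: BackendSettings, listFiles: () => Seq[String]) {
  // The expensive listing runs only when the backend asks for it.
  def getInputFilePaths: Seq[String] =
    if (settings.needsInputFilePaths) listFiles() else Seq.empty[String]
}
```

With this shape the guard survives, but the decision lives in the backend's settings rather than in an `isVeloxBackend` check.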

"true".equals(sparkSession.sparkContext.getLocalProperty("isNativeAppliable"))
&& GlutenConfig.isCurrentBackendVelox && false
) {
// why if (false)? Such code requires comments when being written
Contributor

We'd better follow a consistent comment style: capitalize the first letter and end every comment with a period.

Member Author

Sure.

@zhouyuan
Contributor

@zzcclp @lwz9103 could you please take a look at this patch? It refactors the Scala code.

thanks, -yuan

Contributor

@zhouyuan zhouyuan left a comment

It looks like "BackendsApiManager.isxxxBackend" has been removed. Can Gluten automatically pick the right code path?

@@ -69,7 +69,7 @@ trait CHFormatWriterInjects extends GlutenFormatWriterInjectsBase {
sparkSession: SparkSession,
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType] = {
throw new UnsupportedOperationException("CHFormatWriterInjects does not support inferSchema")
OrcUtils.inferSchema(sparkSession, files, options)
Contributor

@zzcclp this will change the behavior on ORC

Member Author

@zhztheplayer zhztheplayer Oct 13, 2023

Contributor

@taiyang-li please check this

Contributor

it is ok

@@ -225,7 +225,7 @@ public static void checkDecimalScale(int scale) {
}

public static ScalarFunctionNode makeScalarFunction(
- Long functionId, ArrayList<ExpressionNode> expressionNodes, TypeNode typeNode) {
+ Long functionId, List<ExpressionNode> expressionNodes, TypeNode typeNode) {
Contributor

Why is this change needed?

Member Author

Declaring a variable or parameter with the concrete ArrayList type is unusual practice. We should use the abstract List type as long as the code works.
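
The point about accepting the interface rather than the concrete class can be illustrated from Scala against the same `java.util` types. This is only a sketch: `ScalarFunctionSketch` is a hypothetical stand-in, and plain strings take the place of `ExpressionNode`:

```scala
import java.util.{ArrayList => JArrayList, Arrays, List => JList}

object ScalarFunctionSketch {
  // Strings stand in for ExpressionNode; only the parameter type matters here.
  // Accepting java.util.List lets callers pass any List implementation.
  def describe(functionId: Long, expressionNodes: JList[String]): String =
    s"function $functionId with ${expressionNodes.size()} argument(s)"
}

object ScalarFunctionSketchDemo {
  def run(): Seq[String] = {
    val concrete = new JArrayList[String](Arrays.asList("a", "b"))
    // Arrays.asList returns a fixed-size List that is not a java.util.ArrayList.
    val fixedSize: JList[String] = Arrays.asList("c")
    Seq(
      ScalarFunctionSketch.describe(1L, concrete),
      ScalarFunctionSketch.describe(2L, fixedSize))
  }
}
```

Had the parameter been declared as `java.util.ArrayList`, the second call would not compile, which is the flexibility the reply is arguing for.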

@zhztheplayer
Member Author

> It looks like "BackendsApiManager.isxxxBackend" has been removed. Can Gluten automatically pick the right code path?

Backend code is supposed to talk to the core module through the backend API. The main purpose of this patch is to clean up the code so that it follows that standard approach. As a result, those APIs were removed.
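
As a rough illustration of the approach described in this reply, the common module can ask a registered backend for behavior instead of branching on which backend is loaded. None of the names below are Gluten's real API; `BackendApi`, `BackendRegistry` and `CommonModule` are hypothetical and exist only for this sketch:

```scala
// Hypothetical sketch, not Gluten's actual API.
trait BackendApi {
  def name: String
  def inferSchemaSupported: Boolean
}

object BackendRegistry {
  @volatile private var current: Option[BackendApi] = None

  // Called once by whichever backend module is on the classpath.
  def register(backend: BackendApi): Unit = {
    current = Some(backend)
  }

  def get: BackendApi =
    current.getOrElse(throw new IllegalStateException("No backend registered"))
}

object CommonModule {
  // No `if (isVeloxBackend) ... else ...` branch: behavior comes from the backend.
  def canInferSchema: Boolean = BackendRegistry.get.inferSchemaSupported
}
```

The common code then has no compile-time knowledge of any specific backend, so adding a backend does not require touching it.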

@@ -53,12 +60,12 @@ object CHBackendSettings extends BackendSettingsApi with Logging {
// experimental: when the files count per partition exceeds this threshold,
// it will put the files into one partition.
val GLUTEN_CLICKHOUSE_FILES_PER_PARTITION_THRESHOLD: String =
- GlutenConfig.GLUTEN_CONFIG_PREFIX + GlutenConfig.GLUTEN_CLICKHOUSE_BACKEND +
+ GlutenConfig.GLUTEN_CONFIG_PREFIX + CHBackend.BACKEND_NAME +
Contributor

Would it be better to use "CHBackend.CONFIG_PREFIX" to replace "GlutenConfig.GLUTEN_CONFIG_PREFIX + CHBackend.BACKEND_NAME"?
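
A small sketch of what this suggestion could look like. `CONFIG_PREFIX` is the constant proposed in the comment; the object names ending in `Sketch` and all string values are placeholders chosen for illustration and are not confirmed against the Gluten source:

```scala
// Hypothetical sketch; string values are placeholders.
object GlutenConfigSketch {
  val GLUTEN_CONFIG_PREFIX: String = "spark.gluten.sql.columnar.backend."
}

object CHBackendSketch {
  val BACKEND_NAME: String = "ch"
  // Defined once, so individual keys no longer repeat the concatenation.
  val CONFIG_PREFIX: String = GlutenConfigSketch.GLUTEN_CONFIG_PREFIX + BACKEND_NAME
}

object CHBackendSettingsSketch {
  val FILES_PER_PARTITION_THRESHOLD: String =
    CHBackendSketch.CONFIG_PREFIX + ".files.per.partition.threshold"
}
```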

Contributor

CHBraodcastApi -> CHBroadcastApi

@@ -293,7 +293,7 @@ trait GlutenTestsTrait extends GlutenTestsCommonTrait {

def shouldNotFallback(): Unit = {
TestStats.offloadGluten = false
- if (BackendsApiManager.getBackendName != GlutenConfig.GLUTEN_CLICKHOUSE_BACKEND) {
+ if (BackendTestUtils.isCHBackendLoaded()) {
Contributor

if (!BackendTestUtils.isCHBackendLoaded()) ?

Member Author

Thanks for catching!
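
The inversion caught in this exchange can be checked with a tiny sketch. The real `BackendTestUtils.isCHBackendLoaded()` takes no arguments and its implementation is not shown on this page, so the version below, which takes the loaded backend name as a parameter and compares it with the placeholder `"ch"`, is purely an assumption for illustration:

```scala
// Hypothetical stand-in for BackendTestUtils; "ch" is a placeholder name.
object BackendTestUtilsSketch {
  def isCHBackendLoaded(loadedBackend: String): Boolean = loadedBackend == "ch"
}

object ShouldNotFallbackSketch {
  import BackendTestUtilsSketch.isCHBackendLoaded

  // Original condition: run the check for every backend except ClickHouse.
  def originalCondition(backend: String): Boolean = backend != "ch"
  // First rewrite in the PR: the negation was dropped, inverting the check.
  def firstRewrite(backend: String): Boolean = isCHBackendLoaded(backend)
  // Rewrite suggested by the reviewer: equivalent to the original.
  def fixedRewrite(backend: String): Boolean = !isCHBackendLoaded(backend)
}
```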

*/
package io.glutenproject.execution

abstract class GlutenClickHouseWholeStageTransformerSuite extends WholeStageTransformerSuite {}
Contributor

Why do we need to add this abstract class? Why not extend WholeStageTransformerSuite directly?

Member Author

Let's keep it as a place for test settings in the future. The Velox backend also has VeloxWholeStageTransformerSuite after this patch.

@zhztheplayer
Member Author

zhztheplayer commented Oct 16, 2023

Reviewers, since it has been several days since the PR became ready for review, I am planning to merge it by the end of today if there are no -1 comments. I will actively fix any issues caused by the changes afterwards, so please feel free to ping me if you find any.

Also please continue reviewing before merging. Thank you very much.

Contributor

@zhouyuan zhouyuan left a comment

👍 a big refactor!

@zhztheplayer zhztheplayer merged commit 8cd2d0a into apache:main Oct 16, 2023
14 checks passed
@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

| query | log/native_3363_time.csv | log/native_master_10_15_2023_c194db9e9_time.csv | difference | percentage |
| --- | --- | --- | --- | --- |
| q1 | 42.83 | 43.89 | 1.066 | 102.49% |
| q2 | 24.70 | 24.60 | -0.098 | 99.60% |
| q3 | 37.63 | 38.02 | 0.396 | 101.05% |
| q4 | 41.42 | 42.52 | 1.098 | 102.65% |
| q5 | 70.29 | 70.23 | -0.053 | 99.92% |
| q6 | 6.60 | 6.55 | -0.049 | 99.26% |
| q7 | 83.77 | 85.59 | 1.822 | 102.17% |
| q8 | 81.42 | 81.59 | 0.179 | 100.22% |
| q9 | 117.32 | 117.25 | -0.073 | 99.94% |
| q10 | 47.74 | 47.89 | 0.144 | 100.30% |
| q11 | 19.03 | 19.78 | 0.755 | 103.96% |
| q12 | 24.97 | 26.56 | 1.586 | 106.35% |
| q13 | 55.51 | 51.92 | -3.594 | 93.53% |
| q14 | 15.01 | 14.07 | -0.946 | 93.70% |
| q15 | 25.57 | 29.96 | 4.398 | 117.20% |
| q16 | 16.07 | 16.14 | 0.071 | 100.44% |
| q17 | 123.60 | 121.36 | -2.238 | 98.19% |
| q18 | 163.90 | 163.69 | -0.213 | 99.87% |
| q19 | 12.33 | 12.12 | -0.210 | 98.30% |
| q20 | 25.55 | 28.87 | 3.317 | 112.98% |
| q21 | 237.53 | 238.65 | 1.114 | 100.47% |
| q22 | 15.85 | 15.83 | -0.027 | 99.83% |
| total | 1288.63 | 1297.07 | 8.443 | 100.66% |

8 participants