Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-3361] Support spark 3.4 in Gluten #3360

Merged
merged 2 commits into from
Oct 20, 2023

Conversation

JkSelf
Copy link
Contributor

@JkSelf JkSelf commented Oct 10, 2023

What changes were proposed in this pull request?

Follow up #3262.

Mainly changes:

  1. In Spark 3.4, there was an API change in BatchScanExec. You can find the details here. A new table parameter has been added, so it needs to be included in the shim layer.
  2. We have copied some code from vanilla Spark in FileSourceScanExecTransformer. You can find it here. In Spark 3.3 and 3.2, the copied code was private, but in Spark 3.4, it is protected. Therefore, it should be placed in the shim layer. Additionally, GlutenTimeMetric.scala and Arm.scala need to be placed in the shim layer to resolve compilation issues.
  3. Offset.scala is newly introduced in Spark 3.4. Hence, it should be added to the Spark 3.2 and Spark 3.3 shim layers.
  4. Empty2Null is in org.apache.spark.sql.execution.datasources.FileFormatWriter package in spark 3.2 and spark 3.3. So we also need add Empty2Null in org.apache.spark.sql.execution.datasources.FileFormatWriter package in Spark 3.4 to pass the compile issue.
  5. The PartitionedFile API changed the filePath parameter from string to SparkPath object in 3.4.
  6. In Spark 3.4, StatFunctions.scala is no longer needed in shim layer. Therefore, it has been moved from shim common to spark 3.3 and shim spark 3.2 shim layer.
  7. PromotePrecision.scala has been deleted in Spark 3.4.
  8. The unused DwrfScan code has been removed from the Spark 3.2 and Spark 3.3 shim layers.
  9. Since DataSourceStrategyUtil.scala is not used in gluten, it has been deleted.
  10. A new API, visitOffset(p: Offset), has been added to the LogicalPlanVisitor in Spark 3.4.

How was this patch tested?

Existing tests.

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@github-actions
Copy link

Run Gluten Clickhouse CI

@JkSelf JkSelf changed the title Upgrade spark3.4 [GLUTEN-3361] Upgrade spark version to 3.4 in Gluten Oct 10, 2023
@github-actions
Copy link

#3361

@github-actions
Copy link

Run Gluten Clickhouse CI

@zhouyuan zhouyuan mentioned this pull request Oct 10, 2023
@JkSelf JkSelf changed the title [GLUTEN-3361] Upgrade spark version to 3.4 in Gluten [GLUTEN-3361][WIP] Upgrade spark version to 3.4 in Gluten Oct 10, 2023
@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@JkSelf JkSelf changed the title [GLUTEN-3361][WIP] Upgrade spark version to 3.4 in Gluten [GLUTEN-3361]Upgrade spark version to 3.4 in Gluten Oct 10, 2023
@PHILO-HE PHILO-HE changed the title [GLUTEN-3361]Upgrade spark version to 3.4 in Gluten [GLUTEN-3361] Support spark 3.4 in Gluten Oct 11, 2023
@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@JkSelf
Copy link
Contributor Author

JkSelf commented Oct 16, 2023

@zzcclp Can you help to look the failed two unit tests in CH backend? Thanks.

@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

2 similar comments
@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

1 similar comment
@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

1 similar comment
@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@@ -40,6 +40,8 @@ class GlutenDriverEndpoint extends IsolatedRpcEndpoint with Logging {
private val driverEndpoint: RpcEndpointRef =
rpcEnv.setupEndpoint(GlutenRpcConstants.GLUTEN_DRIVER_ENDPOINT_NAME, this)

// TODO(yuan): get thread cnt from spark context
override def threadCount(): Int = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lwz9103 @zzcclp will this change threadCount() impact CK backend?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default value is 1 , right ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems ok to me.

Copy link
Contributor

@zhouyuan zhouyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's better to also update dev/buildbundle-veloxbe.sh to build spark 3.4 package

@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@zzcclp
Copy link
Contributor

zzcclp commented Oct 19, 2023

LGTM

Copy link
Contributor

@zhouyuan zhouyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 works on my local TPC-H/DS

@zhouyuan zhouyuan merged commit 3e73f32 into apache:main Oct 20, 2023
14 checks passed
@zhouyuan
Copy link
Contributor

notes on TODOs:

  • the UT for Spark 3.4 will be added in follow up commits.
  • The docker image in GHA needs to be updated with Spark 3.4 package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants