Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-3547][CORE] [VL] Add native parquet writer in spark 3.4 #3690

Merged
merged 5 commits into from
Dec 27, 2023

Conversation

JkSelf
Copy link
Contributor

@JkSelf JkSelf commented Nov 13, 2023

What changes were proposed in this pull request?

Since the introduction of the WriteFilesExec operator in Spark 3.4 to facilitate write operations (SPARK-41708), we can now utilize this operator to enable native Parquet write. This PR introduces the WriteFilesExecTransformer to delegate the process to the Velox TableWriteNode, and ColumnarWriteFilesExec is added to implement the doExecuteWrite() method by converting RDD[ColumnarBatch] to RDD[WriterCommitMessage]

Due to the differences between vanilla Spark and Velox, this PR has dependencies on the following three upstream Velox PRs:
facebookincubator/velox#8089
facebookincubator/velox#8090
facebookincubator/velox#8091

Limitations and failed unit test recorded here.

How was this patch tested?

Enable spark.gluten.sql.native.writer.enabled config in spark 3.4 unit test.

Copy link

#3547

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

1 similar comment
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

@JkSelf JkSelf force-pushed the write-transformer branch from acb1897 to 44fd20a Compare December 6, 2023 10:07
Copy link

github-actions bot commented Dec 6, 2023

Run Gluten Clickhouse CI

@JkSelf JkSelf force-pushed the write-transformer branch from 44fd20a to 97c9b92 Compare December 6, 2023 10:41
Copy link

github-actions bot commented Dec 6, 2023

Run Gluten Clickhouse CI

4 similar comments
Copy link

github-actions bot commented Dec 6, 2023

Run Gluten Clickhouse CI

Copy link

github-actions bot commented Dec 6, 2023

Run Gluten Clickhouse CI

Copy link

github-actions bot commented Dec 7, 2023

Run Gluten Clickhouse CI

Copy link

github-actions bot commented Dec 7, 2023

Run Gluten Clickhouse CI

@JkSelf JkSelf force-pushed the write-transformer branch from 0fa26d0 to f211d53 Compare December 7, 2023 10:03
Copy link

github-actions bot commented Dec 7, 2023

Run Gluten Clickhouse CI

@JkSelf JkSelf force-pushed the write-transformer branch from f211d53 to bf51603 Compare December 7, 2023 10:25
Copy link

github-actions bot commented Dec 7, 2023

Run Gluten Clickhouse CI

3 similar comments
Copy link

github-actions bot commented Dec 8, 2023

Run Gluten Clickhouse CI

Copy link

github-actions bot commented Dec 8, 2023

Run Gluten Clickhouse CI

Copy link

github-actions bot commented Dec 8, 2023

Run Gluten Clickhouse CI

@JkSelf JkSelf force-pushed the write-transformer branch from 247dda2 to 91e1e8f Compare December 8, 2023 12:56
Copy link

github-actions bot commented Dec 8, 2023

Run Gluten Clickhouse CI

@JkSelf JkSelf force-pushed the write-transformer branch from 91e1e8f to aa51057 Compare December 8, 2023 23:13
Copy link

github-actions bot commented Dec 8, 2023

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

1 similar comment
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

ulysses-you
ulysses-you previously approved these changes Dec 27, 2023
Copy link
Contributor

@ulysses-you ulysses-you left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @JkSelf

Copy link

Run Gluten Clickhouse CI

@@ -267,7 +268,7 @@ case class FallbackEmptySchemaRelation() extends Rule[SparkPlan] {
TransformHints.tagNotTransformable(p, "at least one of its children has empty output")
p.children.foreach {
child =>
if (child.output.isEmpty) {
if (child.output.isEmpty && !child.isInstanceOf[WriteFilesExec]) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason of making the change? Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhztheplayer The output of WriteFilesExec is empty. So it will fallback if no this limitation.

Comment on lines +500 to +505
val Array(major, minor, _) = SparkShimLoader.getSparkShims.getShimDescriptor.toString.split('.')
if (major.toInt > 3 || (major.toInt == 3 && (minor.toInt >= 4))) {
List()
} else {
List(spark => NativeWritePostRule(spark))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we somehow do this in shim? Would that require for lot of effort?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhztheplayer Good catch. I will add this in shim in the following PRs. Thanks.

@JkSelf JkSelf merged commit fd33a93 into apache:main Dec 27, 2023
18 checks passed
@WangGuangxin
Copy link
Contributor

@JkSelf Just want to clarify the recommend way to use native parquet writer was to use WriteFilesExec and delegate to native TableWriteNode right? It supports both dynamic partition writer and single partition writer? And after that, both FakeRow and VeloxWriteQueue are useless, right?

@JkSelf
Copy link
Contributor Author

JkSelf commented Jul 24, 2024

@JkSelf Just want to clarify the recommend way to use native parquet writer was to use WriteFilesExec and delegate to native TableWriteNode right? It supports both dynamic partition writer and single partition writer? And after that, both FakeRow and VeloxWriteQueue are useless, right?

@WangGuangxin Yes. You are right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants