feat: Enable Comet broadcast by default #213
Conversation
case plan
    if isCometNative(plan) &&
      plan.children.exists(_.isInstanceOf[BroadcastExchangeExec]) =>
  val newChildren = plan.children.map {
    case b: BroadcastExchangeExec
        if isCometNative(b.child) &&
          isCometOperatorEnabled(conf, "broadcastExchangeExec") =>
Using the common operator-enable config to control the broadcast operator, the same as for other operators.
val operatorDisabledFlag = s"$COMET_EXEC_CONFIG_PREFIX.$operator.disabled"
conf.getConfString(operatorFlag, "false").toBoolean || isCometAllOperatorEnabled(conf) &&
  !conf.getConfString(operatorDisabledFlag, "false").toBoolean
This is added to be able to disable a specific operator individually.
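For illustration, a minimal self-contained sketch of how the two flags might combine. The `spark.comet.exec` prefix, the `.enabled`/`.disabled` suffixes, and the `all.enabled` key are assumptions standing in for `COMET_EXEC_CONFIG_PREFIX`, `operatorFlag`, and `isCometAllOperatorEnabled`; they are not necessarily the exact Comet config keys.

```scala
// Hypothetical sketch, not the actual CometConf logic.
object OperatorFlagSketch {
  // Assumed prefix; the value of COMET_EXEC_CONFIG_PREFIX is not shown in this diff.
  private val prefix = "spark.comet.exec"

  def isOperatorEnabled(conf: Map[String, String], operator: String): Boolean = {
    val enabledFlag  = s"$prefix.$operator.enabled"
    val disabledFlag = s"$prefix.$operator.disabled"
    // Assumed key standing in for isCometAllOperatorEnabled(conf).
    val allEnabled   = conf.getOrElse(s"$prefix.all.enabled", "false").toBoolean

    // Enabled explicitly, or covered by the "enable all operators" switch,
    // unless this particular operator was explicitly disabled.
    conf.getOrElse(enabledFlag, "false").toBoolean ||
      (allEnabled && !conf.getOrElse(disabledFlag, "false").toBoolean)
  }
}
```

With this shape, `Map("spark.comet.exec.all.enabled" -> "true", "spark.comet.exec.broadcastExchangeExec.disabled" -> "true")` yields `false` for `broadcastExchangeExec` while other operators stay enabled, which is the behavior the comment above describes.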
Codecov Report. Attention: Patch coverage is

Additional details and impacted files

@@             Coverage Diff              @@
##               main     #213      +/-  ##
============================================
+ Coverage     33.48%   33.58%    +0.09%
- Complexity      776      780        +4
============================================
  Files           108      107        -1
  Lines         37178    37211       +33
  Branches       8146     8160       +14
============================================
+ Hits          12448    12496       +48
+ Misses        22107    22076       -31
- Partials       2623     2639       +16

☔ View full report in Codecov by Sentry.
LGTM
cc @sunchao
cc @sunchao Please take a look. Thanks.
LGTM (pending CI)
A few tests seem to need updating. Let me take a look.
batches.map { batch =>
  val codec = CompressionCodec.createCodec(SparkEnv.get.conf)
  val cbbos = new ChunkedByteBufferOutputStream(1024 * 1024, ByteBuffer.allocate)
  val out = new DataOutputStream(codec.compressedOutputStream(cbbos))

  val count = new NativeUtil().serializeBatches(iter, out)
  val (fieldVectors, batchProviderOpt) = nativeUtil.getBatchFieldVectors(batch)
  val root = new VectorSchemaRoot(fieldVectors.asJava)
  val provider = batchProviderOpt.getOrElse(nativeUtil.getDictionaryProvider)

  val writer = new ArrowStreamWriter(root, provider, Channels.newChannel(out))
  writer.start()
  writer.writeBatch()

  root.clear()
  writer.end()
Previously `serializeBatches` was wrong: it serialized all batches with a single `ArrowStreamWriter`. That causes wrong results when serializing dictionary arrays, i.e., #241. Each batch could have different dictionary provider content, but when `ArrowStreamWriter` starts to serialize, it writes out dictionaries at the beginning, so later batches would use incorrect dictionary values.
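A hedged sketch of the per-batch approach described above: each batch gets its own `ArrowStreamWriter`, so that batch's dictionaries are written with it instead of all batches reusing the dictionaries captured from the first batch. The helper below is illustrative only; the real code writes into a compressed `ChunkedByteBufferOutputStream` and obtains the root and dictionary provider from Comet's `NativeUtil`.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}
import java.nio.channels.Channels

import org.apache.arrow.vector.VectorSchemaRoot
import org.apache.arrow.vector.dictionary.DictionaryProvider
import org.apache.arrow.vector.ipc.ArrowStreamWriter

object PerBatchSerializationSketch {
  // Serialize exactly one batch per stream. Because the writer is created
  // fresh for this batch, the dictionaries it writes in the stream header are
  // the ones belonging to this batch, not stale values from an earlier batch.
  def serializeOneBatch(root: VectorSchemaRoot, provider: DictionaryProvider): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    val writer = new ArrowStreamWriter(root, provider, Channels.newChannel(out))
    writer.start()
    writer.writeBatch()
    writer.end()
    out.flush()
    bos.toByteArray
  }
}
```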
@@ -191,7 +193,7 @@ case class CometBroadcastExchangeExec(originalPlan: SparkPlan, child: SparkPlan)
   override protected def doExecuteColumnar(): RDD[ColumnarBatch] = {
     val broadcasted = executeBroadcast[Array[ChunkedByteBuffer]]()

-    new CometBatchRDD(sparkContext, broadcasted.value.length, broadcasted)
+    new CometBatchRDD(sparkContext, childRDD.getNumPartitions, broadcasted)
The broadcast RDD must have the same number of partitions as the child RDD. Previously we serialized all batches in one partition into a single `ChunkedByteBuffer`, so `broadcasted.value.length` was the number of partitions. Now one batch is serialized per `ChunkedByteBuffer`, so we need to use the correct number.
Update: the child RDD's partition count may also differ from the zipping side's. We need to get the partition count of the zipping side when this execution method is triggered.
This issue is described in #243.
def serializeBatches(batches: Iterator[ColumnarBatch]): Iterator[(Long, ChunkedByteBuffer)] = {
  batches.map { batch =>
    val dictionaryProvider: CDataDictionaryProvider = new CDataDictionaryProvider

    val codec = CompressionCodec.createCodec(SparkEnv.get.conf)
    val cbbos = new ChunkedByteBufferOutputStream(1024 * 1024, ByteBuffer.allocate)
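For completeness, a buffer produced this way could be read back roughly as sketched below. This is an assumption-laden illustration, not the actual Comet deserialization path: it assumes the input stream comes from a single per-batch `ChunkedByteBuffer`, that the same Spark codec is used to decompress it, and, since Spark's compression and chunked-buffer classes are private to Spark (see the comment after this snippet), such code would have to live under an `org.apache.spark` package.

```scala
import java.io.InputStream

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.ipc.ArrowStreamReader
import org.apache.spark.SparkEnv
import org.apache.spark.io.CompressionCodec

object PerBatchDeserializationSketch {
  // `serialized` is the raw (compressed) stream of one per-batch buffer.
  // Each buffer is expected to contain exactly one Arrow record batch.
  // NOTE: CompressionCodec is private[spark], so this sketch assumes it is
  // compiled inside an org.apache.spark package.
  def readOneBatch(serialized: InputStream): Unit = {
    val codec = CompressionCodec.createCodec(SparkEnv.get.conf)
    val in = codec.compressedInputStream(serialized)
    val allocator = new RootAllocator(Long.MaxValue)
    val reader = new ArrowStreamReader(in, allocator)
    try {
      while (reader.loadNextBatch()) {
        val root = reader.getVectorSchemaRoot
        // consume root.getFieldVectors here
      }
    } finally {
      reader.close()
      allocator.close()
    }
  }
}
```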
I need to move `serializeBatches` into the `spark` package because `ChunkedByteBufferOutputStream` is a Spark private class. I cannot move `serializeBatches` to the `spark` module because it uses Arrow packages (we shade Arrow in the `common` module).
Merged. Thanks.
These changes to testing were included in apache/datafusion-comet#213
* feat: Remove COMET_EXEC_BROADCAST_ENABLED
* Fix
* Fix
* Update plan stability
* Fix
* Remove unused import and class
* Fix
* Remove unused imports
* Fix
* Fix scala style
* fix
* Fix
* Update diff
Which issue does this PR close?
Closes #212.
Closes #241.
Closes #243.
Rationale for this change
What changes are included in this PR?
How are these changes tested?