
[VL] Provide options to combine small batches before sending to shuffle #6009

Merged: 5 commits, Jun 11, 2024

Conversation

@zhztheplayer (Member) commented Jun 7, 2024

It has been observed that the Velox hash-based shuffle is slowed down by small input batches.

The patch:

  1. Adds two options (an example configuration is sketched after this list):
    • spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle
      (Default: false) Set to true to combine small batches, with the minimal batch size determined by spark.gluten.sql.columnar.maxBatchSize. (Note that maxBatchSize is a misnomer in Gluten; a more accurate name would be minBatchSize or simply batchSize.)
    • spark.gluten.sql.columnar.backend.velox.minBatchSizeForShuffle
      (Optional) Set to override the minimal batch size used by coalesceBatchesBeforeShuffle.
  2. Performs necessary code refactoring and cleanup.
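
For illustration only, a minimal sketch of how these options might be set on a Spark session. The plugin wiring is omitted and the values shown (a 4096-row minimum) are placeholders, not recommendations:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch, assuming a Gluten-enabled Spark build.
    // The values below are illustrative placeholders, not recommendations.
    val spark = SparkSession
      .builder()
      .appName("coalesce-batches-before-shuffle-example")
      // Batch size used as the default coalescing minimum.
      .config("spark.gluten.sql.columnar.maxBatchSize", "4096")
      // Option added by this PR: combine small batches before the shuffle writer.
      .config("spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle", "true")
      // Option added by this PR: override the minimal batch size used for coalescing.
      .config("spark.gluten.sql.columnar.backend.velox.minBatchSizeForShuffle", "4096")
      .getOrCreate()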

Comparisons

(spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle=false/true)

Q31 total time, before and after (SF1000 partitioned table, scan partitions 112, shuffle partitions 112):

[screenshot: Q31 total time, before vs. after]

Closer look at exchange, before and after:

[screenshots: exchange node metrics, before vs. after]

@zhztheplayer force-pushed the wip-shuffle-combine branch from 3be5ae8 to b6354c5 on June 7, 2024 02:46
@zhztheplayer (Member, author)

@marin-ma There might be some batch-wise overhead around shuffle split processing. We may want to investigate it later so we can avoid such batch-coalescing operations, which introduce extra copies.

@marin-ma (Contributor) commented Jun 7, 2024

cc: @WangGuangxin @FelixYBW

@zhztheplayer force-pushed the wip-shuffle-combine branch from 626e837 to 9e26c72 on June 11, 2024 01:52
@zhztheplayer marked this pull request as ready for review on June 11, 2024 01:52

Run Gluten Clickhouse CI

@Yohahaha (Contributor)

#5951 (comment)
I see the above comments try to add a new operator to collect small batches into a bigger batch. Will it be included in this PR?

@zhztheplayer (Member, author)

@XinShuoWang

Do you have any thoughts on moving to this approach, since ec3e92e has been merged? I don't have a strong preference, except that we need a configuration option and metrics for batch appending.

I suggest merging this first to make the behavior configurable and visible in metrics. Then, if we want to continue with #5951's approach, you can open another PR to bring that code back, reuse the conf code added in this patch, and remove the calls to maybeAddAppendBatchesExec to disable AppendBatchesExec for shuffle.

@zhztheplayer (Member, author)

#5951 (comment) I see the above comments try to add a new operator to collect small batches into a bigger batch. Will it be included in this PR?

It's included, namely VeloxAppendBatchesExec.

Code https://github.com/apache/incubator-gluten/pull/6009/files#diff-e058841b0f54a15144e01a1089457b7e5a28a953f85df3a1e656fcf6eddfe203R38
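
For readers following along, a minimal sketch of the coalescing idea behind such an operator. This is not the actual VeloxAppendBatchesExec code; the Batch type and append helper below are placeholders for ColumnarBatch and the native append path:

    import scala.collection.mutable.ArrayBuffer

    object CoalesceSketch {
      // Illustrative sketch only; not the actual VeloxAppendBatchesExec implementation.
      // `Batch` and `append` stand in for ColumnarBatch and the native append.
      case class Batch(numRows: Int)

      def append(batches: Seq[Batch]): Batch = Batch(batches.map(_.numRows).sum)

      // Pull input batches until the accumulated row count reaches `minBatchSize`
      // (or the input ends), then emit one appended batch.
      def coalesce(input: Iterator[Batch], minBatchSize: Int): Iterator[Batch] =
        new Iterator[Batch] {
          override def hasNext: Boolean = input.hasNext

          override def next(): Batch = {
            val buffered = ArrayBuffer.empty[Batch]
            var rows = 0
            while (input.hasNext && rows < minBatchSize) {
              val b = input.next()
              buffered += b
              rows += b.numRows
            }
            append(buffered.toSeq)
          }
        }
    }

Because the threshold is only checked between input batches, an emitted batch can overshoot the minimum by at most one input batch's rows.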

@ulysses-you (Contributor)

/Benchmark Velox TPCDS

Comment on lines +69 to +125
std::unique_ptr<gluten::JniColumnarBatchIterator> gluten::makeJniColumnarBatchIterator(
    JNIEnv* env,
    jobject jColumnarBatchItr,
    gluten::Runtime* runtime,
    std::shared_ptr<ArrowWriter> writer) {
  return std::make_unique<JniColumnarBatchIterator>(env, jColumnarBatchItr, runtime, writer);
}

gluten::JniColumnarBatchIterator::JniColumnarBatchIterator(
    JNIEnv* env,
    jobject jColumnarBatchItr,
    gluten::Runtime* runtime,
    std::shared_ptr<ArrowWriter> writer)
    : runtime_(runtime), writer_(writer) {
  // IMPORTANT: DO NOT USE LOCAL REF IN DIFFERENT THREAD
  if (env->GetJavaVM(&vm_) != JNI_OK) {
    std::string errorMessage = "Unable to get JavaVM instance";
    throw gluten::GlutenException(errorMessage);
  }
  serializedColumnarBatchIteratorClass_ =
      createGlobalClassReferenceOrError(env, "Lorg/apache/gluten/vectorized/ColumnarBatchInIterator;");
  serializedColumnarBatchIteratorHasNext_ =
      getMethodIdOrError(env, serializedColumnarBatchIteratorClass_, "hasNext", "()Z");
  serializedColumnarBatchIteratorNext_ = getMethodIdOrError(env, serializedColumnarBatchIteratorClass_, "next", "()J");
  jColumnarBatchItr_ = env->NewGlobalRef(jColumnarBatchItr);
}

gluten::JniColumnarBatchIterator::~JniColumnarBatchIterator() {
  JNIEnv* env;
  attachCurrentThreadAsDaemonOrThrow(vm_, &env);
  env->DeleteGlobalRef(jColumnarBatchItr_);
  env->DeleteGlobalRef(serializedColumnarBatchIteratorClass_);
  vm_->DetachCurrentThread();
}

std::shared_ptr<gluten::ColumnarBatch> gluten::JniColumnarBatchIterator::next() {
  JNIEnv* env;
  attachCurrentThreadAsDaemonOrThrow(vm_, &env);
  if (!env->CallBooleanMethod(jColumnarBatchItr_, serializedColumnarBatchIteratorHasNext_)) {
    checkException(env);
    return nullptr; // stream ended
  }

  checkException(env);
  jlong handle = env->CallLongMethod(jColumnarBatchItr_, serializedColumnarBatchIteratorNext_);
  checkException(env);
  auto batch = runtime_->objectStore()->retrieve<ColumnarBatch>(handle);
  if (writer_ != nullptr) {
    // save snapshot of the batch to file
    std::shared_ptr<ArrowSchema> schema = batch->exportArrowSchema();
    std::shared_ptr<ArrowArray> array = batch->exportArrowArray();
    auto rb = gluten::arrowGetOrThrow(arrow::ImportRecordBatch(array.get(), schema.get()));
    GLUTEN_THROW_NOT_OK(writer_->initWriter(*(rb->schema().get())));
    GLUTEN_THROW_NOT_OK(writer_->writeInBatches(rb));
  }
  return batch;
}
@zhztheplayer (Member, author)

code movement

Comment on lines +261 to +294
class JniColumnarBatchIterator : public ColumnarBatchIterator {
 public:
  explicit JniColumnarBatchIterator(
      JNIEnv* env,
      jobject jColumnarBatchItr,
      Runtime* runtime,
      std::shared_ptr<ArrowWriter> writer);

  // Non-copyable and non-movable.
  JniColumnarBatchIterator(const JniColumnarBatchIterator&) = delete;
  JniColumnarBatchIterator(JniColumnarBatchIterator&&) = delete;
  JniColumnarBatchIterator& operator=(const JniColumnarBatchIterator&) = delete;
  JniColumnarBatchIterator& operator=(JniColumnarBatchIterator&&) = delete;

  virtual ~JniColumnarBatchIterator();

  std::shared_ptr<ColumnarBatch> next() override;

 private:
  JavaVM* vm_;
  jobject jColumnarBatchItr_;
  Runtime* runtime_;
  std::shared_ptr<ArrowWriter> writer_;

  jclass serializedColumnarBatchIteratorClass_;
  jmethodID serializedColumnarBatchIteratorHasNext_;
  jmethodID serializedColumnarBatchIteratorNext_;
};

std::unique_ptr<JniColumnarBatchIterator> makeJniColumnarBatchIterator(
    JNIEnv* env,
    jobject jColumnarBatchItr,
    Runtime* runtime,
    std::shared_ptr<ArrowWriter> writer);
@zhztheplayer (Member, author)

code movement

Comment on lines +104 to +114
      case p if TransformHints.isNotTransformable(p) =>
        p
      case s: ShuffleExchangeExec
          if (s.child.supportsColumnar || GlutenConfig.getConf.enablePreferColumnar) &&
            BackendsApiManager.getSettings.supportColumnarShuffleExec() =>
        logDebug(s"Columnar Processing for ${s.getClass} is currently supported.")
        BackendsApiManager.getSparkPlanExecApiInstance.genColumnarShuffleExchange(s)
      case b: BroadcastExchangeExec =>
        val child = b.child
        logDebug(s"Columnar Processing for ${b.getClass} is currently supported.")
        ColumnarBroadcastExchangeExec(b.mode, child)
@zhztheplayer (Member, author)

code simplification

@@ -101,7 +101,7 @@ trait SparkPlanExecApi {
       aggregateExpressions: Seq[AggregateExpression],
       aggregateAttributes: Seq[Attribute]): HashAggregateExecPullOutBaseHelper
 
-  def genColumnarShuffleExchange(shuffle: ShuffleExchangeExec, newChild: SparkPlan): SparkPlan
+  def genColumnarShuffleExchange(shuffle: ShuffleExchangeExec): SparkPlan
@zhztheplayer (Member, author)

API simplification

@Yohahaha (Contributor)

I think the design is the same as facebookincubator/velox#7801, and it would be great to implement it on the Gluten side.

My suggestion is to implement VeloxAppendBatchesExec first, then provide a config for shuffle write and table scan to add this new operator.

@zhli1142015 (Contributor) commented Jun 11, 2024

I think the design is the same as facebookincubator/velox#7801, and it would be great to implement it on the Gluten side.

My suggestion is to implement VeloxAppendBatchesExec first, then provide a config for shuffle write and table scan to add this new operator.

I think Filter and HashJoin (with filter) may need this as well. Thanks.

@Yohahaha (Contributor)

I think the design is the same as facebookincubator/velox#7801, and it would be great to implement it on the Gluten side.
My suggestion is to implement VeloxAppendBatchesExec first, then provide a config for shuffle write and table scan to add this new operator.

I think Filter and HashJoin (with filter) may need this as well. Thanks.

Yeah, the Velox issue above lists all the operators that may need this optimization. We could provide a config for these operators whose value is a list of operator name strings.
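
Purely as a hypothetical illustration of that suggestion (the config key and operator names below are invented for this sketch and do not exist in Gluten), such a list-valued option could be parsed like this:

    object CoalesceBeforeConfSketch {
      // Hypothetical key, invented for illustration; not a real Gluten option.
      val confKey = "spark.gluten.sql.columnar.coalesceBatchesBefore"

      // Example value: a comma-separated list of operator names.
      val confValue = "shuffle,filter,hashjoin"

      // Normalize into a lookup set so a planner rule can check each operator.
      val coalesceBefore: Set[String] =
        confValue.split(",").map(_.trim.toLowerCase).filter(_.nonEmpty).toSet

      def needsCoalesceBefore(operatorName: String): Boolean =
        coalesceBefore.contains(operatorName.toLowerCase)
    }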

@FelixYBW (Contributor)

spark.gluten.sql.columnar.maxBatchSize. (Note that maxBatchSize is a misnomer in Gluten; a more accurate name would be minBatchSize)

Just a comment on this: minBatchSize isn't accurate either. The accurate description of the batch size is: "Velox will try its best to limit the row count per RowVector to the maxBatchSize config; however, this is not guaranteed." @zhztheplayer can you update the documentation to highlight this?

@FelixYBW (Contributor)

@zhztheplayer Let's remove the option and make it the default behavior, as long as it benefits all cases.

@FelixYBW (Contributor)

There might be some batch-wise overhead around shuffle split processing. We may want to investigate it later so we can avoid such batch-coalescing operations, which introduce extra copies.

It's caused by the initialization in the current split function. Currently we use three nested loops (per column, per reducer, per row) to do the split; if the column data is cached, this approach scales best with the number of reducers. However, to achieve this we need a fair amount of initialization work to create several vectors. If the input batch is small, we suffer from that initialization overhead, which can be even bigger than the cost of copying into larger batches.

Another issue is that if the data size is too large and exceeds the cache size, performance becomes very poor.
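
To make that argument concrete, here is a schematic sketch of the loop structure being described. The real splitter is C++ inside Gluten's shuffle writer; the Long-valued column representation below is a simplification for illustration only:

    object SplitSketch {
      // Schematic only: the real split code is C++ in Gluten's shuffle writer.
      // Each call must set up one buffer per (column, reducer) pair before the
      // three nested loops run, so that setup cost is paid per input batch no
      // matter how few rows the batch carries.
      def splitBatch(
          columns: Array[Array[Long]], // columnar input: columns(col)(row)
          partitionIds: Array[Int],    // precomputed reducer id per row
          numReducers: Int): Array[Array[Array[Long]]] = {
        import scala.collection.mutable.ArrayBuffer

        // Initialization work: for tiny batches this can outweigh the copying itself.
        val buffers = Array.fill(columns.length, numReducers)(ArrayBuffer.empty[Long])

        for (col <- columns.indices) {           // per column
          for (reducer <- 0 until numReducers) { // per reducer
            for (row <- partitionIds.indices) {  // per row
              if (partitionIds(row) == reducer) {
                buffers(col)(reducer) += columns(col)(row)
              }
            }
          }
        }
        buffers.map(_.map(_.toArray))
      }
    }

The per-reducer pass over the rows also hints at the cache-size caveat mentioned above: once the batch no longer fits in cache, each of those passes becomes a memory-bound scan.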

@zhztheplayer (Member, author)

Just a comment on this: minBatchSize isn't accurate either. The accurate description of the batch size is: "Velox will try its best to limit the row count per RowVector to the maxBatchSize config; however, this is not guaranteed." @zhztheplayer can you update the documentation to highlight this?

It's planned, but I'll do that in another PR. Setting it to true by default may fail some plan checks in UTs.

@FelixYBW (Contributor)

It's planned, but I'll do that in another PR. Setting it to true by default may fail some plan checks in UTs.

Why would the UTs fail? We should fix that.

@zhztheplayer (Member, author) commented Jun 11, 2024

Why would the UTs fail? We should fix that.

The PR adds a new operator, VeloxAppendBatchesExec. Plan checks may fail because of this new operator. It's benign; to fix it we just need to update the UT code.

@FelixYBW (Contributor)

Plan checks may fail because of this new operator. It's benign; to fix it we just need to update the UT code.

Let's update the UT code then.

@zhztheplayer (Member, author) commented Jun 11, 2024

Just a comment on this: minBatchSize isn't accurate either. The accurate description of the batch size is: "Velox will try its best to limit the row count per RowVector to the maxBatchSize config; however, this is not guaranteed." @zhztheplayer can you update the documentation to highlight this?

The tricky part is that I don't see us consistently following either a min or a max criterion when using the option (correct me if I'm wrong). It's used as a minimum in the shuffle reader, but may be used as a maximum in the shuffle writer. I haven't gone through the scan code, so I'm not sure about that part.

Maybe we can change the option name to targetBatchSize to clarify. Perhaps it's not that important whether a batch is slightly larger or smaller than this size.

@FelixYBW (Contributor)

One misunderstanding is that "we should avoid memcpy as much as possible", but in fact Gluten isn't memory-throughput bound yet. Sequential data reads and writes are cheap operations as long as the block size isn't too small (e.g. a few bytes with unpredictable sizes, where the overhead becomes branch misprediction) or too large (e.g. GB level).

@FelixYBW (Contributor)

Maybe we can change the option name to targetBatchSize to clarify.

In the long term, Velox should limit the batch size to the configured maxBatchSize, and so should Gluten. For operators like Combine, we may limit the batch size to 2 x maxBatchSize so we needn't cache the second batch.

@FelixYBW (Contributor) commented Jun 11, 2024

It's included, namely VeloxAppendBatchesExec.

Code https://github.com/apache/incubator-gluten/pull/6009/files#diff-e058841b0f54a15144e01a1089457b7e5a28a953f85df3a1e656fcf6eddfe203R38

We may need to propose this Exec to Velox

@marin-ma (Contributor) left a comment

LGTM. Thanks!

@zhztheplayer (Member, author)

@XinShuoWang I'll merge this and then do some follow-ups. Let me know if you have any comments, thanks.

Also thank you guys for reviewing. :)

@zhztheplayer (Member, author) commented Jun 11, 2024

/Benchmark Velox TPCDS

The PR disables the feature by default. Let's merge this first and benchmark in the subsequent PRs.
