
[VL] Provide options to combine small batches before sending to shuffle #6009

Merged: 5 commits, Jun 11, 2024

Conversation

@zhztheplayer (Member) commented Jun 7, 2024

It has been observed that the Velox hash-based shuffle is slowed down by small input batches.

The patch:

  1. Adds two options (an example configuration is sketched after this list):
    • spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle
      (Default: false) Set to true to combine small batches, with the minimal batch size determined by spark.gluten.sql.columnar.maxBatchSize. (Note that maxBatchSize is a misnomer in Gluten; a more accurate name would be minBatchSize or simply batchSize.)
    • spark.gluten.sql.columnar.backend.velox.minBatchSizeForShuffle
      (Optional) Set to override the minimal batch size used by coalesceBatchesBeforeShuffle.
  2. Performs necessary code refactoring and cleanup.
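
For illustration only, a minimal sketch of how these options might be set on a Spark session. The plugin wiring is omitted and the values shown (a 4096-row minimum) are placeholders, not recommendations:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch, assuming a Gluten-enabled Spark build.
    // The values below are illustrative placeholders, not recommendations.
    val spark = SparkSession
      .builder()
      .appName("coalesce-batches-before-shuffle-example")
      // Batch size used as the default coalescing minimum.
      .config("spark.gluten.sql.columnar.maxBatchSize", "4096")
      // Option added by this PR: combine small batches before the shuffle writer.
      .config("spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle", "true")
      // Option added by this PR: override the minimal batch size used for coalescing.
      .config("spark.gluten.sql.columnar.backend.velox.minBatchSizeForShuffle", "4096")
      .getOrCreate()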

Comparisons

(spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle=false/true)

Q31 total time, before and after (SF1000 partitioned table, scan partitions 112, shuffle partitions 112):

[screenshot: Q31 total time, before vs. after]

Closer look at exchange, before and after:

[screenshots: exchange node metrics, before vs. after]

@zhztheplayer force-pushed the wip-shuffle-combine branch from 3be5ae8 to b6354c5 on June 7, 2024 02:46
@zhztheplayer (Member, author)

@marin-ma There might be some batch-wise overhead around shuffle split processing. We may want to investigate it later so we can avoid such batch-coalescing operations, which introduce extra copies.

@marin-ma (Contributor) commented Jun 7, 2024

cc: @WangGuangxin @FelixYBW

@zhztheplayer force-pushed the wip-shuffle-combine branch from 626e837 to 9e26c72 on June 11, 2024 01:52
@zhztheplayer marked this pull request as ready for review on June 11, 2024 01:52

Run Gluten Clickhouse CI

@Yohahaha (Contributor)

#5951 (comment)
I see the above comments try to add a new operator to collect small batches into a bigger batch. Will it be included in this PR?

@zhztheplayer (Member, author)

@XinShuoWang

Do you have any thoughts on moving to this approach, since ec3e92e has been merged? I don't have a strong preference, except that we need a configuration option and metrics for batch appending.

I suggest merging this first to make the behavior configurable and visible in metrics. Then, if we want to continue with #5951's approach, you can open another PR to bring that code back, reuse the conf code added in this patch, and remove the calls to maybeAddAppendBatchesExec to disable AppendBatchesExec for shuffle.

@zhztheplayer (Member, author)

#5951 (comment) I see the above comments try to add a new operator to collect small batches into a bigger batch. Will it be included in this PR?

It's included, namely VeloxAppendBatchesExec.

Code https://github.com/apache/incubator-gluten/pull/6009/files#diff-e058841b0f54a15144e01a1089457b7e5a28a953f85df3a1e656fcf6eddfe203R38
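
For readers following along, a minimal sketch of the coalescing idea behind such an operator. This is not the actual VeloxAppendBatchesExec code; the Batch type and append helper below are placeholders for ColumnarBatch and the native append path:

    import scala.collection.mutable.ArrayBuffer

    object CoalesceSketch {
      // Illustrative sketch only; not the actual VeloxAppendBatchesExec implementation.
      // `Batch` and `append` stand in for ColumnarBatch and the native append.
      case class Batch(numRows: Int)

      def append(batches: Seq[Batch]): Batch = Batch(batches.map(_.numRows).sum)

      // Pull input batches until the accumulated row count reaches `minBatchSize`
      // (or the input ends), then emit one appended batch.
      def coalesce(input: Iterator[Batch], minBatchSize: Int): Iterator[Batch] =
        new Iterator[Batch] {
          override def hasNext: Boolean = input.hasNext

          override def next(): Batch = {
            val buffered = ArrayBuffer.empty[Batch]
            var rows = 0
            while (input.hasNext && rows < minBatchSize) {
              val b = input.next()
              buffered += b
              rows += b.numRows
            }
            append(buffered.toSeq)
          }
        }
    }

Because the threshold is only checked between input batches, an emitted batch can overshoot the minimum by at most one input batch's rows.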

@ulysses-you (Contributor)

/Benchmark Velox TPCDS

Comment on lines +69 to +125
std::unique_ptr<gluten::JniColumnarBatchIterator> gluten::makeJniColumnarBatchIterator(
    JNIEnv* env,
    jobject jColumnarBatchItr,
    gluten::Runtime* runtime,
    std::shared_ptr<ArrowWriter> writer) {
  return std::make_unique<JniColumnarBatchIterator>(env, jColumnarBatchItr, runtime, writer);
}

gluten::JniColumnarBatchIterator::JniColumnarBatchIterator(
    JNIEnv* env,
    jobject jColumnarBatchItr,
    gluten::Runtime* runtime,
    std::shared_ptr<ArrowWriter> writer)
    : runtime_(runtime), writer_(writer) {
  // IMPORTANT: DO NOT USE LOCAL REF IN DIFFERENT THREAD
  if (env->GetJavaVM(&vm_) != JNI_OK) {
    std::string errorMessage = "Unable to get JavaVM instance";
    throw gluten::GlutenException(errorMessage);
  }
  serializedColumnarBatchIteratorClass_ =
      createGlobalClassReferenceOrError(env, "Lorg/apache/gluten/vectorized/ColumnarBatchInIterator;");
  serializedColumnarBatchIteratorHasNext_ =
      getMethodIdOrError(env, serializedColumnarBatchIteratorClass_, "hasNext", "()Z");
  serializedColumnarBatchIteratorNext_ = getMethodIdOrError(env, serializedColumnarBatchIteratorClass_, "next", "()J");
  jColumnarBatchItr_ = env->NewGlobalRef(jColumnarBatchItr);
}

gluten::JniColumnarBatchIterator::~JniColumnarBatchIterator() {
  JNIEnv* env;
  attachCurrentThreadAsDaemonOrThrow(vm_, &env);
  env->DeleteGlobalRef(jColumnarBatchItr_);
  env->DeleteGlobalRef(serializedColumnarBatchIteratorClass_);
  vm_->DetachCurrentThread();
}

std::shared_ptr<gluten::ColumnarBatch> gluten::JniColumnarBatchIterator::next() {
  JNIEnv* env;
  attachCurrentThreadAsDaemonOrThrow(vm_, &env);
  if (!env->CallBooleanMethod(jColumnarBatchItr_, serializedColumnarBatchIteratorHasNext_)) {
    checkException(env);
    return nullptr; // stream ended
  }

  checkException(env);
  jlong handle = env->CallLongMethod(jColumnarBatchItr_, serializedColumnarBatchIteratorNext_);
  checkException(env);
  auto batch = runtime_->objectStore()->retrieve<ColumnarBatch>(handle);
  if (writer_ != nullptr) {
    // save snapshot of the batch to file
    std::shared_ptr<ArrowSchema> schema = batch->exportArrowSchema();
    std::shared_ptr<ArrowArray> array = batch->exportArrowArray();
    auto rb = gluten::arrowGetOrThrow(arrow::ImportRecordBatch(array.get(), schema.get()));
    GLUTEN_THROW_NOT_OK(writer_->initWriter(*(rb->schema().get())));
    GLUTEN_THROW_NOT_OK(writer_->writeInBatches(rb));
  }
  return batch;
}
@zhztheplayer (Member, author)

code movement

Comment on lines +261 to +294
class JniColumnarBatchIterator : public ColumnarBatchIterator {
 public:
  explicit JniColumnarBatchIterator(
      JNIEnv* env,
      jobject jColumnarBatchItr,
      Runtime* runtime,
      std::shared_ptr<ArrowWriter> writer);

  // Non-copyable and non-movable.
  JniColumnarBatchIterator(const JniColumnarBatchIterator&) = delete;
  JniColumnarBatchIterator(JniColumnarBatchIterator&&) = delete;
  JniColumnarBatchIterator& operator=(const JniColumnarBatchIterator&) = delete;
  JniColumnarBatchIterator& operator=(JniColumnarBatchIterator&&) = delete;

  virtual ~JniColumnarBatchIterator();

  std::shared_ptr<ColumnarBatch> next() override;

 private:
  JavaVM* vm_;
  jobject jColumnarBatchItr_;
  Runtime* runtime_;
  std::shared_ptr<ArrowWriter> writer_;

  jclass serializedColumnarBatchIteratorClass_;
  jmethodID serializedColumnarBatchIteratorHasNext_;
  jmethodID serializedColumnarBatchIteratorNext_;
};

std::unique_ptr<JniColumnarBatchIterator> makeJniColumnarBatchIterator(
    JNIEnv* env,
    jobject jColumnarBatchItr,
    Runtime* runtime,
    std::shared_ptr<ArrowWriter> writer);
@zhztheplayer (Member, author)

code movement

Comment on lines +104 to +114
      case p if TransformHints.isNotTransformable(p) =>
        p
      case s: ShuffleExchangeExec
          if (s.child.supportsColumnar || GlutenConfig.getConf.enablePreferColumnar) &&
            BackendsApiManager.getSettings.supportColumnarShuffleExec() =>
        logDebug(s"Columnar Processing for ${s.getClass} is currently supported.")
        BackendsApiManager.getSparkPlanExecApiInstance.genColumnarShuffleExchange(s)
      case b: BroadcastExchangeExec =>
        val child = b.child
        logDebug(s"Columnar Processing for ${b.getClass} is currently supported.")
        ColumnarBroadcastExchangeExec(b.mode, child)
@zhztheplayer (Member, author)

code simplification

@@ -101,7 +101,7 @@ trait SparkPlanExecApi {
       aggregateExpressions: Seq[AggregateExpression],
       aggregateAttributes: Seq[Attribute]): HashAggregateExecPullOutBaseHelper
 
-  def genColumnarShuffleExchange(shuffle: ShuffleExchangeExec, newChild: SparkPlan): SparkPlan
+  def genColumnarShuffleExchange(shuffle: ShuffleExchangeExec): SparkPlan
@zhztheplayer (Member, author)

API simplification

@Yohahaha (Contributor)

I think the design is the same as facebookincubator/velox#7801, and it would be great to implement it on the Gluten side.

My suggestion is to implement VeloxAppendBatchesExec first, then provide a config for shuffle write and table scan to add this new operator.

@zhli1142015 (Contributor) commented Jun 11, 2024

I think the design is the same as facebookincubator/velox#7801, and it would be great to implement it on the Gluten side.

My suggestion is to implement VeloxAppendBatchesExec first, then provide a config for shuffle write and table scan to add this new operator.

I think Filter and HashJoin (with filter) may need this as well. Thanks.

@Yohahaha (Contributor)

I think the design is the same as facebookincubator/velox#7801, and it would be great to implement it on the Gluten side.
My suggestion is to implement VeloxAppendBatchesExec first, then provide a config for shuffle write and table scan to add this new operator.

I think Filter and HashJoin (with filter) may need this as well. Thanks.

Yeah, the Velox issue above lists all the operators that may need this optimization. We could provide a config for these operators whose value is a list of operator name strings.
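
Purely as a hypothetical illustration of that suggestion (the config key and operator names below are invented for this sketch and do not exist in Gluten), such a list-valued option could be parsed like this:

    object CoalesceBeforeConfSketch {
      // Hypothetical key, invented for illustration; not a real Gluten option.
      val confKey = "spark.gluten.sql.columnar.coalesceBatchesBefore"

      // Example value: a comma-separated list of operator names.
      val confValue = "shuffle,filter,hashjoin"

      // Normalize into a lookup set so a planner rule can check each operator.
      val coalesceBefore: Set[String] =
        confValue.split(",").map(_.trim.toLowerCase).filter(_.nonEmpty).toSet

      def needsCoalesceBefore(operatorName: String): Boolean =
        coalesceBefore.contains(operatorName.toLowerCase)
    }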

@FelixYBW (Contributor)

spark.gluten.sql.columnar.maxBatchSize. (Note that maxBatchSize is a misnomer in Gluten; a more accurate name would be minBatchSize)

Just a comment on this: minBatchSize isn't accurate either. The accurate description of the batch size is: "Velox will try its best to limit the row count per RowVector to the maxBatchSize config; however, this is not guaranteed." @zhztheplayer can you update the documentation to highlight this?

@FelixYBW (Contributor)

@zhztheplayer Let's remove the option and make it the default behavior, as long as it benefits all cases.

@FelixYBW (Contributor)

There might be some batch-wise overhead around shuffle split processing. We may want to investigate it later so we can avoid such batch-coalescing operations, which introduce extra copies.

It's caused by the initialization in the current split function. Currently we use three nested loops (per column, per reducer, per row) to do the split; if the column data is cached, this approach scales best with the number of reducers. However, to achieve this we need a fair amount of initialization work to create several vectors. If the input batch is small, we suffer from that initialization overhead, which can be even bigger than the cost of copying into larger batches.

Another issue is that if the data size is too large and exceeds the cache size, performance becomes very poor.
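
To make that argument concrete, here is a schematic sketch of the loop structure being described. The real splitter is C++ inside Gluten's shuffle writer; the Long-valued column representation below is a simplification for illustration only:

    object SplitSketch {
      // Schematic only: the real split code is C++ in Gluten's shuffle writer.
      // Each call must set up one buffer per (column, reducer) pair before the
      // three nested loops run, so that setup cost is paid per input batch no
      // matter how few rows the batch carries.
      def splitBatch(
          columns: Array[Array[Long]], // columnar input: columns(col)(row)
          partitionIds: Array[Int],    // precomputed reducer id per row
          numReducers: Int): Array[Array[Array[Long]]] = {
        import scala.collection.mutable.ArrayBuffer

        // Initialization work: for tiny batches this can outweigh the copying itself.
        val buffers = Array.fill(columns.length, numReducers)(ArrayBuffer.empty[Long])

        for (col <- columns.indices) {           // per column
          for (reducer <- 0 until numReducers) { // per reducer
            for (row <- partitionIds.indices) {  // per row
              if (partitionIds(row) == reducer) {
                buffers(col)(reducer) += columns(col)(row)
              }
            }
          }
        }
        buffers.map(_.map(_.toArray))
      }
    }

The per-reducer pass over the rows also hints at the cache-size caveat mentioned above: once the batch no longer fits in cache, each of those passes becomes a memory-bound scan.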

@zhztheplayer (Member, author)

Just a comment on this: minBatchSize isn't accurate either. The accurate description of the batch size is: "Velox will try its best to limit the row count per RowVector to the maxBatchSize config; however, this is not guaranteed." @zhztheplayer can you update the documentation to highlight this?

It's planned, but I'll do that in another PR. Setting it to true by default may fail some plan checks in UTs.

@FelixYBW (Contributor)

It's planned, but I'll do that in another PR. Setting it to true by default may fail some plan checks in UTs.

Why would the UTs fail? We should fix that.

@zhztheplayer (Member, author) commented Jun 11, 2024

Why would the UTs fail? We should fix that.

The PR adds a new operator, VeloxAppendBatchesExec. Plan checks may fail because of this new operator. It's benign; to fix it we just need to update the UT code.

@FelixYBW (Contributor)

Plan checks may fail because of this new operator. It's benign; to fix it we just need to update the UT code.

Let's update the UT code then.

@zhztheplayer (Member, author) commented Jun 11, 2024

Just a comment on this: minBatchSize isn't accurate either. The accurate description of the batch size is: "Velox will try its best to limit the row count per RowVector to the maxBatchSize config; however, this is not guaranteed." @zhztheplayer can you update the documentation to highlight this?

The tricky part is that I don't see us consistently following either a min or a max criterion when using the option (correct me if I'm wrong). It's used as a minimum in the shuffle reader, but may be used as a maximum in the shuffle writer. I haven't gone through the scan code, so I'm not sure about that part.

Maybe we can change the option name to targetBatchSize to clarify. Perhaps it's not that important whether a batch is slightly larger or smaller than this size.

@FelixYBW (Contributor)

One misunderstanding is that "we should avoid memcpy as much as possible", but in fact Gluten isn't memory-throughput bound yet. Sequential data reads and writes are cheap operations as long as the block size isn't too small (e.g. a few bytes with unpredictable sizes, where the overhead becomes branch misprediction) or too large (e.g. GB level).

@FelixYBW (Contributor)

Maybe we can change the option name to targetBatchSize to clarify.

In the long term, Velox should limit the batch size to the configured maxBatchSize, and so should Gluten. For operators like Combine, we may limit the batch size to 2 x maxBatchSize so we needn't cache the second batch.

@FelixYBW (Contributor) commented Jun 11, 2024

It's included, namely VeloxAppendBatchesExec.

Code https://github.com/apache/incubator-gluten/pull/6009/files#diff-e058841b0f54a15144e01a1089457b7e5a28a953f85df3a1e656fcf6eddfe203R38

We may need to propose this Exec to Velox

@marin-ma (Contributor) left a comment

LGTM. Thanks!

@zhztheplayer (Member, author)

@XinShuoWang I'll merge this and then do some follow-ups. Let me know if you have any comments, thanks.

Also thank you guys for reviewing. :)

@zhztheplayer (Member, author) commented Jun 11, 2024

/Benchmark Velox TPCDS

The PR disables the feature by default. Let's merge this first and benchmark in the subsequent PRs.
