
[VL] Allow specifying maximum batch size for batch resizing #6670

Merged: 6 commits into apache:main on Aug 2, 2024

Conversation

@zhztheplayer (Member) commented Aug 1, 2024

This is a rework of #6009.

  1. Rename the operator VeloxAppendBatchesExec to VeloxResizeBatchesExec, and have it support batch-splitting as well as batch-appending.
  2. Remove the old config options spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle and spark.gluten.sql.columnar.backend.velox.minBatchSizeForShuffle.
  3. Add new config options:
    • spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleInput=true
      Enables batch resizing for shuffle input.
    • spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleInput.range=500~1000
      Specifies the batch resizing target size range: [500, 1000].
  4. Default range for shuffle batch resizing: [0.25 * 4096, 4 * 4096] (changed from [0.8 * 4096, +∞)). A sketch of the implied policy follows.
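
For illustration, here is a minimal Scala sketch of the policy such a range implies. ResizeRange, decide, and sliceLengths are hypothetical names for this sketch only, not this PR's actual API; the real logic lives in VeloxResizeBatchesExec and its C++ backing.

// Hypothetical sketch: a [min, max] target range drives three outcomes.
final case class ResizeRange(min: Int, max: Int)

// Split an oversized row count into slice lengths of at most range.max,
// mirroring sliceLength = min(maxOutputBatchSize, remaining) on the C++ side.
def sliceLengths(numRows: Int, range: ResizeRange): Seq[Int] =
  Iterator.iterate(numRows)(_ - range.max).takeWhile(_ > 0)
    .map(remaining => math.min(range.max, remaining)).toSeq

def decide(numRows: Int, range: ResizeRange): String =
  if (numRows < range.min) "append"     // merge with neighboring batches until >= min
  else if (numRows > range.max) "split" // slice into chunks of at most max rows
  else "keep"                           // already within [min, max]

// With the default range [0.25 * 4096, 4 * 4096] = [1024, 16384]:
// decide(500, ResizeRange(1024, 16384))         == "append"
// decide(40000, ResizeRange(1024, 16384))       == "split"
// sliceLengths(40000, ResizeRange(1024, 16384)) == Seq(16384, 16384, 7232)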


github-actions bot commented Aug 1, 2024

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename the commit message and pull request title to the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:


github-actions bot commented Aug 1, 2024

Run Gluten Clickhouse CI


@Yohahaha (Contributor) commented Aug 1, 2024

Would you consider implementing such logic as a Velox operator, like ValueStream? Then it could be reused in these places: before shuffle write, before agg, and after scan. cc @zhli1142015

And lots of Java/Scala changes could be removed.

@zhztheplayer (Member, Author) replied

> Would you consider implementing such logic as a Velox operator, like ValueStream?

It will be the next step on this topic. See a related comment in another PR.

The major advantage of doing it in Velox is avoiding breaking one Velox task into several.

> And lots of Java/Scala changes could be removed.

Regarding the code, it's expected that most of the C++ code can be removed then. However, a JVM-side plan node might still be needed to map to Velox's corresponding plan node.


@Yohahaha (Contributor) commented Aug 1, 2024

> Regarding the code, it's expected that most of the C++ code can be removed then. However, a JVM-side plan node might still be needed to map to Velox's corresponding plan node.

Yeah, indeed. I have done a POC of the above ideas, with just one rule, one transformer, and one Velox operator added.

@zhztheplayer (Member, Author) replied

> Yeah, indeed. I have done a POC of the above ideas, with just one rule, one transformer, and one Velox operator added.

That sounds great. Do you intend to contribute it to upstream Velox? If so, please feel free to open a PR to refresh this feature in Gluten once the Velox change has landed.

@Yohahaha (Contributor) commented Aug 1, 2024

> That sounds great. Do you intend to contribute it to upstream Velox? If so, please feel free to open a PR to refresh this feature in Gluten once the Velox change has landed.

Yes, I'm preparing the code.

github-actions bot added the labels CORE (works for Gluten Core) and VELOX on Aug 2, 2024

zhztheplayer marked this pull request as ready for review on Aug 2, 2024, 06:08
// No rows remain in the current input batch: signal end of output.
if (remainingLength == 0) {
  return nullptr;
}
// Cap each output slice at the configured maximum output batch size.
int32_t sliceLength = std::min(maxOutputBatchSize_, remainingLength);
Contributor:

Why use maxOutputBatchSize_ rather than the default batch size (4096)?

Member (Author):

That's a good point. I was worried that batch slicing might add extra overhead when the size is only slightly larger than 4096, say 5000 (which would be sliced into batches of 4096 and 904 rows).

I can raise a PR later to change the value to a more reasonable one. Until then, I think it's OK to use a large value, since that also aligns with the current code (which doesn't split at all).

// Expect exactly one '~' separator, e.g. "500~1000".
assert(pattern.count(_ == '~') == 1, s"Invalid range pattern for batch resizing: $pattern")
val splits = pattern.split('~')
assert(splits.length == 2)
// Parse "min~max" into the inclusive resize target range.
ResizeRange(splits(0).toInt, splits(1).toInt)
Contributor:

Should we check that the min size is always < COLUMNAR_MAX_BATCH_SIZE and the max size is > COLUMNAR_MAX_BATCH_SIZE?

Member (Author):

Is there a reason for doing such a check?

To me, it's not necessary to create tight coupling between this range and COLUMNAR_MAX_BATCH_SIZE; the former's default value was merely derived from the latter.
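
For completeness, here is a minimal sketch of what the suggested check could look like if one did want the coupling. parseResizeRange and its maxBatchSize parameter are hypothetical names for this sketch, and this PR deliberately does not implement such a check.

// Hypothetical sketch of the reviewer's suggested validation; the PR keeps the
// range decoupled from COLUMNAR_MAX_BATCH_SIZE, so this is illustrative only.
final case class ResizeRange(min: Int, max: Int)

def parseResizeRange(pattern: String, maxBatchSize: Int): ResizeRange = {
  val splits = pattern.split('~')
  require(splits.length == 2, s"Invalid range pattern for batch resizing: $pattern")
  val range = ResizeRange(splits(0).trim.toInt, splits(1).trim.toInt)
  require(range.min <= range.max, s"Empty range for batch resizing: $pattern")
  // The proposed coupling: the range should bracket the configured batch size.
  require(range.min < maxBatchSize && range.max > maxBatchSize,
    s"Range $pattern does not bracket the configured batch size $maxBatchSize")
  range
}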

@marin-ma (Contributor) left a comment:

LGTM. Thanks!

zhztheplayer merged commit 94b79c7 into apache:main on Aug 2, 2024
50 checks passed
@FelixYBW (Contributor) commented Aug 2, 2024

Sorry, I just noticed this PR.
The config spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleInput.range=500~1000 is too complex. The maximum batch size should always honor the configured batch size. Slicing has little overhead; we just need to define the threshold for merging batches.

@FelixYBW (Contributor) commented Aug 2, 2024

> Yes, I'm preparing the code.

We can merge small batches into a large one, but the slicing shouldn't happen; instead, we should fix the operators that don't honor the configured maximum batch size. @jinchengchenghh will fix the generator operator. The other operator I know of is hash join. A sketch of the merge-only approach follows.
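
To make the merge-only idea concrete, here is a minimal Scala sketch under stated assumptions: Batch(numRows) stands in for a real columnar batch, flush() for real columnar concatenation, and coalesce is a hypothetical name, not Gluten's actual API.

// Hypothetical sketch of a merge-only coalescer with a single threshold:
// small batches are buffered and concatenated, large batches pass through
// untouched, and no slicing ever happens.
import scala.collection.mutable.ArrayBuffer

final case class Batch(numRows: Int)

def coalesce(input: Iterator[Batch], mergeThreshold: Int): Iterator[Batch] =
  new Iterator[Batch] {
    private val pending = ArrayBuffer.empty[Batch]
    private var pendingRows = 0
    private var readyLarge: Option[Batch] = None

    private def flush(): Batch = {
      val merged = Batch(pendingRows) // placeholder for concatenating `pending`
      pending.clear(); pendingRows = 0
      merged
    }

    def hasNext: Boolean = readyLarge.isDefined || pending.nonEmpty || input.hasNext

    def next(): Batch = {
      if (readyLarge.isDefined) { val b = readyLarge.get; readyLarge = None; return b }
      while (input.hasNext) {
        val b = input.next()
        if (b.numRows >= mergeThreshold) {
          if (pending.isEmpty) return b // already large enough: pass through
          readyLarge = Some(b)          // emit the merged buffer first
          return flush()
        }
        pending += b; pendingRows += b.numRows
        if (pendingRows >= mergeThreshold) return flush()
      }
      flush() // input exhausted: emit whatever is buffered
    }
  }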

Labels: CORE (works for Gluten Core), VELOX