[GLUTEN-8244][CORE] Softaffinity use consistent hash schedule #8245

yikf · 2024-12-16T07:55:37Z

What changes were proposed in this pull request?

#8244. This PR makes soft affinity scheduling use consistent hashing.

How was this patch tested?

unit tests and GA.

github-actions · 2024-12-16T07:56:02Z

#8244

github-actions · 2024-12-16T07:56:17Z

Run Gluten Clickhouse CI on x86

yikf · 2024-12-16T07:56:19Z

@zzcclp @jackylee-ch @zhztheplayer @PHILO-HE Could you please take a look, thanks!

github-actions · 2024-12-16T08:29:46Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-16T11:47:33Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-16T12:20:34Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-17T06:37:55Z

Run Gluten Clickhouse CI on x86

zhztheplayer · 2024-12-17T07:39:18Z

Naive question: why do we need consistent hashing here? Is it to avoid cache misses as much as possible when the cluster changes?

yikf · 2024-12-17T09:40:56Z

> Naive question: why do we need consistent hashing here? Is it to avoid cache misses as much as possible when the cluster changes?

Yes, especially when using Spark dynamic executor allocation.
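To illustrate the point about cluster changes, here is a minimal sketch that contrasts a naive modulo assignment with a consistent hash ring. It is not the `ConsistentHash` class from this PR: `RemapDemo`, its method names, the CRC32 hash, and the `#i` virtual-node suffix are all assumptions made for the example, and the modulo baseline is a generic one rather than a claim about Gluten's existing logic.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Illustrative only: nothing here is taken from the Gluten code base.
final class RemapDemo {
  // CRC32 is used because it ships with the JDK; any stable hash would do.
  static long hash(String s) {
    CRC32 crc = new CRC32();
    crc.update(s.getBytes(StandardCharsets.UTF_8));
    return crc.getValue(); // non-negative, fits in 32 bits
  }

  // Naive baseline: the owner depends directly on the executor count.
  static String moduloOwner(List<String> executors, String key) {
    return executors.get((int) (hash(key) % executors.size()));
  }

  // Places each executor on the ring at `virtualNodes` points.
  static TreeMap<Long, String> buildRing(List<String> executors, int virtualNodes) {
    TreeMap<Long, String> ring = new TreeMap<>();
    for (String executor : executors) {
      for (int i = 0; i < virtualNodes; i++) {
        // putIfAbsent keeps the first owner if two points ever collide.
        ring.putIfAbsent(hash(executor + "#" + i), executor);
      }
    }
    return ring;
  }

  // A key belongs to the first ring point at or after its hash, wrapping around.
  static String ringOwner(TreeMap<Long, String> ring, String key) {
    Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
    return (entry != null ? entry : ring.firstEntry()).getValue();
  }
}
```

If the last of four executors is dropped, `ringOwner` changes only for keys that the dropped executor owned. With `moduloOwner`, a key keeps its owner only when `hash % 4` equals `hash % 3`, which is about a quarter of the keys for uniformly distributed hashes.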

github-actions · 2024-12-17T09:44:14Z

Run Gluten Clickhouse CI on x86

FelixYBW · 2024-12-17T19:46:08Z

Run Gluten Clickhouse CI on x86

FelixYBW · 2024-12-17T19:47:46Z

What does vanilla spark do here? Is it an enhancement of Spark scheduling?

...g/apache/spark/sql/execution/datasources/clickhouse/utils/MergeTreePartsPartitionsUtil.scala

jackylee-ch · 2024-12-18T02:42:17Z

AFAIK, consistent hashing cannot solve the cache miss problem for the Velox backend, but it can bring a small improvement. For the CH backend, can consistent hashing significantly improve performance and resolve the cache misses? @zzcclp @loneylee

zhztheplayer · 2024-12-18T03:02:45Z

> AFAIK, consistent hashing cannot solve the cache miss problem for the Velox backend, but it can bring a small improvement.

Thank you for helping confirm. Also, is it possible that the local cache could benefit?

> Yes, especially when using Spark dynamic executor allocation.

Yes, I assume this is another key point. Dynamic Allocation ON + Local Cache ON: could this be a typical target scenario for the change? @yikf, are you using local cache so far?

jackylee-ch · 2024-12-18T03:51:04Z

> Also, is it possible that the local cache could benefit?

Yes. With consistent hashing, we can schedule the cache location in advance, which is better than knowing the cache location after execution. And consistent hashing is useful for cluster changes, especially when executors are rescheduled due to executor deaths.
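A minimal sketch of what "scheduling the cache location in advance" could look like: the preferred executors for a file are derived from the file path and the current ring alone, before any task has run. `PreferredLocationRing`, `preferredExecutors`, and the CRC32 hash are hypothetical names and choices for illustration, not the API added by this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical helper, not the ConsistentHash class from this PR.
final class PreferredLocationRing {
  private final TreeMap<Long, String> ring = new TreeMap<>();

  private static long hash(String s) {
    CRC32 crc = new CRC32();
    crc.update(s.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  }

  void addExecutor(String executorId, int virtualNodes) {
    for (int i = 0; i < virtualNodes; i++) {
      ring.putIfAbsent(hash(executorId + "#" + i), executorId);
    }
  }

  // Walks clockwise from the file's position and returns up to `count`
  // distinct executors; the result depends only on the path and the ring.
  List<String> preferredExecutors(String filePath, int count) {
    Set<String> picked = new LinkedHashSet<>();
    long start = hash(filePath);
    collect(ring.tailMap(start, true).values(), picked, count);
    collect(ring.headMap(start, false).values(), picked, count); // wrap around
    return new ArrayList<>(picked);
  }

  private static void collect(Iterable<String> owners, Set<String> picked, int count) {
    for (String owner : owners) {
      if (picked.size() >= count) {
        return;
      }
      picked.add(owner);
    }
  }
}
```

Because the answer is a pure function of the path and the ring, a scheduler can compute it at planning time and get the same executors on every run, as long as the set of executors is unchanged.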

yikf · 2024-12-18T06:13:47Z

> > AFAIK, consistent hashing cannot solve the cache miss problem for the Velox backend, but it can bring a small improvement.
>
> Thank you for helping confirm. Also, is it possible that the local cache could benefit?
>
> > Yes, especially when using Spark dynamic executor allocation.
>
> Yes, I assume this is another key point. Dynamic Allocation ON + Local Cache ON: could this be a typical target scenario for the change? @yikf, are you using local cache so far?

As @jackylee-ch said, pure soft affinity scheduling using consistent hashing would be better than the current logic in scenarios where executors change, which is the charm of consistent hashing. For the local cache, I assume there will be some benefit from the scheduling optimization as well.

We have used it with TPC-DS, and we are also looking for scenarios in our online environment. Additionally, we would like to use soft affinity scheduling to optimize other scheduling within our organization.

github-actions · 2024-12-18T06:14:23Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-18T06:20:46Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-18T12:54:07Z

Run Gluten Clickhouse CI on x86

FelixYBW · 2024-12-18T23:14:32Z

> We have used it with TPC-DS, and we are also looking for scenarios in our online environment. Additionally, we would like to use soft affinity scheduling to optimize other scheduling within our organization.

Is it an enhancement of vanilla Spark as well? or Gluten only?

jackylee-ch · 2024-12-19T01:44:29Z

> Is it an enhancement of vanilla Spark as well? or Gluten only?

AFAIK, it is only used in Gluten; vanilla Spark doesn't need this.

yikf · 2024-12-19T02:59:44Z

> > We have used it with TPC-DS, and we are also looking for scenarios in our online environment. Additionally, we would like to use soft affinity scheduling to optimize other scheduling within our organization.
>
> Is it an enhancement of vanilla Spark as well? or Gluten only?

It is currently used only by Gluten, but soft affinity scheduling is relatively independent, and we can optimize it internally with our internal Spark version.

github-actions · 2024-12-19T06:11:53Z

Run Gluten Clickhouse CI on x86

yikf · 2024-12-20T02:57:32Z

@jackylee-ch @zhztheplayer @FelixYBW Hi, could you please take another look?

jackylee-ch · 2024-12-20T02:44:25Z

gluten-core/src/main/scala/org/apache/gluten/softaffinity/SoftAffinityManager.scala

@@ -88,6 +90,13 @@ abstract class AffinityManager extends LogLevelUtil with Logging {
        }
      })

+  private var hashRing: Option[ConsistentHash[ExecutorNode]] = _


Why do we need an Option here? Since we must init the hashRing, why not define it directly here?

Because the strategy is fundamentally extensible, and there may be other strategies in the future besides the consistent hash strategy, I think we should keep an extensible implementation.

Okay. Then maybe move the hashRing to ConsistentHashSoftAffinityStrategy, which I think would be easier to extend.

Thanks for your suggestion. Only the manager is aware of executor changes right now, so moving the ring to the strategy would require the strategy to expose interfaces for executor changes. The current soft affinity code is not very friendly in terms of extensibility; we can refactor it in subsequent PRs to achieve better extensibility.
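For readers following this exchange, here is a rough sketch of the refactoring being deferred: a strategy that receives executor add/remove events itself, so the ring can live inside the strategy and the manager does not need an `Option`-typed field. Every name below (`ExecutorAwareStrategy`, `ConsistentHashStrategy`, `ManagerSketch`) is hypothetical and not part of this PR; thread safety is omitted.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical contract: the strategy itself is notified of executor changes.
interface ExecutorAwareStrategy {
  void onExecutorAdded(String executorId);
  void onExecutorRemoved(String executorId);
  Optional<String> preferredExecutor(String filePath);
}

// The ring is owned by the strategy rather than by the manager.
final class ConsistentHashStrategy implements ExecutorAwareStrategy {
  private static final int VIRTUAL_NODES = 50; // arbitrary value for the sketch
  private final TreeMap<Long, String> ring = new TreeMap<>();

  private static long hash(String s) {
    CRC32 crc = new CRC32();
    crc.update(s.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  }

  @Override
  public void onExecutorAdded(String executorId) {
    for (int i = 0; i < VIRTUAL_NODES; i++) {
      ring.putIfAbsent(hash(executorId + "#" + i), executorId);
    }
  }

  @Override
  public void onExecutorRemoved(String executorId) {
    for (int i = 0; i < VIRTUAL_NODES; i++) {
      // Two-argument remove drops a point only if this executor owns it.
      ring.remove(hash(executorId + "#" + i), executorId);
    }
  }

  @Override
  public Optional<String> preferredExecutor(String filePath) {
    if (ring.isEmpty()) {
      return Optional.empty();
    }
    Map.Entry<Long, String> entry = ring.ceilingEntry(hash(filePath));
    return Optional.of((entry != null ? entry : ring.firstEntry()).getValue());
  }
}

// Stand-in for the manager: it only forwards executor events to the strategy.
final class ManagerSketch {
  private final ExecutorAwareStrategy strategy;

  ManagerSketch(ExecutorAwareStrategy strategy) {
    this.strategy = strategy;
  }

  void executorAdded(String id) { strategy.onExecutorAdded(id); }
  void executorRemoved(String id) { strategy.onExecutorRemoved(id); }
  Optional<String> locate(String filePath) { return strategy.preferredExecutor(filePath); }
}
```

With this shape, a different strategy could be swapped in by passing another `ExecutorAwareStrategy` to the manager, at the cost of widening the strategy interface.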

gluten-core/src/main/java/org/apache/gluten/hash/ConsistentHash.java

jackylee-ch · 2024-12-20T03:01:48Z

Basically looks good to me; I left a few comments.
BTW, have you tested this PR with duplicateReadingDetect?

jackylee-ch · 2024-12-20T03:04:41Z

Also cc @zzcclp @zhli1142015, any more questions about this PR?

yikf · 2024-12-20T03:05:56Z

> Basically looks good to me; I left a few comments. BTW, have you tested this PR with duplicateReadingDetect?

The UTs contain the duplicateReadingDetect suite, and it passed UT validation.

zhli1142015 · 2024-12-20T03:20:46Z

LGTM.
BTW, I saw you also use SA with local cache, which cache do you use?

yikf · 2024-12-20T03:29:04Z

> LGTM. BTW, I saw you also use SA with local cache, which cache do you use?

We use Velox cache on TPC-DS, mainly with SSDs and RAM, and we are currently exploring scenarios for the online environment.

github-actions · 2024-12-20T03:31:25Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-20T03:32:34Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-20T03:53:20Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-20T04:55:23Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-20T05:15:51Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-20T05:16:24Z

Run Gluten Clickhouse CI on x86

zzcclp · 2024-12-20T05:55:26Z

> AFAIK, consistent hashing cannot solve the cache miss problem for the Velox backend, but it can bring a small improvement. For the CH backend, can consistent hashing significantly improve performance and resolve the cache misses? @zzcclp @loneylee

For the CH backend, consistent hashing will reduce cache misses; I don't know about the performance impact.

jackylee-ch

LGTM.

zhztheplayer

👍

zhztheplayer · 2024-12-20T08:57:11Z

gluten-core/src/main/scala/org/apache/gluten/softaffinity/SoftAffinityManager.scala

@@ -88,6 +90,13 @@ abstract class AffinityManager extends LogLevelUtil with Logging {
        }
      })

+  private var hashRing: Option[ConsistentHash[ExecutorNode]] = None


Can we make it a val? Likely it's only set once.

thanks, addressed.

github-actions · 2024-12-21T02:29:02Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-21T02:29:43Z

Run Gluten Clickhouse CI on x86

github-actions bot added the CORE works for Gluten Core label Dec 16, 2024

yikf force-pushed the consistent-hash branch from 3f6c48e to 072fb15 Compare December 16, 2024 11:47

github-actions bot added the CLICKHOUSE label Dec 16, 2024

yikf force-pushed the consistent-hash branch from 072fb15 to 5cf69ce Compare December 16, 2024 12:20

yikf force-pushed the consistent-hash branch from 5cf69ce to 8986da8 Compare December 17, 2024 06:37

jackylee-ch reviewed Dec 18, 2024

...g/apache/spark/sql/execution/datasources/clickhouse/utils/MergeTreePartsPartitionsUtil.scala (outdated, resolved)

...g/apache/spark/sql/execution/datasources/clickhouse/utils/MergeTreePartsPartitionsUtil.scala (outdated, resolved)

yikf force-pushed the consistent-hash branch from ee4dfad to 23a585e Compare December 18, 2024 06:20

jackylee-ch reviewed Dec 20, 2024

yikf force-pushed the consistent-hash branch from 52e3cea to 6ada458 Compare December 20, 2024 03:30

yikf force-pushed the consistent-hash branch from 146cc6f to 6e37c48 Compare December 20, 2024 03:52

yikf force-pushed the consistent-hash branch from fa27efb to 28791b4 Compare December 20, 2024 05:15

jackylee-ch approved these changes Dec 20, 2024

zhztheplayer approved these changes Dec 20, 2024

consistent hash

46f9660

yikf force-pushed the consistent-hash branch from 3310a78 to 46f9660 Compare December 21, 2024 02:28

Merge branch 'main' into consistent-hash

abb191f

jackylee-ch merged commit 6c38842 into apache:main Dec 23, 2024
46 checks passed

yikf deleted the consistent-hash branch December 23, 2024 07:59
