-
Notifications
You must be signed in to change notification settings - Fork 466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GLUTEN-8244][CORE] Softaffinity use consistent hash schedule #8245
Conversation
Run Gluten Clickhouse CI on x86 |
@zzcclp @jackylee-ch @zhztheplayer @PHILO-HE Could you please take a look, thanks! |
Run Gluten Clickhouse CI on x86 |
3f6c48e
to
072fb15
Compare
Run Gluten Clickhouse CI on x86 |
072fb15
to
5cf69ce
Compare
Run Gluten Clickhouse CI on x86 |
5cf69ce
to
8986da8
Compare
Run Gluten Clickhouse CI on x86 |
Naive question: why we need consistent hash here? Is it for avoiding cache misses to the max extent when cluster gets changed? |
Yes, especially when using the spark dynamic executor. |
Run Gluten Clickhouse CI on x86 |
1 similar comment
Run Gluten Clickhouse CI on x86 |
What does vanilla spark do here? Is it an enhancement of Spark scheduling? |
...g/apache/spark/sql/execution/datasources/clickhouse/utils/MergeTreePartsPartitionsUtil.scala
Outdated
Show resolved
Hide resolved
...g/apache/spark/sql/execution/datasources/clickhouse/utils/MergeTreePartsPartitionsUtil.scala
Outdated
Show resolved
Hide resolved
Thank you for helping confirm. Also, is it possible that the local cache can be benefited?
Yes I assume this is another key point. |
Yes. With consistent hashing, we can schedule the cache location in advance, which is better than knowing the cache location after execution. And consistent hashing is useful for cluster changes, especially when executors are rescheduled due to executor deaths. |
As @jackylee-ch said, pure soft affinity scheduling using consistent hashing would be better than the current logic in scenarios where executors change, which is the charm of consistent hashing. for local cache, I assume that there will be some benefits from the scheduling optimization as well. We have used it in the TPC-DS, and we are also looking for scenarios in our online environment. Additionally, we would like to use soft affinity scheduling to optimize other scheduling within our organization. |
Run Gluten Clickhouse CI on x86 |
ee4dfad
to
23a585e
Compare
Run Gluten Clickhouse CI on x86 |
1 similar comment
Run Gluten Clickhouse CI on x86 |
Is it an enhancement of vanilla Spark as well? or Gluten only? |
AFAIK, it is only used for Gluten, vanilla Spark doesn't need this. |
It is currently only using gluten, but the soft affinity scheduling is relatively independent, and we can optimize it internally with the internal Spark version. |
Run Gluten Clickhouse CI on x86 |
@jackylee-ch @zhztheplayer @FelixYBW Hi, Could you please take a look another? |
@@ -88,6 +90,13 @@ abstract class AffinityManager extends LogLevelUtil with Logging { | |||
} | |||
}) | |||
|
|||
private var hashRing: Option[ConsistentHash[ExecutorNode]] = _ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need an Option here? Since we must init the hashRing, why not directly defined here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the strategy is fundamentally scalable, and there may be other strategies in the future besides the consistentHash strategy, I think we should keep a scalable implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okey. Then maybe move the hashRing
to ConsistentHashSoftAffinityStrategy, which I think will be better expanded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for you suggestion, only the manager is aware of the changes in the executor now, moving to the strategy requires that the strategy has the relevant interfaces for the executor's changes. the current soft affinity code is not very friendly in terms of extensibility, and we can refactor it in subsequent PRs to achieve better extensibility.
Basically look good to me, left a few comments. |
Also cc @zzcclp @zhli1142015 , any more question about this pr? |
The UTs contains the |
LGTM. |
We use Velox cache on TPC-DS, mainly with SSDs and RAM, and we are currently exploring scenarios for the online environment. |
52e3cea
to
6ada458
Compare
Run Gluten Clickhouse CI on x86 |
1 similar comment
Run Gluten Clickhouse CI on x86 |
146cc6f
to
6e37c48
Compare
Run Gluten Clickhouse CI on x86 |
1 similar comment
Run Gluten Clickhouse CI on x86 |
fa27efb
to
28791b4
Compare
Run Gluten Clickhouse CI on x86 |
1 similar comment
Run Gluten Clickhouse CI on x86 |
For the ch backend, the consistent hashing will reduce the cache missing, I don't know what about the performance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -88,6 +90,13 @@ abstract class AffinityManager extends LogLevelUtil with Logging { | |||
} | |||
}) | |||
|
|||
private var hashRing: Option[ConsistentHash[ExecutorNode]] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make it a val
? Likely it's only set once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, addressed.
3310a78
to
46f9660
Compare
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
What changes were proposed in this pull request?
#8244, This PR aims to make softaffinity use consistent hash schedule.
How was this patch tested?
unit tests and GA.