[GLUTEN-8244][CORE] Softaffinity use consistent hash schedule #8245

Merged (2 commits) on Dec 23, 2024

@yikf
Contributor

yikf commented Dec 16, 2024

What changes were proposed in this pull request?

This PR makes soft affinity use consistent-hash scheduling (see #8244).

How was this patch tested?

Unit tests and GA (GitHub Actions).

@github-actions github-actions bot added the CORE works for Gluten Core label Dec 16, 2024

#8244

Run Gluten Clickhouse CI on x86

@yikf
Contributor Author

yikf commented Dec 16, 2024

@zzcclp @jackylee-ch @zhztheplayer @PHILO-HE Could you please take a look? Thanks!

@zhztheplayer
Member

zhztheplayer commented Dec 17, 2024

Naive question: why do we need consistent hashing here? Is it to avoid cache misses as much as possible when the cluster changes?

@yikf
Contributor Author

yikf commented Dec 17, 2024

> Naive question: why do we need consistent hashing here? Is it to avoid cache misses as much as possible when the cluster changes?

Yes, especially when using Spark dynamic executor allocation.

@FelixYBW
Contributor

What does vanilla Spark do here? Is this an enhancement of Spark scheduling?

@jackylee-ch
Contributor

AFAIK, consistent hashing cannot solve the cache miss problem for the Velox backend, but it can bring a small improvement. For the CH backend, can consistent hashing significantly improve performance and resolve the cache misses? @zzcclp @loneylee

@zhztheplayer
Member

> AFAIK, consistent hashing cannot solve the cache miss problem for the Velox backend, but it can bring a small improvement.

Thank you for helping confirm. Also, could the local cache benefit from this?

> Yes, especially when using Spark dynamic executor allocation.

Yes, I assume this is another key point. Dynamic Allocation ON + Local Cache ON: could this be a typical target scenario for the change? @yikf are you using the local cache so far?

@jackylee-ch
Contributor

> Also, could the local cache benefit from this?

Yes. With consistent hashing, we can decide the cache location in advance at scheduling time, which is better than only learning the cache location after execution. Consistent hashing also helps when the cluster changes, especially when tasks are rescheduled because executors die.
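
To illustrate the point about executor changes, here is a minimal sketch of a consistent hash ring. It is not the `ConsistentHash` class touched by this PR: `HashRingSketch`, the MD5-based position function, and the use of plain strings for executors and file keys are all assumptions made for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import scala.collection.immutable.TreeMap

// Hypothetical, simplified ring; NOT Gluten's ConsistentHash implementation.
final class HashRingSketch(replicas: Int) {
  // Ring position -> owning executor, kept sorted by position.
  private var ring = TreeMap.empty[Long, String]

  // Assumption: the first 8 bytes of an MD5 digest serve as the ring position.
  private def position(key: String): Long = {
    val digest = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8))
    digest.take(8).foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xffL))
  }

  // Each executor is placed on the ring `replicas` times (virtual nodes).
  def addNode(node: String): Unit =
    (0 until replicas).foreach(i => ring = ring.updated(position(s"$node#$i"), node))

  // Dropping an executor removes only its own virtual nodes.
  def removeNode(node: String): Unit =
    ring = ring.filter { case (_, owner) => owner != node }

  // Walk clockwise from the key's position to the first virtual node,
  // wrapping around to the start of the ring if necessary.
  def locate(key: String): Option[String] =
    if (ring.isEmpty) None
    else {
      val it = ring.iteratorFrom(position(key))
      Some(if (it.hasNext) it.next()._2 else ring.head._2)
    }
}
```

With this structure, removing one executor only remaps the keys that executor owned; every other key keeps its previous owner, so caches on the surviving executors stay warm. A plain `hash(key) % numExecutors` scheme would instead remap most keys whenever the executor count changes.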

@yikf
Contributor Author

yikf commented Dec 18, 2024

> > AFAIK, consistent hashing cannot solve the cache miss problem for the Velox backend, but it can bring a small improvement.
>
> Thank you for helping confirm. Also, could the local cache benefit from this?
>
> > Yes, especially when using Spark dynamic executor allocation.
>
> Yes, I assume this is another key point. Dynamic Allocation ON + Local Cache ON: could this be a typical target scenario for the change? @yikf are you using the local cache so far?

As @jackylee-ch said, pure soft affinity scheduling with consistent hashing is better than the current logic in scenarios where executors change, which is the charm of consistent hashing. For the local cache, I assume there will be some benefit from the scheduling optimization as well.

We have used it with TPC-DS, and we are also looking for scenarios in our online environment. Additionally, we would like to use soft affinity scheduling to optimize other scheduling within our organization.

@FelixYBW
Contributor

> We have used it with TPC-DS, and we are also looking for scenarios in our online environment. Additionally, we would like to use soft affinity scheduling to optimize other scheduling within our organization.

Is it an enhancement for vanilla Spark as well, or for Gluten only?

@jackylee-ch
Contributor

> Is it an enhancement for vanilla Spark as well, or for Gluten only?

AFAIK, it is only used by Gluten; vanilla Spark doesn't need this.

@yikf
Contributor Author

yikf commented Dec 19, 2024

> > We have used it with TPC-DS, and we are also looking for scenarios in our online environment. Additionally, we would like to use soft affinity scheduling to optimize other scheduling within our organization.
>
> Is it an enhancement for vanilla Spark as well, or for Gluten only?

It is currently only used by Gluten, but soft affinity scheduling is relatively independent, so we can also optimize with it internally on our internal Spark version.

@yikf
Contributor Author

yikf commented Dec 20, 2024

@jackylee-ch @zhztheplayer @FelixYBW Hi, could you please take another look?

@@ -88,6 +90,13 @@ abstract class AffinityManager extends LogLevelUtil with Logging {
}
})

private var hashRing: Option[ConsistentHash[ExecutorNode]] = _
Contributor

Why do we need an Option here? Since we must initialize hashRing anyway, why not define it directly here?

Contributor Author

Because the strategy is meant to be extensible, and there may be other strategies besides the consistent-hash strategy in the future, I think we should keep the implementation extensible.

Contributor

Okay. Then maybe move hashRing into ConsistentHashSoftAffinityStrategy, which I think would be easier to extend.

Contributor Author

Thanks for your suggestion. Currently only the manager is aware of executor changes, so moving the ring into the strategy would require the strategy to expose interfaces for those changes. The current soft affinity code is not very friendly to extension; we can refactor it in follow-up PRs to improve extensibility.
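
As a rough illustration of the kind of interface described above, the sketch below shows a manager forwarding executor events to a strategy that owns its own placement state. None of these names exist in the PR: `ExecutorChangeListener`, `RingOwningStrategySketch`, and `ManagerSketch` are hypothetical, and a sorted set stands in for the hash ring to keep the example short.

```scala
import scala.collection.mutable

// Hypothetical callback interface; the PR does not define anything like it.
trait ExecutorChangeListener {
  def onExecutorAdded(executorId: String): Unit
  def onExecutorRemoved(executorId: String): Unit
}

// Hypothetical strategy that owns its placement state.
// A sorted set stands in for the hash ring here.
final class RingOwningStrategySketch extends ExecutorChangeListener {
  private val executors = mutable.TreeSet.empty[String]

  override def onExecutorAdded(executorId: String): Unit = executors += executorId
  override def onExecutorRemoved(executorId: String): Unit = executors -= executorId

  def knownExecutors: Seq[String] = executors.toSeq
}

// Stand-in for the manager, the only component that observes executor events.
// It forwards each event to every registered listener.
final class ManagerSketch(listeners: Seq[ExecutorChangeListener]) {
  def executorAdded(id: String): Unit = listeners.foreach(_.onExecutorAdded(id))
  def executorRemoved(id: String): Unit = listeners.foreach(_.onExecutorRemoved(id))
}
```

The trade-off under discussion is visible here: the strategy becomes self-contained, but only if the manager gains a notification path, which is the refactoring deferred to follow-up PRs.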

@jackylee-ch
Contributor

Basically looks good to me; I left a few comments.
BTW, have you tested this PR with duplicateReadingDetect?

@jackylee-ch
Contributor

Also cc @zzcclp @zhli1142015: any more questions about this PR?

@yikf
Contributor Author

yikf commented Dec 20, 2024

> Basically looks good to me; I left a few comments. BTW, have you tested this PR with duplicateReadingDetect?

The UTs include the duplicateReadingDetect suite, and it passed.

@zhli1142015
Contributor

LGTM.
BTW, I saw you also use SA (soft affinity) with a local cache. Which cache do you use?

@yikf
Contributor Author

yikf commented Dec 20, 2024

> LGTM. BTW, I saw you also use SA (soft affinity) with a local cache. Which cache do you use?

We use the Velox cache on TPC-DS, mainly with SSDs and RAM, and we are currently exploring scenarios for the online environment.

@zzcclp
Contributor

zzcclp commented Dec 20, 2024

> AFAIK, consistent hashing cannot solve the cache miss problem for the Velox backend, but it can bring a small improvement. For the CH backend, can consistent hashing significantly improve performance and resolve the cache misses? @zzcclp @loneylee

For the CH backend, consistent hashing will reduce cache misses; I don't know about the performance impact.

Contributor

@jackylee-ch jackylee-ch left a comment

LGTM.

Member

@zhztheplayer zhztheplayer left a comment

👍

@@ -88,6 +90,13 @@ abstract class AffinityManager extends LogLevelUtil with Logging {
}
})

private var hashRing: Option[ConsistentHash[ExecutorNode]] = None
Member

Can we make it a val? Likely it's only set once.

Contributor Author

Thanks, addressed.
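
For readers following this thread, a minimal sketch of the pattern being asked for: an optional field assigned once at construction and declared as a `val`. The merged code is not shown in this conversation, so everything here is an assumption; `RingSketch`, `ManagerWithOptionalRing`, and the `useConsistentHash` flag are hypothetical names.

```scala
// Hypothetical stand-in for a hash ring; not a class from the PR.
final class RingSketch(val replicas: Int)

// Hypothetical manager; the flag and constructor shape are assumptions.
final class ManagerWithOptionalRing(useConsistentHash: Boolean, replicas: Int) {
  // Assigned exactly once at construction, so it can be a val.
  // None means another strategy is active and no ring is needed.
  val hashRing: Option[RingSketch] =
    if (useConsistentHash) Some(new RingSketch(replicas)) else None

  // Callers handle both cases without null checks.
  def describe: String =
    hashRing.map(r => s"ring with ${r.replicas} replicas").getOrElse("no ring")
}
```

Keeping the `Option` leaves room for strategies that need no ring, while the `val` documents that the choice is made once and never reassigned.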

@jackylee-ch jackylee-ch merged commit 6c38842 into apache:main Dec 23, 2024
46 checks passed
@yikf yikf deleted the consistent-hash branch December 23, 2024 07:59
Labels
CLICKHOUSE CORE works for Gluten Core
6 participants