
softban global rayon thread pool #4139

Open · alexpyattaev wants to merge 2 commits into master from ban_default_rayon_pool

Conversation

alexpyattaev

Problem

  • The global rayon thread pool is difficult to trace to the particular bits of code that might be using it, which makes the implications of its priorities and thread counts hard to predict. The idea is therefore to soft-ban its usage by configuring it to run only one thread. There is, unfortunately, no mechanism to permanently disable it short of forking rayon.

Summary of Changes

  • Use one of AccountsDb's existing thread pools (solAccounts) instead of the global pool
  • Set the global pool to 1 thread to discourage its use

Discussion is welcome! Forking/patching rayon could be considered, for example.
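
For illustration, a minimal sketch of what the soft ban amounts to; the function name and thread-name prefix are assumptions, not necessarily this PR's code, and rayon only allows the global pool to be configured once, early in startup:

// Minimal sketch of the soft ban: size the global rayon pool to a single
// thread so any code still reaching for it becomes easy to spot.
// Must run once, early in startup, before anything else touches rayon.
fn softban_global_rayon_pool() {
    rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .thread_name(|i| format!("solRayonGlob{i:02}"))
        .build_global()
        .expect("global rayon pool was already initialized");
}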

dashmap::mapref::entry::Entry::Occupied(mut occupied_entry) => {
    if !occupied_entry.get().iter().any(|s| s == &slot) {
        occupied_entry.get_mut().push(slot);
        self.thread_pool.install(|| {


exhaustively_verify_refcounts() is test/debug-only code. IMO I wouldn't change this.


Maybe we can use the background hash pool for this?

alexpyattaev (Author)


If this function is not running in normal modes, maybe it should have its own pool sized appropriately for its work (all cores, low priority)?
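
A sketch of what that could look like: a dedicated pool sized to all cores. rayon has no built-in priority knob, so a lower OS priority would have to be set separately (e.g. from a start handler); the function name, thread-name prefix, and the num_cpus dependency are assumptions, not code from this PR.

// Sketch: a dedicated pool for the debug-only refcount verification, sized
// to all cores. Thread priority is not shown; rayon does not expose it.
fn run_refcount_verification_on_dedicated_pool(verify: impl Fn() + Send) {
    let verify_pool = rayon::ThreadPoolBuilder::new()
        .num_threads(num_cpus::get())
        .thread_name(|i| format!("solVerifyRefs{i:02}"))
        .build()
        .expect("failed to build refcount verification pool");
    // exhaustively_verify_refcounts() would run inside this pool
    verify_pool.install(verify);
}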

alexpyattaev (Author)


moved this to thread_pool_clean as well.

    })
    .flatten()
    .collect::<Vec<_>>()

self.thread_pool.install(|| {


We should not hardcode the foreground threadpool here, as this can be called from both foreground and background tasks.

@HaoranYi Dec 17, 2024


Background processes such as shrinking may call this function, so we don't want to use the foreground thread_pool in that case.

It seems we might need to pass a flag down to decide which thread pool to use here?

alexpyattaev (Author)


Maybe a better way is to make this function accept a thread pool reference as an argument, so the caller has to decide?
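
A hedged sketch of that suggestion (the function name and body are illustrative, not the PR's code): the helper takes the pool explicitly, and each call site decides which pool it passes.

// Sketch: caller supplies the pool, so foreground and background paths can
// route the same work onto different thread pools.
fn update_index_like(thread_pool: &rayon::ThreadPool, slots: &[u64]) -> u64 {
    use rayon::prelude::*;
    thread_pool.install(|| slots.par_iter().copied().sum())
}

// foreground call site: update_index_like(&self.thread_pool, &slots)
// background call site: update_index_like(&self.thread_pool_clean, &slots)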


Sounds reasonable to me.


-- a. foreground (store_unfrozen)

1. TX foreground processing always passes "Inline", so it will not use threads.
   store_cached_inline (tx processing)
    - no threads

2. individual account store via bank.store_account()
    - no threads; only one account at a time

3. bulk account stores via bank.store_accounts, e.g. epoch rewards, rents (no more)
    - may use threads

-- b. background (store_frozen)
    - store ancient --- we can use "clean_pool" too
    - shrink --- in this case, higher up the call chain, `shrink_candidate_slots()` already uses thread_pool_clean
    - flush cache --- we can use "clean_pool" too

So we can use "clean_pool" for (b), and for a.3 we want to use a different pool (see the sketch below).
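
A hypothetical sketch of that split (the struct, enum, and field names are illustrative only): background paths share the clean pool, the bulk foreground store gets its own pool, and the inline paths stay on the caller's thread.

// Hypothetical routing of store paths to pools; names are illustrative.
struct AccountsDbLike {
    thread_pool_foreground: rayon::ThreadPool,
    thread_pool_clean: rayon::ThreadPool,
}

enum StorePath {
    CachedInline,   // a.1 / a.2: no threading needed
    ForegroundBulk, // a.3: epoch rewards, rents
    Background,     // b: store ancient, shrink, flush cache
}

impl AccountsDbLike {
    fn pool_for(&self, path: StorePath) -> Option<&rayon::ThreadPool> {
        match path {
            StorePath::CachedInline => None, // run inline on the caller's thread
            StorePath::ForegroundBulk => Some(&self.thread_pool_foreground),
            StorePath::Background => Some(&self.thread_pool_clean),
        }
    }
}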

alexpyattaev (Author)


In the current code there appear to be only two paths to update_index: one that sets the threading policy to single-threaded, and one from store_frozen. Based on this, I've set the function to default to thread_pool_clean.


I think we don't want a.3 to end up using thread_pool_clean. a.3 is on the critical path for foreground processing.

@behzadnouri

We did something like this before and it caused crashes because of the thread overflowing its stack:
solana-labs#24954
solana-labs#25481
solana-labs#25053
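
For context, those crashes were per-thread stack overflows. If that resurfaces when work moves between pools, one knob rayon exposes when building a dedicated pool is the per-thread stack size; a sketch only, not something this PR changes, with the thread count and stack size chosen arbitrarily here:

// Sketch: a dedicated pool built with a larger per-thread stack, one possible
// mitigation if deep recursion overflows the default worker stack.
fn build_pool_with_big_stack() -> rayon::ThreadPool {
    rayon::ThreadPoolBuilder::new()
        .num_threads(8)
        .stack_size(8 * 1024 * 1024) // 8 MiB per worker thread
        .build()
        .expect("failed to build pool")
}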

@HaoranYi self-requested a review December 17, 2024 01:02
@alessandrod

We did something like this before and it caused crashes because of the thread overflowing its stack: solana-labs#24954 solana-labs#25481 solana-labs#25053

Something like this what? The accountsdb code is already using dedicated pools in master today, there are only a couple of things that run almost never that seem to have been accidentally left in the global pool.

@alessandrod

It looks like the issue is solana-labs#25053, which unfortunately doesn't include the panic.

If moving something from pool A to pool B causes a panic, it sounds like that's the thing that should be investigated and fixed?

@alexpyattaev (Author)

I have this version running on an unstaked mainnet validator for several hours with no observable problems. The issues Behzad mentioned did not materialize. Which additional tests are needed to make sure this will not blow up in prod?

@alexpyattaev force-pushed the ban_default_rayon_pool branch from 38b6afa to 7dbe6b5 on December 18, 2024 10:28
Also same for exhaustively_verify_refcounts
@alexpyattaev force-pushed the ban_default_rayon_pool branch from 7dbe6b5 to 0ae42c9 on December 18, 2024 11:00
@alexpyattaev marked this pull request as ready for review December 18, 2024 11:01