
softban global rayon thread pool #4139

Open · alexpyattaev wants to merge 2 commits into master from ban_default_rayon_pool

Conversation

alexpyattaev

Problem

  • The global rayon thread pool is difficult to trace to the particular bits of code that might be using it, which makes the implications of its priorities and thread counts hard to predict. The idea is therefore to soft-ban its usage by configuring it to run only one thread. There is, unfortunately, no mechanism to permanently disable it short of forking rayon.

Summary of Changes

  • Use one of AccountsDb's existing thread pools (solAccounts) instead of the global pool
  • Set the global pool to 1 thread to discourage its use

Discussion is welcome! Forking/patching rayon could be considered, for example.
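
For illustration, a minimal sketch of what the soft ban amounts to; the function name and thread-name prefix are assumptions, not necessarily this PR's code, and rayon only allows the global pool to be configured once, early in startup:

// Minimal sketch of the soft ban: size the global rayon pool to a single
// thread so any code still reaching for it becomes easy to spot.
// Must run once, early in startup, before anything else touches rayon.
fn softban_global_rayon_pool() {
    rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .thread_name(|i| format!("solRayonGlob{i:02}"))
        .build_global()
        .expect("global rayon pool was already initialized");
}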

dashmap::mapref::entry::Entry::Occupied(mut occupied_entry) => {
    if !occupied_entry.get().iter().any(|s| s == &slot) {
        occupied_entry.get_mut().push(slot);
        self.thread_pool.install(|| {


exhaustively_verify_refcounts() is test/debug-only code. IMO I wouldn't change this.


Maybe we can use the background hash pool for this?

alexpyattaev (Author)


If this function is not running in normal modes, maybe it should have its own pool sized appropriately for its work (all cores, low priority)?
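
A sketch of what that could look like: a dedicated pool sized to all cores. rayon has no built-in priority knob, so a lower OS priority would have to be set separately (e.g. from a start handler); the function name, thread-name prefix, and the num_cpus dependency are assumptions, not code from this PR.

// Sketch: a dedicated pool for the debug-only refcount verification, sized
// to all cores. Thread priority is not shown; rayon does not expose it.
fn run_refcount_verification_on_dedicated_pool(verify: impl Fn() + Send) {
    let verify_pool = rayon::ThreadPoolBuilder::new()
        .num_threads(num_cpus::get())
        .thread_name(|i| format!("solVerifyRefs{i:02}"))
        .build()
        .expect("failed to build refcount verification pool");
    // exhaustively_verify_refcounts() would run inside this pool
    verify_pool.install(verify);
}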

alexpyattaev (Author)


moved this to thread_pool_clean as well.

    })
    .flatten()
    .collect::<Vec<_>>()

self.thread_pool.install(|| {


We should not hardcode the foreground threadpool here, as this can be called from both foreground and background tasks.

@HaoranYi Dec 17, 2024


Background processes such as shrinking may call this function, so we don't want to use the foreground thread_pool in that case.

It seems we might need to pass a flag down to decide which thread pool to use here?

alexpyattaev (Author)


Maybe a better way is to make this function accept a thread pool reference as an argument, so the caller has to decide?
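
A hedged sketch of that suggestion (the function name and body are illustrative, not the PR's code): the helper takes the pool explicitly, and each call site decides which pool it passes.

// Sketch: caller supplies the pool, so foreground and background paths can
// route the same work onto different thread pools.
fn update_index_like(thread_pool: &rayon::ThreadPool, slots: &[u64]) -> u64 {
    use rayon::prelude::*;
    thread_pool.install(|| slots.par_iter().copied().sum())
}

// foreground call site: update_index_like(&self.thread_pool, &slots)
// background call site: update_index_like(&self.thread_pool_clean, &slots)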


Sounds reasonable to me.


-- a. foreground (store_unfrozen)

1. TX foreground processing always passes "Inline", so it will not use threads.
   store_cached_inline (tx processing)
    - no threads

2. individual account store via bank.store_account()
    - no threads; only one account at a time

3. bulk account stores via bank.store_accounts, e.g. epoch rewards, rents (no more)
    - may use threads

-- b. background (store_frozen)
    - store ancient --- we can use "clean_pool" too
    - shrink --- in this case, higher up the call chain, `shrink_candidate_slots()` already uses thread_pool_clean
    - flush cache --- we can use "clean_pool" too

So we can use "clean_pool" for (b), and for a.3 we want to use a different pool (see the sketch below).
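
A hypothetical sketch of that split (the struct, enum, and field names are illustrative only): background paths share the clean pool, the bulk foreground store gets its own pool, and the inline paths stay on the caller's thread.

// Hypothetical routing of store paths to pools; names are illustrative.
struct AccountsDbLike {
    thread_pool_foreground: rayon::ThreadPool,
    thread_pool_clean: rayon::ThreadPool,
}

enum StorePath {
    CachedInline,   // a.1 / a.2: no threading needed
    ForegroundBulk, // a.3: epoch rewards, rents
    Background,     // b: store ancient, shrink, flush cache
}

impl AccountsDbLike {
    fn pool_for(&self, path: StorePath) -> Option<&rayon::ThreadPool> {
        match path {
            StorePath::CachedInline => None, // run inline on the caller's thread
            StorePath::ForegroundBulk => Some(&self.thread_pool_foreground),
            StorePath::Background => Some(&self.thread_pool_clean),
        }
    }
}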

alexpyattaev (Author)


In the current code there appear to be only two paths to update_index: one that sets the threading policy to single-threaded, and one from store_frozen. Based on this, I've set the function to default to thread_pool_clean.


I think we don't want a.3 to end up using thread_pool_clean. a.3 is on the critical path for foreground processing.

@behzadnouri

We did something like this before and it caused crashes because of the thread overflowing its stack:
solana-labs#24954
solana-labs#25481
solana-labs#25053
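
For context, those crashes were per-thread stack overflows. If that resurfaces when work moves between pools, one knob rayon exposes when building a dedicated pool is the per-thread stack size; a sketch only, not something this PR changes, with the thread count and stack size chosen arbitrarily here:

// Sketch: a dedicated pool built with a larger per-thread stack, one possible
// mitigation if deep recursion overflows the default worker stack.
fn build_pool_with_big_stack() -> rayon::ThreadPool {
    rayon::ThreadPoolBuilder::new()
        .num_threads(8)
        .stack_size(8 * 1024 * 1024) // 8 MiB per worker thread
        .build()
        .expect("failed to build pool")
}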

@HaoranYi self-requested a review December 17, 2024 01:02
@alessandrod

We did something like this before and it caused crashes because of the thread overflowing its stack: solana-labs#24954 solana-labs#25481 solana-labs#25053

Something like this what? The accountsdb code is already using dedicated pools in master today, there are only a couple of things that run almost never that seem to have been accidentally left in the global pool.

@alessandrod

It looks like the issue is solana-labs#25053, which unfortunately doesn't include the panic.

If moving something from pool A to pool B causes a panic, it sounds like that's the thing that should be investigated and fixed?

@alexpyattaev (Author)

I have this version running on an unstaked mainnet validator for several hours with no observable problems. The issues Behzad mentioned did not materialize. Which additional tests are needed to make sure this will not blow up in prod?

@alexpyattaev force-pushed the ban_default_rayon_pool branch from 38b6afa to 7dbe6b5 on December 18, 2024 10:28
Also same for exhaustively_verify_refcounts
@alexpyattaev force-pushed the ban_default_rayon_pool branch from 7dbe6b5 to 0ae42c9 on December 18, 2024 11:00
@alexpyattaev marked this pull request as ready for review December 18, 2024 11:01