softban global rayon thread pool #4139
base: master
Conversation
accounts-db/src/accounts_db.rs (Outdated)
dashmap::mapref::entry::Entry::Occupied(mut occupied_entry) => {
    if !occupied_entry.get().iter().any(|s| s == &slot) {
        occupied_entry.get_mut().push(slot);
self.thread_pool.install(|| {
exhaustively_verify_refcounts() is test/debug-only code. IMO I wouldn't change this.
Maybe we can use the background hash pool for this?
If this function is not running in normal modes, maybe it should have its own pool sized appropriately for its work (all cores, low priority)?
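For reference, a minimal sketch of what such a dedicated pool could look like, using rayon's ThreadPoolBuilder. The helper name, thread name, and sizing are illustrative assumptions, not this PR's code; "low priority" is not expressible in rayon itself and would need OS-level thread-priority tuning.

```rust
use rayon::{ThreadPool, ThreadPoolBuilder};

// Hypothetical: a dedicated pool for exhaustively_verify_refcounts(),
// sized to all cores as suggested above.
fn build_verify_refcounts_pool() -> ThreadPool {
    ThreadPoolBuilder::new()
        .num_threads(num_cpus::get())
        .thread_name(|i| format!("solVerifyRefs{i:02}"))
        .build()
        .expect("new rayon threadpool")
}
```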
Moved this to thread_pool_clean as well.
accounts-db/src/accounts_db.rs (Outdated)
})
.flatten()
.collect::<Vec<_>>()
self.thread_pool.install(|| {
We should not hardcode the foreground threadpool here, as this can be called from both foreground and background tasks.
Background processes such as shrinking may call this function, so we don't want to use the foreground thread_pool in that case.
It seems we might need to pass a flag down to decide which thread pool to use here?
Maybe a better way is to make this function accept a thread-pool reference as an argument, so the caller has to decide?
Sounds reasonable to me.
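A hedged sketch of that suggestion (the helper name and slot type below are made up for illustration): the parallel body just runs on whichever pool the caller hands in. A background caller like shrink could pass `&self.thread_pool_clean`, while a foreground store path would pass its own pool.

```rust
use rayon::{prelude::*, ThreadPool};

// Hypothetical helper: the caller decides which pool the parallel work runs on.
fn update_index_on(pool: &ThreadPool, slots: &[u64]) {
    pool.install(|| {
        slots.par_iter().for_each(|slot| {
            // ... update the index entry for `slot` ...
            let _ = slot;
        });
    });
}
```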
-- a. foreground (store_unfrozen)
  1. TX foreground processing always passes "Inline", so it will not use threads:
     store_cached_inline (tx processing) - no threads.
  2. Individual account store via bank.store_account() - no threads; only one account at a time.
  3. Bulk account stores via bank.store_accounts(), e.g. epoch rewards, rents (no more) - may use threads.
-- b. background (store_frozen)
  - store ancient --- we can use "clean_pool" too.
  - shrink --- in this case, higher up in the call chain, `shrink_candidate_slots()` already uses thread_pool_clean.
  - flush cache --- we can use "clean_pool" too.
So we can use "clean_pool" for (b), and for a.3 we want to use a different pool (see the sketch below).
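One hedged way to encode that breakdown (the enum and names below are hypothetical, not this PR's actual API) is to thread a pool selector down from the store path, so a.1/a.2 stay inline, a.3 gets a foreground pool, and the background paths (b) reuse the clean pool:

```rust
use rayon::ThreadPool;

// Hypothetical selector mirroring the breakdown above.
enum StorePool<'a> {
    Inline,                      // a.1 / a.2: no extra threads
    Foreground(&'a ThreadPool),  // a.3: bulk stores on the foreground path
    Clean(&'a ThreadPool),       // b: ancient / shrink / flush reuse the clean pool
}

fn run_store_work(pool: &StorePool<'_>, work: impl FnOnce() + Send) {
    match pool {
        StorePool::Inline => work(),
        StorePool::Foreground(p) | StorePool::Clean(p) => p.install(work),
    }
}
```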
In the current code there appear to be only two paths to update_index: one that sets the threading policy to single-threaded, and the one from store_frozen. Based on this, I've set the function to default to thread_pool_clean.
I think we don't want a.3 to end up using thread_pool_clean. a.3 is on the critical path for foreground processing.
We did something like this before and it caused crashes because of the thread overflowing its stack:
Something like this what? The accounts-db code is already using dedicated pools in master today; there are only a couple of things that almost never run that seem to have been accidentally left in the global pool.
It looks like the issue is solana-labs#25053, which unfortunately doesn't include the panic. If moving something from pool A to pool B causes a panic, it sounds like that's the thing that should be investigated and fixed?
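If that earlier crash really was a worker thread overflowing its stack, one hedged mitigation (not necessarily what solana-labs#25053 needed) is to set the stack size explicitly on the dedicated pool rather than relying on whichever pool's default the work lands on; the numbers below are illustrative only.

```rust
fn main() {
    // Sketch only: rayon lets each pool pick its worker stack size explicitly.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(8)
        .stack_size(8 * 1024 * 1024) // 8 MiB per worker thread (illustrative)
        .build()
        .expect("new rayon threadpool");
    pool.install(|| {
        // ... deep-recursion or large-stack-frame work ...
    });
}
```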
I have this version running on an unstaked mainnet validator for several hours with no observable problems. The issues mentioned by Behzad did not materialize. Which additional tests are needed to make sure this will not blow up in prod?
Force-pushed from 38b6afa to 7dbe6b5.
Also did the same for exhaustively_verify_refcounts.
Force-pushed from 7dbe6b5 to 0ae42c9.
Problem
Summary of Changes
Discussion is welcome! Forking/patching rayon could be considered, for example.
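Purely as an illustration of what a "softban" could mean if forking or patching rayon is off the table (this is an assumption about the intent, not this PR's mechanism): configure the global pool once at startup so any accidental use is small and easy to spot in thread dumps, while all real work goes through the dedicated accounts-db pools.

```rust
// Hypothetical sketch, not this PR's code: shrink the global rayon pool to one
// clearly named thread so stray uses stay visible and cheap.
fn softban_global_rayon_pool() {
    rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .thread_name(|i| format!("rayonGlobalDeprecated{i:02}"))
        .build_global()
        .expect("configure global rayon pool");
}
```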