Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't load executable accounts for transactions #180

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

seanyoung
Copy link

@seanyoung seanyoung commented Mar 11, 2024

Problem

There is no need for loading executable accounts for transactions, if the account is already present in the program cache.

Summary of Changes

  • The program cache needs to track the size, rent epoch and lamports for the programs.
  • If an account is not loaded, then we pad the account size with 0s. This is because otherwise transactions use less compute units, and then transactions which fail on mainnet may suddenly pass in our validator. This makes checking for divergence possible.
  • There are a few programs running on mainnet which read the contents of the executable accounts. There is a an exception list for these accounts. Note that all executable accounts in that instruction needs to be loaded, since we do not know ahead of time which account will be read.
  • All programs currently in this list are upgradable
  • This code has been running for months in a mainnet validator without divergence. However, the code has been re-written a number of times - I completely changed the approach a number of times. The current version of the code has been running for about 3 weeks now.
  • I intend to let the validator with these code changes run continuously until the feature is activated

Future:

@seanyoung seanyoung force-pushed the dont-load-executables branch 6 times, most recently from fd06abf to e834852 Compare March 14, 2024 18:50
@codecov-commenter
Copy link

codecov-commenter commented Mar 14, 2024

Codecov Report

Attention: Patch coverage is 93.81107% with 19 lines in your changes are missing coverage. Please review.

Project coverage is 81.9%. Comparing base (cc3afa5) to head (2016e3a).

Additional details and impacted files
@@           Coverage Diff            @@
##           master     #180    +/-   ##
========================================
  Coverage    81.9%    81.9%            
========================================
  Files         837      837            
  Lines      226823   227064   +241     
========================================
+ Hits       185828   186091   +263     
+ Misses      40995    40973    -22     

@seanyoung seanyoung force-pushed the dont-load-executables branch 2 times, most recently from b162f7b to 2016e3a Compare March 18, 2024 09:11
@seanyoung seanyoung requested a review from Lichtso March 18, 2024 09:19
@seanyoung seanyoung marked this pull request as ready for review March 18, 2024 09:19
@Lichtso
Copy link

Lichtso commented Mar 20, 2024

Can you rebase it onto master and resolve the conflicts?

@seanyoung seanyoung force-pushed the dont-load-executables branch 2 times, most recently from 7175ff1 to 45a967b Compare March 21, 2024 10:58
@seanyoung
Copy link
Author

Can you rebase it onto master and resolve the conflicts?

It's been rebased

@seanyoung seanyoung force-pushed the dont-load-executables branch from 45a967b to 5c34e3c Compare March 22, 2024 09:05
@alessandrod
Copy link

alessandrod commented Mar 26, 2024

I only skimmed through the changes haven't done a proper review, but here's some initial comments since this ties to another perf issue I've been investigating.

There are a few programs running on mainnet which read the contents of the executable accounts. There is a an exception list for these accounts. Note that all executable accounts in that instruction needs to be loaded, since we do not know ahead of time which account will be read.

I don't think we can do this. It is acceptable to break those programs, but I don't think it's acceptable to have an exception list, because it's exploitable: there could be a dormant malicious program, not in the exception list, waiting for this feature to be enabled to trigger a consensus issue by accessing an executable program. EDIT: the PR adds a new feature gate, so this is fine.

Anyway, what brought me here is that I've been investigating account_matches_owners interacting with the accounts-db read cache, and causing a lot of major page faults (hitting disk).

I think the original approach of having separate filter_executable_program_accounts/load_accounts is not ideal. We introduced account_matches_owners() to "peek" into the accounts-db before we load the non-executable accounts, which effectively forces us to do two round trips into the accountsdb. We first make accounts-db iterate over all accounts in a message, then do another round trip where we make it iterate over all the non-executable accounts.

account_matches_owners() interacts with the internal caches in the accountsdb, pages in parts of the accounts index, pages in parts of appendvecs, etc, so this is not great perf wise.

Instead, I think we could have a load_accounts_with_filter(account_keys, |...| { ... }) and in the filter build the list of program accounts, and return a value to signal to the accounts-db to not load nor insert an account into its caches. What do you think?

@alessandrod
Copy link

With an upcoming fix for the accounts-db to avoid thrashing the readonly cache, account_matches_owners becomes the largest offender in tx execution, causing a lot of pagecache faults (and sleeps):

Screen Shot 2024-03-27 at 10 05 23 am

Copy link

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which SIMDs do these feature gates implement?

@@ -785,6 +785,14 @@ pub mod deprecate_unused_legacy_vote_plumbing {
solana_sdk::declare_id!("6Uf8S75PVh91MYgPQSHnjRAPQq6an5BDv9vomrCwDqLe");
}

pub mod dont_load_executable_accounts {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't introduce two feature gates in the same pr. this is indicative of a change set that should be split up (as is the line change total)

@Lichtso
Copy link

Lichtso commented Mar 27, 2024

account_matches_owners becomes the largest offender in tx execution, causing a lot of pagecache faults (and sleeps)

That is because it is the first place to touch all the accounts of a TX.
If we were to remove the account_matches_owners() from filter_executable_program_accounts() then load_accounts() would in turn become the place that causes all of these, right?

But I agree that filter_executable_program_accounts(), replenish_program_cache() and load_accounts() should be refactored. IMO it is sufficient to only check the program cache in filter_executable_program_accounts() and swap the order of load_accounts() and replenish_program_cache(). I will try this out and PR it.

@alessandrod
Copy link

If we were to remove the account_matches_owners() from filter_executable_program_accounts() then load_accounts() would in turn become the place that causes all of these, right?

Nope because account_matches_owners explicitly skips parts of the accounts-db cache (but not all)

But I agree that filter_executable_program_accounts(), replenish_program_cache() and load_accounts() should be refactored. IMO it is sufficient to only check the program cache in filter_executable_program_accounts() and swap the order of load_accounts() and replenish_program_cache(). I will try this out and PR it.

It isn't because you still don't want executable accounts to occupy space in the cache if they're going to be cached elsewhere (the program cache). Since you never want load_accounts(executable) to cache, accounts-db needs to be directly involved, hence the load_accounts_with_filter() suggestion.

@Lichtso
Copy link

Lichtso commented Mar 27, 2024

I think we have roughly the same thing in mind. What I suggest is three phases:

  1. Call ProgramCache::extract() once with a list of all message accounts. The result is a list of keys which are not in the global program cache, either because they are not loaded or because they are not programs / owned by a loader.
  2. Pass that list into load_accounts() and load_transaction_accounts() where we then check the owner to eliminate all entries from the list which are not owned by a loader (e.g. by account_matches_owners()). This depends on account_matches_owners() being fixed to not load the entire account data. Entries which were found in the global program cache are not touched here.
  3. At that point we are left with a list that needs to be loaded. So replenish_program_cache() does its normal thing but uses an account read which does not insert entries in the accounts-db.

Still leaves us with the original issue of instruction accounts in CPI that this PR is trying to solve.

in the filter build the list of program accounts

Happens in phase 2.

and return a value to signal to the accounts-db to not load nor insert an account into its caches.

Happens in phase 3.


Another thing I noticed while tinkering on this is that account_shared_data_from_program() (which is used in load_transaction_accounts() and load_accounts()) does currently require program_accounts to resolve the account owner. That is for programs which are in the global program cache as opposed to the proposed phase 2, where we need the owner of accounts which are not in the global program cache. It can be solved by refactoring the LoadedProgram entries to split LoadedProgramType and the account owner (which is cleaner anyway) and use that to get the owner, instead of asking the accounts-db.

@seanyoung
Copy link
Author

seanyoung commented Mar 27, 2024

  1. At that point we are left with a list that needs to be loaded. So replenish_program_cache() does its normal thing but uses a uncached account read.

It is not clear to me:

  • fn load_program_accounts() uses callbacks.get_account_shared_data(pubkey), which is just a regular account read. How can we make this a "do not keep this in the cache" type of read?
  • For program account which appear as instruction accounts, we do want the account to be cached, else we have to read them again
  • Calling account_matches_owners() when we are going to call get_account_shared_data() anyway seems a waste of cycles and iops.

@Lichtso
Copy link

Lichtso commented Mar 27, 2024

fn load_program_accounts() uses callbacks.get_account_shared_data(pubkey), which is just a regular account read. How can we make this a "do not keep this in the cache" type of read?

Have to check the interfaces if such a method exists already, don't know.

For program account which appear as instruction accounts, we do want the account to be cached, else we have to read them again.

load_program_accounts() happens after load_accounts() so by that time the instruction accounts will already by cached anyway.

Calling account_matches_owners() when we are going to call get_account_shared_data() anyway seems a waste of cycles and iops.

Agreed, we can also check the owner of the loaded accounts.

@alessandrod
Copy link

I think we have roughly the same thing in mind. What I suggest is three phases:

I'm not sure that they are the same. Can you elaborate what's the advantage of doing this - which requires account_matches_owners which is inherently "look before you leap" - vs something like load_accounts_with_filter(), which would allow account_matches_owners to be removed altogether and do only one trip inside the accounts db, reducing locks, index lookups etc?

if you do if !account_matches_owners(key, owners) { load_account(key) } that's more work than if you do load_accounts(account_keys, filter) and filter() is invoked with the account. First case you have to find the account twice, second case once.

@Lichtso
Copy link

Lichtso commented Mar 28, 2024

My proposal does not need to use account_matches_owners(), that part is orthogonal.
I was more concerned about not touching the accounts-db at all for programs we have in the global cache, while you are aiming to touch each account only once, if we have to touch it. And both things can be combined.

What exactly is the purpose of the filter callback you have in mind, and why wouldn't we just hard-code a check against PROGRAM_OWNERS instead?

Copy link

mergify bot commented Dec 30, 2024

The Firedancer team maintains a line-for-line reimplementation of the
native programs, and until native programs are moved to BPF, those
implementations must exactly match their Agave counterparts.
If this PR represents a change to a native program implementation (not
tests), please include a reviewer from the Firedancer team. And please
keep refactors to a minimum.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants