
Change read-only-cache to be keyed on pubkey only #576

Merged
HaoranYi merged 2 commits into anza-xyz:master from accounts_db/read_cache_opt on Apr 12, 2024

Conversation

HaoranYi commented Apr 4, 2024

Problem

Keying the readonly cache on (pubkey, slot) can lead to caching stale
entries and inefficient use of the cache.

By design, the readonly cache is shadowed by the write cache for the fork
tips. Because of this, it is very unlikely that we need more than one entry
per pubkey in the cache. An extra entry is a waste: it holds data from a
stale slot and will only be evicted over time.

Also, the extra slot in the key increases the size of each element in the LRU queue.

Summary of Changes

Rework the readonly cache to be keyed on pubkey only.
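
As an illustration, here is a minimal sketch of the keying change, using std `HashMap` and placeholder types rather than the actual `ReadOnlyAccountsCache` internals (the real cache also tracks an LRU queue and data size):

```rust
use std::collections::HashMap;

// Placeholder types standing in for the real Pubkey/Slot/account types.
type Pubkey = [u8; 32];
type Slot = u64;
type AccountSharedData = Vec<u8>;

// Before: keyed on (pubkey, slot), so one pubkey can accumulate
// entries from several slots, most of them stale.
struct OldCache {
    map: HashMap<(Pubkey, Slot), AccountSharedData>,
}

// After: keyed on pubkey only; the slot moves into the value and is
// verified on load.
struct CacheEntry {
    slot: Slot,
    account: AccountSharedData,
}

struct NewCache {
    map: HashMap<Pubkey, CacheEntry>,
}
```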

Fixes #

@HaoranYi changed the title from "change read-only-cache to keyed on pubkey only" to "Change read-only-cache to keyed on pubkey only" on Apr 4, 2024
HaoranYi (Author) commented Apr 4, 2024

read_only_accounts_cache_entries=31822i read_only_accounts_cache_uniq_keys=30890i read_only_accounts_cache_max_dups=2i
read_only_accounts_cache_entries=31709i read_only_accounts_cache_uniq_keys=30820i read_only_accounts_cache_max_dups=2i
read_only_accounts_cache_entries=30828i read_only_accounts_cache_uniq_keys=30052i read_only_accounts_cache_max_dups=2i
read_only_accounts_cache_entries=30417i read_only_accounts_cache_uniq_keys=29720i read_only_accounts_cache_max_dups=2i

HaoranYi (Author) commented Apr 4, 2024

image

@HaoranYi changed the title from "Change read-only-cache to keyed on pubkey only" to "Change read-only-cache to be keyed on pubkey only" on Apr 4, 2024
jeffwashington commented

I don't understand the 4 log lines. What is the important data and difference?

jeffwashington commented

I'm concerned about getting the wrong data with various race conditions I could imagine. Using the slot enforces that we are getting the right version of each account.
Maybe: an account is read, so it goes into the read cache; the account is written, so it is wiped from the read cache; another fork reads the account and puts it back into the read cache; then the main fork's write cache is flushed in the background. After the index entry gets updated to the newly flushed slot, an entry is already in the read cache from the other fork at a different (older, newer, different fork?) slot than the one that just got flushed. So, that wrong one gets loaded.

We don't clear the read cache when we drop slots or forks. That's an issue for this scheme, I think. The index is updated correctly in those cases (I have to assume).

HaoranYi (Author) commented Apr 4, 2024

> I'm concerned about getting the wrong data with various race conditions I could imagine. Using the slot enforces that we are getting the right version of each account.

In this PR, we are still checking the slot. The slot is stored alongside the `AccountSharedData` in the value. When loading, we check both the pubkey (from the map) and the slot (from the value).
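
Continuing the hypothetical `NewCache` sketch from the PR description (placeholder types, not the actual implementation), the load path would look roughly like:

```rust
impl NewCache {
    fn load(&self, pubkey: &Pubkey, slot: Slot) -> Option<AccountSharedData> {
        let entry = self.map.get(pubkey)?;
        // The slot check moves from the key to the value: an entry
        // cached at a different slot is treated as a miss.
        (entry.slot == slot).then(|| entry.account.clone())
    }
}
```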

HaoranYi (Author) commented Apr 4, 2024

> I don't understand the 4 log lines. What is the important data and difference?

read_only_accounts_cache_entries records the number of entries in the cache, i.e. distinct (pubkey, slot) pairs.
read_only_accounts_cache_uniq_keys records the number of unique pubkeys in the cache.
read_only_accounts_cache_max_dups records the max number of slots per pubkey in the cache.

Take read_only_accounts_cache_entries=31709i read_only_accounts_cache_uniq_keys=30820i read_only_accounts_cache_max_dups=2i as an example. The cache stored 31,709 (pubkey, slot) pairs but only 30,820 unique pubkeys. Given that the max number of slots per pubkey is 2, there are 889 pubkeys that have two slots of data stored in the cache, and one of those two slots is stale data. Percentage-wise, that is around 3%.
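
For concreteness, here is how these three stats could be derived from a (pubkey, slot)-keyed map (a hypothetical helper, not the actual metrics code):

```rust
use std::collections::HashMap;

type Pubkey = [u8; 32];
type Slot = u64;

fn dup_stats<V>(map: &HashMap<(Pubkey, Slot), V>) -> (usize, usize, usize) {
    // Count how many distinct slots each pubkey holds in the cache.
    let mut slots_per_key: HashMap<Pubkey, usize> = HashMap::new();
    for (pubkey, _slot) in map.keys() {
        *slots_per_key.entry(*pubkey).or_insert(0) += 1;
    }
    let entries = map.len();             // read_only_accounts_cache_entries
    let uniq_keys = slots_per_key.len(); // read_only_accounts_cache_uniq_keys
    let max_dups = slots_per_key.values().copied().max().unwrap_or(0); // ..._max_dups
    (entries, uniq_keys, max_dups)
}
```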

@HaoranYi force-pushed the accounts_db/read_cache_opt branch from 8365e34 to 181a853 on April 9, 2024 20:23
codecov-commenter commented Apr 9, 2024

Codecov Report

Attention: Patch coverage is 91.66667%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 81.9%. Comparing base (c0be86d) to head (21952ae).
Report is 28 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #576   +/-   ##
=======================================
  Coverage    81.9%    81.9%           
=======================================
  Files         851      851           
  Lines      230475   230514   +39     
=======================================
+ Hits       188785   188818   +33     
- Misses      41690    41696    +6     

HaoranYi (Author) commented Apr 9, 2024

Rebased to master to pick up Brooks's background eviction changes.

  • replay load time
    image

  • hit/miss/evicts
    image

  • another view of hit/miss (log)
    image

In all the graphs, 4wp represents this PR and yh14 represents master.

jeffwashington commented

So in theory, writing an account would evict the pubkey from the read cache completely. The theory must be that there are accounts that are flushed to append vecs, read and put into the read cache, then written 100+ slots later, then flushed, then read from the newer slot and put into the read cache, where now the account is in the read cache twice (or more)?
@HaoranYi can you run mnb for a bit with some debugging code that tells us how many accounts are in the read-only cache multiple times, and how big the accounts that are in there multiple times are?
Would be great to know this pattern.

HaoranYi (Author) commented Apr 9, 2024

> So in theory, writing an account would evict the pubkey from the read cache completely. The theory must be that there are accounts that are flushed to append vecs, read and put into the read cache, then written 100+ slots later, then flushed, then read from the newer slot and put into the read cache, where now the account is in the read cache twice (or more)? @HaoranYi can you run mnb for a bit with some debugging code that tells us how many accounts are in the read-only cache multiple times, and how big the accounts that are in there multiple times are? Would be great to know this pattern.

Sure. I will run this experiment.

brooksprumo commented Apr 10, 2024

> So in theory, writing an account would evict the pubkey from the read cache completely. The theory must be that there are accounts that are flushed to append vecs, read and put into the read cache, then written 100+ slots later, then flushed, then read from the newer slot and put into the read cache, where now the account is in the read cache twice (or more)?

Why doesn't this "written 100+ slots later" cause the account to be removed from the read cache, though?

Is it because when we store to the write cache and remove from the read cache, we use the slot being stored to as a check into the read cache? Presumably this check in the read cache would always fail, and thus never remove old read cache entries. Maybe that's the crux of the change—that we can now correctly remove entries in the read cache that have been superseded by entries in the write cache?

HaoranYi (Author) commented Apr 10, 2024

> > So in theory, writing an account would evict the pubkey from the read cache completely. The theory must be that there are accounts that are flushed to append vecs, read and put into the read cache, then written 100+ slots later, then flushed, then read from the newer slot and put into the read cache, where now the account is in the read cache twice (or more)?
>
> Why doesn't this "written 100+ slots later" cause the account to be removed from the read cache, though?
>
> Is it because when we store to the write cache and remove from the read cache, we use the slot being stored to as a check into the read cache? Presumably this check in the read cache would always fail, and thus never remove old read cache entries. Maybe that's the crux of the change—that we can now correctly remove entries in the read cache that have been superseded by entries in the write cache?

Yes. Writing an account at slot 'x' won't remove the cached entry at slot 'y', so the stale cache entry at slot 'y' will only be removed once it becomes the LRU victim.
Also, having (pubkey, slot) as the cache key means that we need to keep both pubkey and slot in the LRU list too, which adds 8 bytes of unnecessary storage cost per element.

Keying on the pubkey alone ensures that we have at most one entry per pubkey. It also helps reduce the number of evictions: the stale element for a pubkey is replaced in place when a new update is inserted, which saves the cost of eviction in the current code base (as shown in the 2nd graph on the right). Of course, evictions matter less now thanks to the background evictions 😉
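
Continuing the same hypothetical sketch (placeholder types, not the actual code): store and remove now operate purely by pubkey, so an insert replaces any stale-slot entry in place:

```rust
impl NewCache {
    fn store(&mut self, pubkey: Pubkey, slot: Slot, account: AccountSharedData) {
        // Inserting by pubkey alone overwrites any entry left over from
        // a stale slot, so no separate eviction is needed for it.
        self.map.insert(pubkey, CacheEntry { slot, account });
    }

    fn remove(&mut self, pubkey: &Pubkey) -> Option<CacheEntry> {
        // Removal no longer needs a slot: whatever version is cached
        // for this pubkey is dropped.
        self.map.remove(pubkey)
    }
}
```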

brooksprumo left a comment

I like the idea of this change!

Should there be new tests added that ensure:

  1. storing in a new slot "removes" the old one?
  2. calling remove on a pubkey works regardless of its slot?
  3. calling load correctly checks slot?

Some (all?) of these may be covered by existing tests too.

HaoranYi (Author) commented Apr 10, 2024

Another aspect of this change is that it helps when we scale up the cache size.
With the current code, no matter how much cache space we allocate, the cache will always be full (i.e. account_size pegged at the max cache size), because stale data isn't evicted proactively; it stays in the cache until the cache is full and is then removed by LRU.

@HaoranYi HaoranYi marked this pull request as ready for review April 10, 2024 20:25
@HaoranYi force-pushed the accounts_db/read_cache_opt branch from 5f1395d to 21952ae on April 10, 2024 20:41
@@ -439,6 +450,10 @@ mod tests {
cache.evict_in_foreground();
assert_eq!(100 + per_account_size, cache.data_size());
assert!(accounts_equal(&cache.load(key1, slot).unwrap(), &account1));
// pass a wrong slot and check that load fails
assert!(cache.load(key1, slot + 1).is_none());
HaoranYi (Author) commented Apr 10, 2024

> 3. calling load correctly checks slot?

Added coverage.

// pass a wrong slot and check that load fails
assert!(cache.load(key1, slot + 1).is_none());
// insert another entry for slot+1, and assert only one entry for key1 is in the cache
cache.store(key1, slot + 1, account1.clone());
HaoranYi (Author) commented

> 1. storing in a new slot "removes" the old one?

Added coverage.

HaoranYi (Author) commented Apr 10, 2024

> 1. storing in a new slot "removes" the old one?

Added.

> 2. calling remove on a pubkey works regardless of its slot?

Already covered:
solana/accounts-db/src/read_only_accounts_cache.rs:468

> 3. calling load correctly checks slot?

Added.

@HaoranYi HaoranYi requested a review from jeffwashington April 10, 2024 20:49
@brooksprumo brooksprumo self-requested a review April 11, 2024 11:51
brooksprumo left a comment

:shipit:

Lgtm! Makes the read cache more efficient, and thus improves the time to load accounts.

@HaoranYi HaoranYi merged commit 518a5b7 into anza-xyz:master Apr 12, 2024
38 checks passed
HaoranYi (Author) commented

Got @jeffwashington's approval from Slack.

@HaoranYi HaoranYi deleted the accounts_db/read_cache_opt branch April 12, 2024 14:25