clean all zero storages at index generation time #3196

HaoranYi · 2024-10-16T16:04:15Z

Problem

Split from #3188

We notice there are all zero account storages when we generate index at startup.

We should add them as candidates for "clean".

Summary of Changes

Add all zero account storages to clean when we generate index.
Add a stat for all zero account storages at start up.
~~Refactor to use local sum of stats before update the outer atomic stats~~

Fixes #

brooksprumo · 2024-10-16T21:45:16Z

Are there perf numbers for how this impacts a node?

How many stores are only accounts with zero lamports?

HaoranYi · 2024-10-16T22:03:17Z

Are there perf numbers for how this impacts a node?

How many stores are only accounts with zero lamports?

This has been a issue for a while, especially with long running skip rewrite nodes,
see #2880 for the discussion.

I will start a node to gather the stats.

HaoranYi · 2024-10-16T22:12:56Z

Note that this PR it self doesn't fully solve the problem.

#2880 (comment)

But this is a good step to the solution.

To fully solve the problem, we will need "shrink" PR, which handles single slot one ref zero accounts.

#3188

jeffwashington · 2024-10-16T22:20:13Z

@HaoranYi check this out, too.
https://github.com/anza-xyz/agave/pull/2880/files

I added something here to force us to shrink these all zero lamport account storages. This work is so old I've lost my memory of how well this worked or didn't. I think without the full solution, it isn't really helpful to run clean unless the entire storage is zero lamport. However, it seems highly unlikely with the code we have today that we'll end up with an entire slot being zero lamport accounts. It definitely is orthogonal to the other pr, so this should definitely be split out and put in or closed independently.

Also, seems like we may want a metric on how many of these we actually found at startup. I bet the answer is 0 today.

HaoranYi · 2024-10-16T23:55:31Z

The problem we have seen is that with long running skip rewrites, we accumulated all zero lamport storages up to 16K.

The first time, I saw this problem happens, was before we fixed shrink to remove single ref zero lamport accounts.

Maybe we don't have this problem now after the fix.

But, PR2880, motivated from the above problem, is a good fix to have anyway, in my opinion.
And it tries to fix left over zero accounts at startup. Therefore, I think this fix is generally a good fix to have. It will also protect us from malicious attacks - flooding zero lamport storages at start up.

@jeffwashington Your are right that the additional force shrink in this commit a5d6aac is the other import piece to fix the issue.

As said in the comments #2880 (comment),
Without "force shrink", we are unable to remove those dead slots by just adding them to clean.
And with "force shrink", we successfully removed 16K dead slots.

I have written up an analysis and a theory for the problem before on slack to explain why we need "shrink" as well as the "clean" in this PR.

Now I repost it here.

Consider the following example.

Here is how the clean algorithm works.
At startup, slot 100 have all zeros, inserted into clean and kick off clean.

clean starts:
- populate candidate keys (X, Y, Z, W, U) and candidate slots (200, 100).
- first reclaim for non-zero keys X/100
- However, X/100 can't be dropped because 100's total count = 5
  - it will add 100 to shrink candidates
- in calc dependency, we remove 100 from count_map and later visit to
  handle_claim for all zeros keys (Y-U) won't visit 100 because it is not found
  in count_map (unfortunately)
Shrink starts
- look at candidate 100, but its shrink live ratio is not below 80%. So it was skipped.

In the end 100 wasn't cleaned and won't be visited again unless X was rewritten. With skip rewrite and X being cold, slot 100 will unlikely to be touched again.

The new commit that force shrinking all zero slots, will let shrink work on 100 regardless. And thanks for the recent handling of all zeros added to shrink, shrink will see that all accounts in 100 are dead or single_ref_zero_accounts, and it can be dropped. hooray!

This explains what happens for slot 276499184 too.
https://anza-xyz.slack.com/archives/C0735T9CVPU/p1726063943441339

Where the first handle_reclaim runs, it only sees 5 accounts are dead out of 545. All
the rest are live zeros (single-ref zeros). Send it to shrink, which then skip it because of live ratio is
too low.

force shrink solves the problem to be working because it bypass the shrink productive check and shrink them anyway.

And the change to keep track of single ref zero lamports (#3188) will address the shrinking part of this problem." With #3188, shrink will know that Y-U are single-ref-zero accounts and can be shrunk. This make slot 100 to be qualifed for shrinking, which has the same effect as "force shrinking".

In summary, we need both clean fix (this PR) and the shrink fix (either #3188 or force shrink a5d6aac) to completely remove those zeros at startup.

HaoranYi · 2024-10-16T23:57:30Z

My node restart with this PR crashed ...
time to more debuging

[2024-10-16T22:15:42.414152173Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solAccountsLo10" one=1i message="panicked at accounts-db/src/account_storage.rs:80:9:
    self.no_shrink_in_progress(): 296018580" location="accounts-db/src/account_storage.rs:80:9" version="2.1.0 (src:022d218e; feat:3176011155, client:Agave)"

accounts-db/src/accounts_db.rs

HaoranYi · 2024-10-17T21:29:12Z

I restarted my node.
Unfortunately, because the node was offline for too long. It started from a downloaded snapshot. So it doesn't actually test the fix.

But the good news is that there is no all zero slots from mnt snapshot.

[2024-10-17T20:08:28.436858014Z
INFO
solana_metrics::metrics]
...
all_accounts_are_zero_lamports_slots=0i (***)

HaoranYi · 2024-10-17T22:39:02Z

I tested it again with a restart on another node that's running skip rewrites. This time it finds 398 all zero slots at startup, and didn't crash.

[2024-10-17T22:26:46.145469392Z
INFO
solana_metrics::metrics]
datapoint:
...
all_accounts_are_zero_lamports_slots=398i (***)

HaoranYi · 2024-10-21T15:04:19Z

I made a few more commits to refactor the statup stats, i.e. accumulate the stat for a chunk locally then update the atomic outer stats.

brooksprumo

Code for handling all-zeros looks good to me

accounts-db/src/accounts_db.rs

brooksprumo

HaoranYi · 2024-10-22T19:17:17Z

Rebase and fix conflicts after #3265.
There should be no functional changes.

jeffwashington

lgtm

* clean all zero storeage at index gen time * add stats * wait to add zero slot to clean after index flush * pr * remove log * fix rebase --------- Co-authored-by: HaoranYi <[email protected]>

HaoranYi changed the title ~~clean all zero storeage at index gen time~~ clean all zero storage at index generation time Oct 16, 2024

HaoranYi force-pushed the clean_all_zeros_at_index_gen branch from af2789a to 42ea974 Compare October 16, 2024 16:07

HaoranYi mentioned this pull request Oct 16, 2024

track zero lamport accounts #3188

Merged

HaoranYi changed the title ~~clean all zero storage at index generation time~~ clean all zero storage at index generation time (wip) Oct 16, 2024

HaoranYi changed the title ~~clean all zero storage at index generation time (wip)~~ clean all zero storages at index generation time (wip) Oct 16, 2024

HaoranYi changed the title ~~clean all zero storages at index generation time (wip)~~ clean all zero storages at index generation time Oct 16, 2024

HaoranYi requested review from jeffwashington, brooksprumo and dmakarov October 16, 2024 20:42

dmakarov previously approved these changes Oct 16, 2024

View reviewed changes

HaoranYi dismissed dmakarov’s stale review via 5de8adf October 17, 2024 18:55

HaoranYi force-pushed the clean_all_zeros_at_index_gen branch from 5de8adf to 4e53cf6 Compare October 17, 2024 19:04

brooksprumo reviewed Oct 17, 2024

View reviewed changes

accounts-db/src/accounts_db.rs Outdated Show resolved Hide resolved

accounts-db/src/accounts_db.rs Outdated Show resolved Hide resolved

accounts-db/src/accounts_db.rs Outdated Show resolved Hide resolved

accounts-db/src/accounts_db.rs Show resolved Hide resolved

HaoranYi force-pushed the clean_all_zeros_at_index_gen branch from f117a23 to 087bd91 Compare October 21, 2024 14:42

HaoranYi requested review from brooksprumo and dmakarov October 22, 2024 14:25

brooksprumo reviewed Oct 22, 2024

View reviewed changes

accounts-db/src/accounts_db.rs Outdated Show resolved Hide resolved

HaoranYi force-pushed the clean_all_zeros_at_index_gen branch from 087bd91 to d7938e9 Compare October 22, 2024 16:38

HaoranYi requested a review from brooksprumo October 22, 2024 16:40

brooksprumo previously approved these changes Oct 22, 2024

View reviewed changes

dmakarov previously approved these changes Oct 22, 2024

View reviewed changes

HaoranYi added 5 commits October 22, 2024 19:03

clean all zero storeage at index gen time

cdb4c79

add stats

c86b5a6

wait to add zero slot to clean after index flush

a24e158

pr

c325e55

remove log

4e0c76c

HaoranYi dismissed stale reviews from dmakarov and brooksprumo via 4e0c76c October 22, 2024 19:08

HaoranYi force-pushed the clean_all_zeros_at_index_gen branch from d7938e9 to 4e0c76c Compare October 22, 2024 19:08

fix rebase

47a9146

HaoranYi requested review from dmakarov and brooksprumo October 22, 2024 19:17

jeffwashington approved these changes Oct 22, 2024

View reviewed changes

dmakarov approved these changes Oct 22, 2024

View reviewed changes

brooksprumo approved these changes Oct 22, 2024

View reviewed changes

HaoranYi added the automerge automerge Merge this Pull Request automatically once CI passes label Oct 22, 2024

mergify bot merged commit de1940f into anza-xyz:master Oct 22, 2024
41 checks passed

HaoranYi deleted the clean_all_zeros_at_index_gen branch October 22, 2024 21:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clean all zero storages at index generation time #3196

clean all zero storages at index generation time #3196

HaoranYi commented Oct 16, 2024 •

edited

Loading

brooksprumo commented Oct 16, 2024

HaoranYi commented Oct 16, 2024

HaoranYi commented Oct 16, 2024

jeffwashington commented Oct 16, 2024 •

edited

Loading

HaoranYi commented Oct 16, 2024 •

edited

Loading

HaoranYi commented Oct 16, 2024 •

edited

Loading

HaoranYi commented Oct 17, 2024 •

edited

Loading

HaoranYi commented Oct 17, 2024 •

edited

Loading

HaoranYi commented Oct 21, 2024 •

edited

Loading

brooksprumo left a comment

brooksprumo left a comment

HaoranYi commented Oct 22, 2024

jeffwashington left a comment

clean all zero storages at index generation time #3196

clean all zero storages at index generation time #3196

Conversation

HaoranYi commented Oct 16, 2024 • edited Loading

Problem

Summary of Changes

brooksprumo commented Oct 16, 2024

HaoranYi commented Oct 16, 2024

HaoranYi commented Oct 16, 2024

jeffwashington commented Oct 16, 2024 • edited Loading

HaoranYi commented Oct 16, 2024 • edited Loading

HaoranYi commented Oct 16, 2024 • edited Loading

HaoranYi commented Oct 17, 2024 • edited Loading

HaoranYi commented Oct 17, 2024 • edited Loading

HaoranYi commented Oct 21, 2024 • edited Loading

brooksprumo left a comment

Choose a reason for hiding this comment

brooksprumo left a comment

Choose a reason for hiding this comment

HaoranYi commented Oct 22, 2024

jeffwashington left a comment

Choose a reason for hiding this comment

HaoranYi commented Oct 16, 2024 •

edited

Loading

jeffwashington commented Oct 16, 2024 •

edited

Loading

HaoranYi commented Oct 16, 2024 •

edited

Loading

HaoranYi commented Oct 16, 2024 •

edited

Loading

HaoranYi commented Oct 17, 2024 •

edited

Loading

HaoranYi commented Oct 17, 2024 •

edited

Loading

HaoranYi commented Oct 21, 2024 •

edited

Loading