
clean all zero storages at index generation time #3196

Merged
merged 6 commits into anza-xyz:master from clean_all_zeros_at_index_gen on Oct 22, 2024

Conversation


@HaoranYi HaoranYi commented Oct 16, 2024

Problem

Split from #3188

We noticed that there are all-zero account storages (storages containing only zero-lamport accounts) when we generate the index at startup.

We should add them as candidates for "clean". A rough sketch of the idea is below.
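Here is a minimal sketch of that idea, assuming illustrative types and names rather than the actual agave internals:

```rust
use std::collections::HashSet;

type Slot = u64;

/// Stand-in for a storage entry; the real agave type differs.
struct StorageEntry {
    slot: Slot,
    account_lamports: Vec<u64>,
}

/// Collect every slot whose storage holds only zero-lamport accounts, so
/// the caller can queue those slots as clean candidates once index
/// generation finishes.
fn collect_all_zero_slots(storages: &[StorageEntry]) -> HashSet<Slot> {
    storages
        .iter()
        .filter(|storage| {
            !storage.account_lamports.is_empty()
                && storage.account_lamports.iter().all(|&lamports| lamports == 0)
        })
        .map(|storage| storage.slot)
        .collect()
}
```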

Summary of Changes

  • Add all-zero account storages to the clean candidates when we generate the index.
  • Add a stat for all-zero account storages at startup.
  • Refactor to accumulate a local sum of stats before updating the outer atomic stats (see the sketch after this list).
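The stats refactor follows a standard pattern; here is a minimal sketch with illustrative field names (the real stats struct in accounts_db.rs differs):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
struct GenerateIndexStats {
    accounts_scanned: AtomicU64,
    zero_lamport_accounts: AtomicU64,
}

/// Sum this chunk's stats in plain locals, then do a single atomic add per
/// counter, instead of touching the shared atomics once per account.
fn process_chunk(account_lamports: &[u64], outer: &GenerateIndexStats) {
    let mut scanned = 0u64;
    let mut zeros = 0u64;
    for &lamports in account_lamports {
        scanned += 1;
        if lamports == 0 {
            zeros += 1;
        }
    }
    outer.accounts_scanned.fetch_add(scanned, Ordering::Relaxed);
    outer.zero_lamport_accounts.fetch_add(zeros, Ordering::Relaxed);
}
```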

Fixes #

@HaoranYi HaoranYi changed the title clean all zero storeage at index gen time clean all zero storage at index generation time Oct 16, 2024
@HaoranYi HaoranYi force-pushed the clean_all_zeros_at_index_gen branch from af2789a to 42ea974 on October 16, 2024 16:07
@HaoranYi HaoranYi changed the title clean all zero storage at index generation time clean all zero storage at index generation time (wip) Oct 16, 2024
@HaoranYi HaoranYi changed the title clean all zero storage at index generation time (wip) clean all zero storages at index generation time (wip) Oct 16, 2024
@HaoranYi HaoranYi changed the title clean all zero storages at index generation time (wip) clean all zero storages at index generation time Oct 16, 2024
dmakarov previously approved these changes Oct 16, 2024
@brooksprumo

Are there perf numbers for how this impacts a node?

How many stores contain only accounts with zero lamports?

@HaoranYi (Author)

Are there perf numbers for how this impacts a node?

How many stores contain only accounts with zero lamports?

This has been an issue for a while, especially with long-running skip-rewrites nodes;
see #2880 for the discussion.

I will start a node to gather the stats.

@HaoranYi (Author)

Note that this PR itself doesn't fully solve the problem.

#2880 (comment)

But it is a good step toward the solution.

To fully solve the problem, we will also need the "shrink" PR, which handles single-slot one-ref zero accounts.

#3188


jeffwashington commented Oct 16, 2024

@HaoranYi check this out, too.
https://github.com/anza-xyz/agave/pull/2880/files

I added something here to force us to shrink these all-zero-lamport account storages. This work is so old that I've lost my memory of how well it worked or didn't. I think that without the full solution, it isn't really helpful to run clean unless the entire storage is zero-lamport. However, with the code we have today it seems highly unlikely that we'll end up with an entire slot of zero-lamport accounts. It is definitely orthogonal to the other PR, so this should definitely be split out and merged or closed independently.

Also, seems like we may want a metric on how many of these we actually found at startup. I bet the answer is 0 today.


HaoranYi commented Oct 16, 2024

The problem we have seen is that with long-running skip-rewrites nodes, we accumulated up to 16K all-zero-lamport storages.

The first time I saw this problem happen was before we fixed shrink to remove single-ref zero-lamport accounts.

Maybe we don't have this problem now, after that fix.

But #2880, motivated by the above problem, tries to fix leftover zero accounts at startup, so in my opinion it is generally a good fix to have anyway. It will also protect us from malicious attacks that flood us with zero-lamport storages at startup.

@jeffwashington You are right that the additional force shrink in commit a5d6aac is the other important piece needed to fix the issue.

As said in the comments (#2880 (comment)), without "force shrink" we are unable to remove those dead slots by just adding them to clean.
With "force shrink", we successfully removed 16K dead slots.

I previously wrote up an analysis on Slack with a theory for why we need the "shrink" fix as well as the "clean" fix in this PR.

I am reposting it here.

Consider the following example, where rows are accounts, columns are slots, and entries are lamport balances.

      slot 200   slot 100
X         1          0
Y                    0
Z                    0
W                    0
U                    0

Here is how the clean algorithm works.
At startup, slot 100 has all zeros, so it is inserted into the clean candidates and a clean kicks off.

  • clean starts:
    • populate the candidate keys (X, Y, Z, W, U) and candidate slots (200, 100).
    • first, reclaim X/100, the old version of the non-zero key X (whose latest version lives in slot 200).
    • However, the storage for slot 100 can't be dropped, because slot 100's total account count is 5 and only X is dead.
      • slot 100 is added to the shrink candidates instead.
    • in the delete-dependency calculation, we remove 100 from count_map, so the later visits to
      handle_reclaims for the all-zero keys (Y-U) won't visit 100, because it is no longer found
      in count_map (unfortunately).
  • shrink starts:
    • it looks at candidate 100, but 100's live ratio is not below 80%, so it is skipped.

In the end, slot 100 wasn't cleaned and won't be visited again unless X is rewritten. With skip rewrites and X being cold, slot 100 is unlikely to ever be touched again.

The new commit that force-shrinks all-zero slots will let shrink work on slot 100 regardless. And thanks to the recent handling of all-zeros added to shrink, shrink will see that every account in slot 100 is either dead or a single-ref zero account, so the slot can be dropped. Hooray!

This explains what happened to slot 276499184 too.
https://anza-xyz.slack.com/archives/C0735T9CVPU/p1726063943441339

There, when the first handle_reclaims runs, it sees that only 5 accounts out of 545 are dead. All
the rest are live zeros (single-ref zeros). The slot is sent to shrink, which then skips it because its
live ratio is too high (not below the shrink threshold).

Force shrink solves the problem because it bypasses the shrink productivity check and shrinks these slots anyway.
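As a hedged sketch of what that bypass looks like (illustrative names; the actual code in a5d6aac differs):

```rust
const SHRINK_RATIO: f64 = 0.80;

/// Decide whether a storage should be shrunk. An all-zero-lamport storage
/// bypasses the usual alive-ratio productivity check, because its zero
/// accounts still count as live and would otherwise keep the slot above
/// the threshold forever (as with slot 100 in the example above).
fn should_shrink(alive_bytes: u64, capacity: u64, all_zero_lamports: bool) -> bool {
    if all_zero_lamports {
        return true; // force shrink
    }
    (alive_bytes as f64) < (capacity as f64) * SHRINK_RATIO
}
```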

And the change to keep track of single-ref zero-lamport accounts (#3188) will address the shrink part of this problem. With #3188, shrink will know that Y-U are single-ref zero accounts and can be removed. This makes slot 100 qualify for shrinking, which has the same effect as force shrinking.

In summary, we need both the clean fix (this PR) and a shrink fix (either #3188 or the force shrink in a5d6aac) to completely remove those zeros at startup.


HaoranYi commented Oct 16, 2024

My node crashed when restarting with this PR ...
Time for more debugging.

[2024-10-16T22:15:42.414152173Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solAccountsLo10" one=1i message="panicked at accounts-db/src/account_storage.rs:80:9:
    self.no_shrink_in_progress(): 296018580" location="accounts-db/src/account_storage.rs:80:9" version="2.1.0 (src:022d218e; feat:3176011155, client:Agave)"


HaoranYi commented Oct 17, 2024

I restarted my node.
Unfortunately, because the node was offline for too long, it started from a downloaded snapshot, so it doesn't actually test the fix.

But the good news is that there are no all-zero slots in the mnt snapshot.

[2024-10-17T20:08:28.436858014Z INFO solana_metrics::metrics] ... all_accounts_are_zero_lamports_slots=0i (***)


HaoranYi commented Oct 17, 2024

I tested it again with a restart on another node that's running skip rewrites. This time it found 398 all-zero slots at startup, and it didn't crash.

[2024-10-17T22:26:46.145469392Z INFO solana_metrics::metrics] datapoint: ... all_accounts_are_zero_lamports_slots=398i (***)

@HaoranYi HaoranYi force-pushed the clean_all_zeros_at_index_gen branch from f117a23 to 087bd91 on October 21, 2024 14:42

HaoranYi commented Oct 21, 2024

I made a few more commits to refactor the startup stats, i.e. accumulate the stats for a chunk locally, then update the outer atomic stats.

@brooksprumo brooksprumo left a comment

Code for handling all-zeros looks good to me

@HaoranYi HaoranYi force-pushed the clean_all_zeros_at_index_gen branch from 087bd91 to d7938e9 on October 22, 2024 16:38
@HaoranYi HaoranYi requested a review from brooksprumo October 22, 2024 16:40
brooksprumo previously approved these changes Oct 22, 2024
@brooksprumo brooksprumo left a comment

:shipit:

dmakarov previously approved these changes Oct 22, 2024
@HaoranYi HaoranYi dismissed stale reviews from dmakarov and brooksprumo via 4e0c76c October 22, 2024 19:08
@HaoranYi HaoranYi force-pushed the clean_all_zeros_at_index_gen branch from d7938e9 to 4e0c76c on October 22, 2024 19:08
@HaoranYi (Author)

Rebased and fixed conflicts after #3265.
There should be no functional changes.

@jeffwashington jeffwashington left a comment

lgtm

@HaoranYi HaoranYi added the automerge label (Merge this Pull Request automatically once CI passes) Oct 22, 2024
@mergify mergify bot merged commit de1940f into anza-xyz:master Oct 22, 2024
41 checks passed
@HaoranYi HaoranYi deleted the clean_all_zeros_at_index_gen branch October 22, 2024 21:17
ray-kast pushed a commit to abklabs/agave that referenced this pull request Nov 27, 2024
* clean all zero storeage at index gen time

* add stats

* wait to add zero slot to clean  after index flush

* pr

* remove log

* fix rebase

---------

Co-authored-by: HaoranYi <[email protected]>