Adds StartingSnapshotStorages to AccountsHashVerifier #58
I haven't taken the time to look deeply at every code change. I did skim it and I don't think this is addressed.
Yeah, it's from recycling. This was an issue for all of fastboot initially. The fix is in agave/core/src/accounts_hash_verifier.rs, lines 54 to 57 in 3f9a7a5.
I don't know enough about recycling to do (1). Is it safe? Why was it added initially? Code spelunking seems to indicate that recreating the append vecs was an issue (maybe the underlying mmap?). Is that not an issue anymore? And for (2), that's what this PR does. (Except you cannot drop the storages until the subsequent bank snapshot POST is made.)
It is a mysterious area of the code. We could probably use some more expertise in it. Originally, append vecs were the way to store written accounts from tx processing. This model isn't used anymore. We also used multiple append vecs per slot, which isn't done anymore either. Maybe there is still a place for recycling. I do know it is often under suspicion and sometimes IS the reason some operation has a hole (like this one). It definitely adds complexity.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #58    +/-   ##
=======================================
  Coverage    81.8%   81.8%
=======================================
  Files         837     838     +1
  Lines      225922  225949    +27
=======================================
+ Hits       184955  184980    +25
- Misses      40967   40969     +2
Ok, I'll take a look into this. And this PR may be the recycling genesis: solana-labs#12885
@@ -54,7 +56,11 @@ impl AccountsHashVerifier {
 // To support fastboot, we must ensure the storages used in the latest POST snapshot are
 // not recycled nor removed early. Hold an Arc of their AppendVecs to prevent them from
 // expiring.
-let mut fastboot_storages = None;
+let mut fastboot_storages = match starting_snapshot_storages {
Ok, fastboot_storages already existed and is overwritten later in the existing code.
lgtm
Changes seem reasonable to me, we cache an Arc of the storages we used to load our bank from. By doing so it prevents them from being shrunk so long as we hold the Arc. We had already done this for snapshots we were creating, and this PR makes startup more similar to other code.
We have some hidden argument for skipping (or not) shrinking/cleaning. It seems we will now skip shrinking unless we are loading from archives - do we need to add any sort of conflict with these args & fastboot?
I think we'll be OK. By holding these Arcs, we prevent them from getting recycled. When/if shrink runs (startup or otherwise), I believe it'll create a new append vec for the shrunk results, instead of recycling one of these append vecs. @jeffwashington does that sound right?
This is correct. This change prevents recycling the append vecs used in fastboot.
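To make the "holding an Arc prevents recycling" argument concrete, here is a minimal sketch. It assumes a recycler that only reuses a storage when it holds the sole remaining reference. `Storage`, `Recycler`, and `try_recycle` are hypothetical names for illustration, not the agave types, and the real recycling logic may differ.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an account storage entry (not the agave type).
struct Storage {
    slot: u64,
}

/// Hypothetical recycler: a storage may be reused only when the pool holds
/// the sole remaining reference to it.
struct Recycler {
    pool: Vec<Arc<Storage>>,
}

impl Recycler {
    /// Removes and returns the first storage that nobody else references, if any.
    fn try_recycle(&mut self) -> Option<Arc<Storage>> {
        let idx = self
            .pool
            .iter()
            .position(|storage| Arc::strong_count(storage) == 1)?;
        Some(self.pool.remove(idx))
    }
}

/// Tries to recycle once while a fastboot-style holder keeps an Arc,
/// and once after that holder is dropped. Returns the recycled slots.
fn recycle_with_and_without_holder() -> (Option<u64>, Option<u64>) {
    let storage = Arc::new(Storage { slot: 42 });
    let mut recycler = Recycler {
        pool: vec![Arc::clone(&storage)],
    };

    // Analogous to `fastboot_storages`: an extra Arc kept alive on purpose.
    let fastboot_storages = Some(vec![storage]);
    let while_held = recycler.try_recycle().map(|s| s.slot);

    // Once the holder lets go, the storage becomes eligible for recycling.
    drop(fastboot_storages);
    let after_drop = recycler.try_recycle().map(|s| s.slot);

    (while_held, after_drop)
}
```

Under these assumptions, the storage is skipped while the extra Arc exists and is only handed out for reuse after the holder is dropped.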
This CLI arg causes the validator to start first, then do the initial clean and then shrink in the background. The previous behavior was to wait to start the validator (replay, turbine) until the initial clean and shrink completed.
Problem
When starting up from fastboot, if shrink runs before taking the next bank snapshot, then snapshot storages can change. And then if the node restarts before taking the next bank snapshot, it may fail because the snapshot storages are now wrong.
Please see solana-labs#35376 for more information.
Summary of Changes
At startup, get the storages that were loaded from, for fastboot. Pass them into AccountsHashVerifier, because AHV is in charge of holding fastboot storages to prevent early cleanup.
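The change described above can be sketched roughly as follows. This is an illustration only: the diff in this thread is truncated after `match starting_snapshot_storages {`, so the variant names (`Genesis`, `Archive`, `Fastboot`), the `StorageEntry` type, and the helper `initial_fastboot_storages` are assumptions, not the PR's actual code.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a snapshot storage entry.
struct StorageEntry {
    slot: u64,
}

/// Assumed shape of the startup information passed to AccountsHashVerifier.
/// The variant names are illustrative guesses.
enum StartingSnapshotStorages {
    /// Started from genesis: no snapshot storages to protect.
    Genesis,
    /// Loaded from a snapshot archive: no fastboot storages to protect.
    Archive,
    /// Loaded via fastboot: these are the storages the bank was loaded from.
    Fastboot(Vec<Arc<StorageEntry>>),
}

/// Seeds the verifier's `fastboot_storages` so the startup storages stay
/// alive until the next bank snapshot POST replaces them.
fn initial_fastboot_storages(
    starting: StartingSnapshotStorages,
) -> Option<Vec<Arc<StorageEntry>>> {
    match starting {
        StartingSnapshotStorages::Genesis | StartingSnapshotStorages::Archive => None,
        StartingSnapshotStorages::Fastboot(storages) => Some(storages),
    }
}
```

In this sketch, only the fastboot case yields storages to hold. The other cases start with `None`, which matches the previous initial value shown in the diff.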
I tested this with a node on mnb.
shrink runs)
And it worked! Previously, this would crash.
Fixes solana-labs#35376