Problem
If a node crashes while archiving a full snapshot, and it has created more (incremental) bank snapshots based on that full snapshot, then fastboot will likely fail with an error message like:
incremental snapshot requires accounts hash and capitalization from the full snapshot it is based on
Problem Details
Here's an example based on an error message that @jstarry sent me, after he added the patch from #35353:
[2024-02-29T01:51:48.833574440Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solAcctHashVer" one=1i message="panicked at core/src/accounts_hash_verifier.rs:328:21:
0: rust_begin_unwind
at ./rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
1: core::panicking::panic_fmt
at ./rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
2: solana_core::accounts_hash_verifier::AccountsHashVerifier::process_accounts_package
incremental snapshot requires accounts hash and capitalization from the full snapshot it is based on
package: AccountsPackage { kind: Snapshot(IncrementalSnapshot(251086076)), slot: 251099346, block_height: 231733000, .. }
accounts hashes: {251098305: (AccountsHash(BNauzhxdBL7ZjVdQYwA5sFX6U4ZFnX8x1z2faXhCx5vy), 570710273002510414)}
incremental accounts hashes: {251099045: (IncrementalAccountsHash(HTXWvmVKKHB8RYwHfjdgr1oX1rmchDQC6kuBH2jeAo18), 20095246361845682)}
full snapshot archives: [FullSnapshotArchiveInfo(SnapshotArchiveInfo { path: \"/mnt/snapshots/snapshot-251086076-7ZCZ8PiKRTxjgmaQTXKGaPoEYA688deZ4x8r6FHUS5qQ.tar.zst\", slot: 251086076, hash: SnapshotHash(7ZCZ8PiKRTxjgmaQTXKGaPoEYA688deZ4x8r6FHUS5qQ), archive_format: TarZstd }), FullSnapshotArchiveInfo(SnapshotArchiveInfo { path: \"/mnt/snapshots/snapshot-251073753-325bXiW6oATv8i6LezYPR2hNu4P3JtrCECQLZBiJ9J22.tar.zst\", slot: 251073753, hash: SnapshotHash(325bXiW6oATv8i6LezYPR2hNu4P3JtrCECQLZBiJ9J22), archive_format: TarZstd })]
bank snapshots: [BankSnapshotInfo { slot: 251099346, snapshot_type: Pre, snapshot_dir: \"/mnt/incremental-snapshots/snapshot/251099346\", snapshot_version: V1_2_0 }, BankSnapshotInfo { slot: 251099045, snapshot_type: Post, snapshot_dir: \"/mnt/incremental-snapshots/snapshot/251099045\", snapshot_version: V1_2_0 }]" location="core/src/accounts_hash_verifier.rs:328:21" version="1.17.22 (src:2c5aa387; feat:3580551090, client:JitoLabs)"
Here's the sequence of events behind the error:
SPS (the SnapshotPackagerService) is in the middle of archiving a full snapshot (slot 251098305 in this case)
AHV (the AccountsHashVerifier) processes the next incremental snapshot successfully, so its bank snapshot is created, and its expected full snapshot slot is 251098305
Maybe a few more incremental bank snapshots are created too
The node crashes before the full snapshot finishes being archived
At the next startup, fastboot will grab the latest bank snapshot (slot 251099346), which will contain the accounts hashes for the full snapshot at slot 251098305
When ABS (the AccountsBackgroundService) starts up, it is told the slot of the last full snapshot archive, which is 251086076
The first new snapshot request sent to ABS will be an incremental snapshot. ABS knows that the last full snapshot was for slot 251086076, so it packages up the new snapshot with the old full snapshot slot and sends it over to AHV
AHV sees the incremental snapshot request and says "You asked me to make an incremental snapshot, so I need to know about the full snapshot it is based on. Please tell me the full snapshot slot and its accounts hash". The request says its full snapshot is 251086076, but AccountsDb only knows about 251098305
And then the panic is triggered.
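The failing lookup can be sketched as follows. This is a hypothetical simplification of what AccountsHashVerifier effectively does when processing an incremental snapshot package, not the actual code in accounts_hash_verifier.rs; the struct and function names are made up for illustration:

```rust
use std::collections::HashMap;

type Slot = u64;

/// Simplified stand-in for AccountsDb's in-memory cache of computed full
/// accounts hashes (hypothetical type; the real one also tracks capitalization).
struct AccountsHashes {
    full: HashMap<Slot, String>, // slot -> full accounts hash
}

/// For an IncrementalSnapshot(base_slot) package, AHV must find the full
/// accounts hash for `base_slot`. After the crash scenario above, the cache
/// only contains slot 251098305 while the package says 251086076, so this
/// lookup fails and the process panics with the message from the issue.
fn process_incremental(hashes: &AccountsHashes, base_slot: Slot) -> String {
    hashes
        .full
        .get(&base_slot)
        .cloned()
        .expect(
            "incremental snapshot requires accounts hash and capitalization \
             from the full snapshot it is based on",
        )
}
```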
More info
Here are more logs from Justin's machine. In particular:
[2024-02-29T01:33:55.752948313Z INFO solana_runtime::snapshot_bank_utils] Creating bank snapshot for slot 251098305, path: /mnt/incremental-snapshots/snapshot/251098305/251098305.pre
[2024-02-29T01:34:06.521515316Z INFO solana_runtime::snapshot_bank_utils] bank serialize took 3.9s for slot 251098305 at /mnt/incremental-snapshots/snapshot/251098305/251098305.pre
[2024-02-29T01:34:06.522263848Z INFO solana_runtime::snapshot_package] Package snapshot for bank 251098305 has 421257 account storage entries (snapshot kind: FullSnapshot)
[2024-02-29T01:34:06.522275888Z INFO solana_runtime::accounts_background_service] Took bank snapshot. accounts package kind: Snapshot(FullSnapshot), slot: 251098305, bank hash: DrAkL3ceUCtYC5TEGx5orqkLn3FXiPQ5JxwWK8jgn1vT
[2024-02-29T01:34:06.859756460Z INFO solana_core::accounts_hash_verifier] handling accounts package: AccountsPackage { kind: Snapshot(FullSnapshot), slot: 251098305, block_height: 231732000, .. }
[2024-02-29T01:34:51.320797198Z INFO solana_runtime::snapshot_package] Package snapshot for bank 251098409 has 421253 account storage entries (snapshot kind: IncrementalSnapshot(251098305))
[2024-02-29T01:34:51.320814578Z INFO solana_runtime::accounts_background_service] Took bank snapshot. accounts package kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098409, bank hash: 7RiJFiJqvkZ4YchCo7SxtdjrwN4F5zrKXz4Ngy5NwpUE
[2024-02-29T01:34:53.575408055Z INFO solana_accounts_db::accounts_db] calculate_accounts_hash_from_storages: slot: 251098305, Full(AccountsHash(BNauzhxdBL7ZjVdQYwA5sFX6U4ZFnX8x1z2faXhCx5vy)), capitalization: 570710273002510414
[2024-02-29T01:35:06.154583143Z INFO solana_metrics::metrics] datapoint: fastboot slot=251098305i num_storages_total=421257i num_storages_kept_alive=15i
[2024-02-29T01:35:06.154587592Z INFO solana_core::accounts_hash_verifier] handling accounts package: AccountsPackage { kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098409, block_height: 231732100, .. }
[2024-02-29T01:35:06.233502620Z INFO solana_core::snapshot_packager_service] handling snapshot package: SnapshotPackage { type: FullSnapshot, slot: 251098305, block_height: 231732000, .. }
[2024-02-29T01:35:06.233526780Z INFO solana_runtime::snapshot_utils] Generating snapshot archive for slot 251098305
[2024-02-29T01:35:27.307496396Z INFO solana_runtime::snapshot_package] Package snapshot for bank 251098509 has 421247 account storage entries (snapshot kind: IncrementalSnapshot(251098305))
[2024-02-29T01:35:27.307507596Z INFO solana_runtime::accounts_background_service] Took bank snapshot. accounts package kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098509, bank hash: HggkzD851JtZK6JxpHuadqRayB5yUh1zU5Dn2ZKu7xLG
[2024-02-29T01:35:27.592021341Z INFO solana_core::accounts_hash_verifier] handling accounts package: AccountsPackage { kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098509, block_height: 231732200, .. }
[2024-02-29T01:36:08.734285801Z INFO solana_runtime::snapshot_package] Package snapshot for bank 251098613 has 421251 account storage entries (snapshot kind: IncrementalSnapshot(251098305))
[2024-02-29T01:36:08.734305441Z INFO solana_runtime::accounts_background_service] Took bank snapshot. accounts package kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098613, bank hash: 3FRAp6Yn6EKibVTXDhEmVxsppvSSRFXP3PAZidj5UFgV
[2024-02-29T01:36:08.737970571Z INFO solana_core::accounts_hash_verifier] handling accounts package: AccountsPackage { kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098613, block_height: 231732300, .. }
[2024-02-29T01:36:53.860137018Z INFO solana_runtime::snapshot_package] Package snapshot for bank 251098713 has 421255 account storage entries (snapshot kind: IncrementalSnapshot(251098305))
[2024-02-29T01:36:53.860152468Z INFO solana_runtime::accounts_background_service] Took bank snapshot. accounts package kind: Snapshot(IncrementalSnapshot(251098305)), slot: 251098713, bank hash: B5ciX3tdiSdAZVoHz6vQ2THRGdhuhfBiBPY7ZPjvR97p
This confirms that:
The node was archiving a full snapshot for slot 251098305 when it crashed
The node had created multiple incremental bank snapshots beyond slot 251098305
Proposed Solution
Yikes! So we need some way to identify whether the fastboot bank snapshot matches the actual full snapshot archives on disk. And if it doesn't, that bank snapshot should be purged.
Unfortunately, we cannot use older bank snapshots, because their account storage files have likely been recycled/shrunk. So we need to fall back to using a snapshot archive.
(Edit: The recycler has now been removed, so in theory we could use older bank snapshots. This needs testing first.)
Option 1:
When taking a snapshot, add a new file indicating full vs incremental, plus the important slots. Then, at load time, fastboot can see if the bank snapshot is an incremental one, and what its base slot is. If there's no full snapshot archive with that slot, then we cannot use this snapshot; delete it.
If this process is done before we decide to fastboot or not, then it should correctly restart with a snapshot archive.
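A minimal sketch of Option 1, assuming a hypothetical marker-file name and format ("snapshot_kind" containing either "full" or "incremental <base_slot>"); none of these names are from the real snapshot code:

```rust
use std::collections::HashSet;
use std::fs;
use std::path::Path;

type Slot = u64;

/// Hypothetical marker written alongside each bank snapshot when it is taken.
fn write_kind_file(snapshot_dir: &Path, base_slot: Option<Slot>) -> std::io::Result<()> {
    let contents = match base_slot {
        None => "full".to_string(),
        Some(base) => format!("incremental {base}"),
    };
    fs::write(snapshot_dir.join("snapshot_kind"), contents)
}

/// At startup, before deciding to fastboot: a bank snapshot is only usable if
/// it is a full snapshot, or its base slot matches an existing full snapshot
/// archive. Otherwise the caller should delete it and fall back to archives.
fn is_fastboot_usable(snapshot_dir: &Path, full_archive_slots: &HashSet<Slot>) -> bool {
    let Ok(contents) = fs::read_to_string(snapshot_dir.join("snapshot_kind")) else {
        return false; // no marker: cannot verify, fall back to archives
    };
    match contents.split_whitespace().collect::<Vec<_>>().as_slice() {
        ["full"] => true,
        ["incremental", base] => base
            .parse::<Slot>()
            .map(|slot| full_archive_slots.contains(&slot))
            .unwrap_or(false),
        _ => false, // unrecognized marker: treat as unusable
    }
}
```

In the crash scenario above, the marker for the slot-251099346 bank snapshot would say "incremental 251098305", the full archives on disk would only contain slots 251086076 and 251073753, and the check would correctly reject fastboot.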
Option 2:
Similar to Option 1, we add the same new file to indicate the important slots. But instead, at load time, if there's no full snapshot archive for the given slot, we immediately generate a new full snapshot archive for the next snapshot request. This may require more code changes, and may increase disk IO, but it does start up from a more recent slot than Option 1. If there's another crash before the new full snapshot archive is made, we'll likely end up in the same scenario.
(h/t to @apfitzge for this possible solution)
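Option 2's decision point can be sketched like this (again with made-up names; the real change would live in the snapshot-request handling, not a standalone function):

```rust
type Slot = u64;

#[derive(Debug, PartialEq)]
enum SnapshotRequestKind {
    Full,
    Incremental,
}

/// Option 2 (sketch): if the fastboot bank snapshot's base slot has no
/// matching full snapshot archive on disk, promote the next snapshot request
/// from incremental to full, so a valid base archive gets created right away.
fn next_request_kind(base_slot: Slot, full_archive_slots: &[Slot]) -> SnapshotRequestKind {
    if full_archive_slots.contains(&base_slot) {
        SnapshotRequestKind::Incremental
    } else {
        SnapshotRequestKind::Full
    }
}
```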
Work-arounds
There are some work-arounds available already, and they boil down to loading from a snapshot archive instead of local state.
Use --use-snapshot-archives-at-startup always to force loading from a snapshot archive
Delete the (non-archive) snapshots directory (often ledger/snapshots/), which is where the local state lives. By deleting this directory, startup will fall back to loading from a snapshot archive.
Now that #35350 has merged, the problematic local state will be removed automatically. So a subsequent restart—without needing to do anything manually—should work.
An additional note: currently the system cannot recover from this on its own, meaning subsequent reboots will keep hitting this panic (or the other one about invalid append vecs: #35190).
Luckily #35350 will recover, so if there's one failure, the next reboot will successfully fall back to loading from a snapshot archive.