core/state/snapshot: fix BAD BLOCK error when snapshot is generating #23635
Conversation
Good job!
@holiman @karalabe @rjl493456442 could you please review my PR?
Sorry, I somehow missed this PR -- looks like you went through a lot of debugging to find the problem. I'll investigate this asap, and try to think about whether we can create a test case that somehow catches this.
Hi @zzyalbert, I don't get why the …
@rjl493456442 Yeah, it sure is in most cases. But I think this case could happen because the goroutine resuming snapshot generation may not be scheduled in time.
Whenever the generator receives an interruption signal in the storage callback (i.e. while generating storage data), the aborted error will be bubbled up to interrupt the entire generation.
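(Not go-ethereum code -- a minimal, self-contained Go sketch of the abort pattern described here; the abort channel and the sentinel error name are assumptions. The iteration callback polls the channel and bubbles an error up, which stops the entire generation run.)

```go
package main

import (
	"errors"
	"fmt"
)

var errAborted = errors.New("generation aborted")

// generate walks keys and aborts mid-iteration if the abort channel fires.
func generate(abort <-chan struct{}, keys []string) error {
	for _, k := range keys {
		select {
		case <-abort:
			// The abort is observed inside the per-item callback and the
			// error bubbles up, interrupting the entire generation.
			return errAborted
		default:
		}
		fmt.Println("processed", k)
	}
	return nil
}

func main() {
	abort := make(chan struct{})
	close(abort) // request an abort before the first key is handled
	fmt.Println(generate(abort, []string{"a", "b"}))
}
```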
I agree with @rjl493456442 -- I don't see how that could happen either. However, I'll add a check (the panic on marker regression, seen in the first hunk of the diff below) and see if I can catch it.

Oh lookie (did a goerli sync)
alternative fix, courtesy of @karalabe:

```diff
diff --git a/core/state/snapshot/generate.go b/core/state/snapshot/generate.go
index 3e11b4ac6b..c179a14539 100644
--- a/core/state/snapshot/generate.go
+++ b/core/state/snapshot/generate.go
@@ -563,6 +563,9 @@ func (dl *diskLayer) generate(stats *generatorStats) {
 		// Flush out the batch anyway no matter it's empty or not.
 		// It's possible that all the states are recovered and the
 		// generation indeed makes progress.
+		if bytes.Compare(currentLocation, dl.genMarker) < 0 {
+			panic(fmt.Sprintf("curr < genMarker: %x < %x", currentLocation, dl.genMarker))
+		}
 		journalProgress(batch, currentLocation, stats)
 
 		if err := batch.Write(); err != nil {
@@ -635,7 +638,11 @@ func (dl *diskLayer) generate(stats *generatorStats) {
 			stats.accounts++
 		}
 		// If we've exceeded our batch allowance or termination was requested, flush to disk
-		if err := checkAndFlush(accountHash[:]); err != nil {
+		marker := accountHash[:]
+		if accMarker != nil && bytes.Equal(accountHash[:], accMarker) && len(dl.genMarker) > common.HashLength {
+			marker = append(marker, dl.genMarker[common.HashLength:]...)
+		}
+		if err := checkAndFlush(marker); err != nil {
 			return err
 		}
 		// If the iterated account is the contract, create a further loop to
```

I'll spin up syncs on our benchmarking machines to validate it
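As a side note, here is a standalone sketch of what the second hunk computes; `hashLength` stands in for `common.HashLength` (32 in geth), and the one-byte values are toy data:

```go
package main

import (
	"bytes"
	"fmt"
)

const hashLength = 1 // stand-in for common.HashLength (32 in geth)

func main() {
	accMarker := []byte{0xaa}       // account part of the interrupted marker
	accountHash := []byte{0xaa}     // account currently being flushed
	genMarker := []byte{0xaa, 0x10} // old marker: accountHash + storage suffix

	// Mirror of the second hunk: when flushing at the very account the old
	// marker pointed into, re-append that marker's storage suffix so the
	// journaled position cannot regress below the previous genMarker.
	marker := accountHash[:]
	if accMarker != nil && bytes.Equal(accountHash, accMarker) && len(genMarker) > hashLength {
		marker = append(marker, genMarker[hashLength:]...)
	}
	fmt.Printf("flush marker: %x\n", marker) // aa10, not just aa
}
```

The effect is that the journaled marker keeps the old storage suffix, so it can never end up smaller than the previous `genMarker`.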
Oh, thanks, I really appreciate it. I'll think about it |
```go
// If the snap generation gets here after being interrupted, genMarker may go
// backward when the last genMarker consisted of accountHash and storageHash
if accMarker != nil && bytes.Equal(marker, accMarker) && len(dl.genMarker) > common.HashLength {
	marker = dl.genMarker[:]
}
```
This change makes `marker` use the same backing slice as `dl.genMarker`, whereas previously it was a copy. Might not matter in this case, but generally it's a less safe pattern.
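A standalone illustration of the reviewer's point (plain Go, not go-ethereum code): a sub-slice shares its backing array with the original, while an explicit copy does not.

```go
package main

import "fmt"

func main() {
	genMarker := []byte{1, 2, 3, 4}

	alias := genMarker[:]                    // shares genMarker's backing array
	copied := append([]byte{}, genMarker...) // independent copy

	genMarker[0] = 9 // mutate the original slice

	fmt.Println(alias)  // [9 2 3 4] -- observes the mutation
	fmt.Println(copied) // [1 2 3 4] -- unaffected
}
```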
Thanks a lot for this @zzyalbert, I've had a feeling there was a bug there somewhere, but I haven't been able to pinpoint it, so kudos to you for doing that!
This PR tries to resolve the BAD BLOCK problem in #23531; it was a little obscure to track down.

It occurred when geth was syncing in full sync mode from a block of some days ago, while generating the snapshot from the beginning.

The following is the sequence in which the BAD BLOCK occurs (a standalone sketch of the marker regression follows the list):
1. The snapshot is being generated, and `diskLayer.genMarker = accountHash + storageHash0`.
2. A storage entry with key `accountHash + storageHash1` (where `storageHash1 < storageHash0`) is read from the snapshot diskLayer, because `accountHash + storageHash1 < diskLayer.genMarker`. The storage kv `<accountHash + storageHash1, value1>` is then cached in `diskLayer.cache`.
3. `BlockChain.writeBlockWithState()` runs, and the bottom diffLayer is prepared to be merged into the diskLayer by `snapshot.diffToDisk()`, which sends an abort signal to the snapshot generation process.
4. `checkAndFlush` (core/state/snapshot/generate.go:638) receives the abort signal, but it sets `diskLayer.genMarker = accountHash`, and thus makes `diskLayer.genMarker` go backward (`accountHash < accountHash + storageHash1`).
5. The diffLayer's data is flushed to the diskLayer; `accountHash + storageHash1` is in that data but is ignored, because its key is greater than `diskLayer.genMarker` (`accountHash`).
6. When generation resumes, a new value for `accountHash + storageHash1` is eventually written to the diskLayer (core/state/snapshot/generate.go:673) and `diskLayer.genMarker` grows past `accountHash + storageHash1`; say `diskLayer.genMarker = accountHash + storageHash3`. But the value in `diskLayer.cache` still remains the OLD one.
7. The next time `accountHash + storageHash1` is read, the stale value in `diskLayer.cache` is returned, and that causes the BAD BLOCK error.
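A minimal, self-contained sketch of the regression above (toy one-byte hashes in place of the real 32-byte ones; not go-ethereum code):

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	accountHash := []byte{0xaa}  // stand-in for a 32-byte account hash
	storageHash0 := []byte{0x90} // generator's current storage position
	storageHash1 := []byte{0x10} // slot read by the EVM; storageHash1 < storageHash0

	genMarker := append(append([]byte{}, accountHash...), storageHash0...)
	key := append(append([]byte{}, accountHash...), storageHash1...)

	// Step 2: key < genMarker, so the slot is served from the disk layer
	// and its value is cached.
	fmt.Println("covered before abort:", bytes.Compare(key, genMarker) < 0) // true

	// Step 4: the buggy flush journals only the bare account hash,
	// moving the marker backward past the cached key.
	genMarker = accountHash

	// Step 5: diffToDisk now skips the key because it looks "not yet
	// generated", while the cache silently keeps the stale value.
	fmt.Println("covered after abort:", bytes.Compare(key, genMarker) < 0) // false
}
```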
In terms of the resolution, I have two proposals:

1. Make sure `diskLayer.genMarker` can never go backward, which is what this PR applies.
2. Whenever writing storage/account data to the diskLayer, also check `diskLayer.cache` to make sure any stale value is evicted, like the code in zzyalbert@ad30ba7 (a sketch of the idea follows below).

I currently prefer the first one, because I think the second would add a lot of extra cache queries.
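The commit referenced in proposal 2 is not reproduced here; purely to illustrate the idea, a hypothetical eviction-on-write helper (the map-based cache and the function name are assumptions, not go-ethereum's API) might look like:

```go
package main

import "fmt"

// writeWithEviction is a hypothetical helper: alongside writing the new
// value into the pending batch, it evicts any cached copy of the key so
// readers can no longer observe a stale value.
func writeWithEviction(batch, cache map[string][]byte, key, value []byte) {
	batch[string(key)] = value
	delete(cache, string(key)) // drop the possibly stale cached entry
}

func main() {
	batch := map[string][]byte{}
	cache := map[string][]byte{"k": []byte("old")} // stale entry from step 2

	writeWithEviction(batch, cache, []byte("k"), []byte("new"))
	fmt.Println(batch["k"], cache["k"]) // new value in batch, cache entry gone
}
```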