fix(concurrency): applying blocks concurrently can lead to unexpected errors #700
Conversation
Hey @mtsitrin, I think it makes sense, but these things are always tricky.
I added some comments; could you please check whether they make sense?
Also, from my analysis, maybe we can combine the produce and execute mutexes into one mutex and rename it to something that makes sense for both cases?
Original problem description:
Since store and ABCI access is not atomic, races can occur if two goroutines try to apply a block at the same time.
Current mutex usage:
- `submitBatchMutex sync.Mutex`
  - `handleSubmissionTrigger`: try-lock and unlock; calls `produceBlock`
- `produceBlockMutex sync.Mutex`
  - `produceBlock` (from `handleSubmissionTrigger` and `ProduceBlockLoop`): lock and unlock; calls `applyBlock`
- `executeBlockMutex sync.Mutex`
  - `applyBlockCallback`: lock and unlock; calls apply block
  - `syncUntilTarget` (from `initialSync` and the retrieve loop): lock and unlock; calls apply block
My reasoning is that, depending on whether the node is an aggregator or not, it never uses both sets of mutexes.
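For reference, a minimal Go sketch of the locking layout described above; field names follow the discussion, and the actual `Manager` in the repo has many more fields.

```go
package block

import "sync"

// Sketch of the locking layout described above (illustrative, not the real struct).
type Manager struct {
	// Aggregator path.
	submitBatchMutex  sync.Mutex // handleSubmissionTrigger: try-lock/unlock, then produceBlock
	produceBlockMutex sync.Mutex // produceBlock (ProduceBlockLoop / submission trigger), then applyBlock

	// Full-node path.
	executeBlockMutex sync.Mutex // applyBlockCallback (gossip) and syncUntilTarget (retriever), then applyBlock
}

// A node runs either the aggregator loops or the full-node loops, so the
// produce and execute mutexes are never both in use by the same process;
// that is the argument for merging them into one well-named mutex.
```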
@danwt you're right, one thing at a time.
block/retriever.go (outdated)
@@ -42,6 +42,9 @@ func (m *Manager) RetriveLoop(ctx context.Context) {
// It fetches the batches from the settlement, gets the DA height and gets
// the actual blocks from the DA.
func (m *Manager) syncUntilTarget(ctx context.Context, syncTarget uint64) error {
	m.executeBlockMutex.Lock()
This process (`syncUntilTarget`) can be long, as it needs to actually fetch the data from the DA upon a new sync target (which can take dozens of seconds or even longer).
It seems to me that during this time no gossiped blocks will be applied, as the lock will block them (so basically RPC nodes won't provide "real-time" responses while syncing from the DA).
I think the lock should be more fine-grained, covering the actual execution of the blocks and not the fetching from the DA.
Why are we not just putting it directly on the `applyBlock`/`executeBlock` function to only allow one caller access?
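A minimal sketch of the fine-grained locking suggested here, with placeholder types and a stubbed DA fetch; the real `syncUntilTarget`, store, and batch APIs in dymint differ.

```go
package block

import (
	"context"
	"sync"
)

// Placeholder types, for illustration only; not the real dymint types.
type Block struct{ Height uint64 }

type Batch struct{ Blocks []*Block }

type Manager struct {
	executeBlockMutex sync.Mutex
	height            uint64 // last applied height, guarded by executeBlockMutex
}

// Height reads the last applied height under the lock.
func (m *Manager) Height() uint64 {
	m.executeBlockMutex.Lock()
	defer m.executeBlockMutex.Unlock()
	return m.height
}

// syncUntilTarget keeps the slow DA fetch outside the lock; only applyBlock
// serializes callers, so gossiped blocks are not starved during a long sync.
func (m *Manager) syncUntilTarget(ctx context.Context, syncTarget uint64) error {
	for m.Height() < syncTarget {
		batch, err := m.fetchBatchFromDA(ctx, m.Height()+1) // no lock held across the fetch
		if err != nil {
			return err
		}
		for _, b := range batch.Blocks {
			if err := m.applyBlock(b); err != nil {
				return err
			}
		}
	}
	return nil
}

// applyBlock is the single place that takes the mutex and mutates state.
func (m *Manager) applyBlock(b *Block) error {
	m.executeBlockMutex.Lock()
	defer m.executeBlockMutex.Unlock()
	if b.Height != m.height+1 {
		return nil // already applied or out of order: skip
	}
	// execute the block against the app/store here
	m.height = b.Height
	return nil
}

// fetchBatchFromDA is a stub standing in for the real (potentially slow) DA retrieval.
func (m *Manager) fetchBatchFromDA(ctx context.Context, from uint64) (*Batch, error) {
	return &Batch{Blocks: []*Block{{Height: from}}}, nil
}
```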
Agreed, fetching data won't be mutex-locked.

> during this time no gossiped blocks will be applied

A gossiped block will be applied if its height is correct, which rather conflicts with the syncing process. I think it's fine for a gossiped block to wait while syncing is in progress; in the happy flow there's no syncing anyway (as blocks are gossiped).

> directly on the applyBlock/executeBlock function

The lock is more about `store.Height()` and not only about execution.
Still not sure I understand why `executeBlockMutex` can't live inside `applyBlock`.
The only operation I see being protected by the current implementation is `m.store.NextHeight()`.
If that's indeed the case, I suggest making the `nextHeight` (or, in general, height) access atomic, thereby simplifying the lock and only putting it on the `applyBlock` function.
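A rough sketch of the atomic-height idea using Go's `sync/atomic`; the field and method names are assumptions, and dymint's actual store API may look different.

```go
package block

import (
	"sync"
	"sync/atomic"
)

// Sketch of the atomic-height idea: height reads don't need the mutex, so the
// lock can live entirely inside applyBlock. Names are illustrative.
type Manager struct {
	executeBlockMutex sync.Mutex
	height            atomic.Uint64 // last applied block height
}

// Height can be read from any goroutine without taking the mutex.
func (m *Manager) Height() uint64 { return m.height.Load() }

// NextHeight is the height the next block is expected to have.
func (m *Manager) NextHeight() uint64 { return m.height.Load() + 1 }

func (m *Manager) applyBlock(blockHeight uint64) error {
	m.executeBlockMutex.Lock()
	defer m.executeBlockMutex.Unlock()
	// Re-check under the lock: another caller may have applied this height already.
	if blockHeight != m.NextHeight() {
		return nil
	}
	// ... execute the block ...
	m.height.Store(blockHeight)
	return nil
}
```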
block/retriever.go (outdated)
}
// check for cached blocks
err := m.attemptApplyCachedBlocks(ctx)
This shouldn't be here; it should be in the `applyBlockCallback` function. Not sure when it was removed from there.
The purpose of this is the following:
- the full node is at height x
- the full node got gossiped blocks at heights x+2, x+3, ..., x+n
- the full node can't apply them until it has received x+1, so it keeps those blocks in the cache
- once the full node gets x+1, it applies all the rest from the cache.
Not sure why this was moved here or what its purpose is here.
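A rough sketch of the cache behavior described in these bullets, using an illustrative map-based cache rather than the actual dymint code.

```go
package block

import "sync"

// Sketch of the gossip cache behavior described above: blocks that arrive
// ahead of the expected height are cached, and once the missing height shows
// up everything contiguous is applied. Names are illustrative.
type Block struct{ Height uint64 }

type Manager struct {
	mu         sync.Mutex
	height     uint64            // last applied height
	blockCache map[uint64]*Block // gossiped blocks waiting to be applied
}

func NewManager(height uint64) *Manager {
	return &Manager{height: height, blockCache: make(map[uint64]*Block)}
}

// onGossipedBlock is roughly what an applyBlockCallback-style handler does.
func (m *Manager) onGossipedBlock(b *Block) {
	m.mu.Lock()
	defer m.mu.Unlock()

	m.blockCache[b.Height] = b // may be ahead of the next expected height

	// Apply everything contiguous starting from the next expected height.
	for {
		next, ok := m.blockCache[m.height+1]
		if !ok {
			return // gap: wait for the missing height (e.g. x+1) to arrive
		}
		m.executeLocked(next)
		delete(m.blockCache, next.Height)
	}
}

// executeLocked stands in for applying the block against the app/store.
func (m *Manager) executeLocked(b *Block) { m.height = b.Height }
```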
I think it's important for it to be here:
- the full node missed x
- the full node got x+1 through x+3000
- it got a sync target of, say, 2000 from the SL
It makes sense to go over the cache and apply whatever is possible; otherwise you'll wait for the next gossiped block.
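A sketched flow of this argument, building on the cache sketch above (same illustrative `Manager`); `syncFromDA` and `attemptApplyCachedBlocks` here are placeholder helpers, not the exact dymint functions.

```go
// After syncing up to the SL sync target, go over the cache once and apply
// whatever is contiguous, instead of waiting for the next gossiped block.
func (m *Manager) syncUntilTarget(ctx context.Context, syncTarget uint64) error {
	// Apply blocks from the DA up to the SL sync target (slow path).
	if err := m.syncFromDA(ctx, syncTarget); err != nil {
		return err
	}
	// The node may already hold gossiped blocks above syncTarget in the cache
	// (e.g. it missed x, holds x+1..x+3000, and the SL target was only 2000),
	// so apply them now rather than waiting for the next gossiped block.
	return m.attemptApplyCachedBlocks(ctx)
}

// Stubs standing in for the real DA sync and cache-apply steps.
func (m *Manager) syncFromDA(ctx context.Context, target uint64) error { return nil }
func (m *Manager) attemptApplyCachedBlocks(ctx context.Context) error  { return nil }
```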
Please request my review again after the conflicts and Omri's questions are resolved.
pruning cache only on synctarget
@@ -140,13 +141,17 @@ func (m *Manager) attemptApplyCachedBlocks() error {
	m.logger.Debug("applied cached block", "height", expectedHeight)
}

return nil
I removed the pruning from each gossiped block, as it's not efficient: it goes over all the cached blocks.
It will be called when syncing the node instead.
block/block.go (outdated)
}

// pruneCache prunes the cache of gossiped blocks.
func (m *Manager) pruneCache() {
Not sure why we need this pruneCache vs. just deleting each cached block after we apply it.
I guess it was to handle the cases where blocks were applied not from the cache but from the sync target.
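A small sketch of the pruning variant being discussed, with illustrative names and an assumed prune criterion of "height at or below the last applied height".

```go
package block

// Sketch of the pruning discussed here: blocks applied from the DA/sync target
// never go through a per-block cache delete, so a single prune pass drops
// every cached block at or below the current height. Names are illustrative.
type Block struct{ Height uint64 }

type Manager struct {
	height     uint64            // last applied height
	blockCache map[uint64]*Block // gossiped blocks waiting to be applied
}

// pruneCache runs once per sync target instead of on every gossiped block.
func (m *Manager) pruneCache() {
	for h := range m.blockCache {
		if h <= m.height {
			delete(m.blockCache, h)
		}
	}
}
```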
block/block.go (outdated)
func (m *Manager) attemptApplyCachedBlocks() error {
	m.applyCachedBlockMutex.Lock()
	defer m.applyCachedBlockMutex.Unlock()
	m.executeBlockMutex.Lock()
As mentioned before, I think it's best to have this lock on the `ApplyBlock` function (and change its name accordingly): even if you hit the race condition on `NextHeight` because a block with this height is currently being applied, `ApplyBlock` has a sanity check on the correct height, so the block won't be applied.
I think it makes the code much more elegant, and it simplifies reading and dealing with future `applyBlock` callers.
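A sketch of what this suggestion could look like; the signature, field names, and error value are assumptions for illustration, not the real dymint API.

```go
package block

import (
	"errors"
	"sync"
)

// Sketch of the suggestion in this comment: the lock lives inside ApplyBlock
// itself, and the height sanity check makes a lost race harmless.
var errInvalidHeight = errors.New("block height is not the expected next height")

type Block struct{ Height uint64 }

type Manager struct {
	applyBlockMutex sync.Mutex // single lock, renamed to match its scope
	height          uint64     // last applied height
}

func (m *Manager) ApplyBlock(b *Block) error {
	m.applyBlockMutex.Lock()
	defer m.applyBlockMutex.Unlock()

	// Sanity check under the lock: if another caller (gossip vs. retriever)
	// already applied this height, reject the block instead of re-applying it.
	if b.Height != m.height+1 {
		return errInvalidHeight
	}
	// ... execute the block and commit state ...
	m.height = b.Height
	return nil
}
```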
This mutex is not only for `applyBlock`; it's a mutex between the retriever thread and the gossip thread.
There are multiple params that can be accessed concurrently and need protection (e.g. `blockCache`, store height, "state" (apply block)).
I can change it if you prefer, but IMO it's not hermetic enough.
- `applyBlock` for full node (can be called from gossiping and when retrieving blocks)
- moved `m.attemptApplyCachedBlocks` from `processNextDABatch` to `syncUntilTarget`

Close #658