fix(sync): Fix client sync RwLock deadlock and client block request stall #3321
Conversation
LGTM 👌. There are likely other scenarios where we are triggering some of the logic too eagerly and should utilize `try_lock` in order to avoid calling it in rapid succession, though this case was probably the most problematic due to the dual guard setup.
LGTM. We originally only had an `advance_with_sync_blocks_lock` in `advance_with_sync_blocks`, but this should also be done on the periodic `try_block_sync`. Great catch.
Can confirm as well that this fixes the deadlock and stalls, great work!
Motivation
We have observed two different kinds of stalls while conducting client-sync acceptance tests on canarynet. These issues are most noticeable and reproducible when block response processing takes a significant amount of time. The two issues can be classified as follows:
`process.write` Deadlock

Background
There are two places during block sync where `process` is write-locked (`atomic_speculate` and `atomic_finalize`), and two places where `process` is read-locked (`check_execution_internal` and `check_fee_internal`). Both of the write locks occur within the same `try_advancing_with_block_responses`, in series in `check_next_block` and `add_next_block`.

This code is primarily run while handling a `BlockResponse` message, which invokes `advance_with_sync_blocks`. That path has a guard before calling `try_advancing_with_block_responses` to ensure no block responses attempt to write in parallel (which would create a deadlock).

We have observed that this particular area in `atomic_post_ratify` can take quite a while (more than 5 seconds) while holding the write lock in `atomic_speculate`.

Client nodes also have a loop, firing every 5 seconds, that kicks off block requests and coincidentally calls `try_advancing_with_block_responses` as well, but without a guard to check whether another advance is already running from the aforementioned block response handling.

If an `atomic_speculate` took long enough and continued into the 5-second client sync loop while no block requests were active, this issue would occur and result in a permanent stall without block advancement until the node is restarted.

Fix
We added a guard to `try_block_sync` and stopped running into this issue.
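As a rough sketch of the guard pattern (a simplified stand-in, not the actual snarkOS code; the struct, field types, and use of `Mutex`/`RwLock` here are assumptions named after the identifiers in this description), a non-blocking `try_lock` lets the periodic loop skip a tick instead of contending with an in-flight block response advance:

```rust
use std::sync::{Mutex, RwLock};

/// Hypothetical, simplified stand-in for the client sync state (illustration only).
struct ClientSync {
    /// Guard ensuring only one caller advances with block responses at a time.
    advance_with_sync_blocks_lock: Mutex<()>,
    /// Stand-in for the `process` state that is write-locked during block advancement.
    process: RwLock<u32>,
}

impl ClientSync {
    /// Periodic tick (analogous to `try_block_sync`): if another caller (e.g. a
    /// `BlockResponse` handler) is already advancing, skip this tick instead of
    /// piling onto the `process` write lock.
    fn try_block_sync(&self) {
        if let Ok(_guard) = self.advance_with_sync_blocks_lock.try_lock() {
            self.try_advancing_with_block_responses();
        }
        // Otherwise, do nothing; the next 5-second tick will try again.
    }

    /// Only reached while the advance guard is held.
    fn try_advancing_with_block_responses(&self) {
        let mut height = self.process.write().unwrap();
        *height += 1; // placeholder for the check_next_block / add_next_block work
    }
}

fn main() {
    let sync = ClientSync {
        advance_with_sync_blocks_lock: Mutex::new(()),
        process: RwLock::new(0),
    };
    sync.try_block_sync();
    println!("height = {}", *sync.process.read().unwrap());
}
```

If the guard is already held, the tick simply does nothing, and advancement is retried on the next 5-second iteration rather than queueing up behind the `process` write lock.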
Block Request Starvation

Background
In order for block requests to be created for a given height, it must meet the following criteria (outlined in `check_block_request`):

Existing block requests, responses, and timestamps are removed given the following (issue-relevant) criteria:
If the last peer for a block request disconnects and the block request and timestamp are removed, a block response will be left in a cycle where new requests will not be created and the request will not be considered complete for continuing block advancement. This condition results in a permanent stall without block advancement until the node is restarted.
If that alone is fixed, the node will still reach a state where request timestamps are retained until timeout, as an "incomplete request" would be one where the request's peer set exists and is not empty OR one where the peer set has been removed (technically empty and not incomplete). This condition results in repeated temporary stalls until the block request is timed out, as new block requests will not be created until the `request_timestamp` times out.

Fix
By flipping the `unwrap_or(false)` to `unwrap_or(true)` in `requests.get(height).map(|(_, _, peer_ips)| peer_ips.is_empty()).unwrap_or(false)`, we are treating an empty `peer_ips` set as a "completed" peer request, allowing both block advancement to continue and real incomplete blocks to be removed.
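As an illustration of what that one-character change does, here is a sketch of the completeness check with placeholder types for the request map (the actual snarkOS types and surrounding code differ; only the closure shape is taken from the description above):

```rust
use std::collections::{BTreeMap, HashSet};
use std::net::SocketAddr;
use std::time::Instant;

// Hypothetical shape of the request map: height -> (timestamp, attempts, peer set).
// The real field types may differ; only the closure shape comes from the description.
type Requests = BTreeMap<u32, (Instant, u8, HashSet<SocketAddr>)>;

/// A height counts as "complete" when its peer set is empty, or (after this fix)
/// when the entry has already been removed entirely.
fn is_request_complete(requests: &Requests, height: u32) -> bool {
    requests
        .get(&height)
        .map(|(_, _, peer_ips)| peer_ips.is_empty())
        // Previously `unwrap_or(false)`: a fully removed request was treated as
        // incomplete forever, stalling advancement. `unwrap_or(true)` treats a
        // missing entry as complete.
        .unwrap_or(true)
}

fn main() {
    let requests = Requests::new();
    // A height whose request entry was removed (e.g. the last peer disconnected and
    // the request was cleaned up) is now considered complete, so sync can advance.
    assert!(is_request_complete(&requests, 10));
}
```

With `unwrap_or(true)`, a request entry that has been removed entirely no longer reads as permanently incomplete, while a request with a non-empty peer set still correctly reads as in flight.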
Test Plan

We have successfully deployed this and synced 7 clients (1 alone on a 3995WX, 2 on a shared 2x EPYC 9354, and 4 on standalone GCP 16 vCPU / 64 GB `e2-standard-16` instances) through canary blocks 28,000 through 35,000, which were particularly stressful on clients.

Last week we ran a 100-client sync test with constant stalling as low as block 8,000.
Today we have two pairs of 100 clients syncing the troublesome blocks.
The stall in this screenshot is from discovering and fixing the temporary stall issue.
Note that the servers running these clients are still properly syncing without stalls despite both being under spec and running 10 clients each...
Related PRs