
fix(sync): Fix client sync RwLock deadlock and client block request stall #3321

Merged: 1 commit into ProvableHQ:mainnet-staging from fix-client-stall on Jun 21, 2024

Conversation

@Meshiest (Contributor) commented on Jun 19, 2024

Motivation

We have observed two different kinds of stalls while conducting client-sync acceptance tests on canarynet. These issues are most noticeable and reproducible when block response processing takes a long time. They can be classified as follows:

  • process.write deadlock during block response
  • block request starvation

process.write Deadlock

Background

There are two places during block sync where process is write-locked (atomic_speculate and atomic_finalize), and two places where it is read-locked (check_execution_internal and check_fee_internal). Both write locks are taken in series within the same try_advancing_with_block_responses call, via check_next_block and add_next_block.

This code is primarily run during a BlockResponse message, which invokes advance_with_sync_blocks. That path has a guard before calling try_advancing_with_block_responses to ensure no two block responses try to write in parallel (which would otherwise create a deadlock).

We have observed that this particular area in atomic_post_ratify can take quite a while (more than 5 seconds) while holding the write lock in atomic_speculate.

Client nodes have a loop, firing every 5 seconds, that kicks off block requests and also calls try_advancing_with_block_responses, but without any guard to check whether another advancement is already running from the aforementioned block responses.

If an atomic_speculate takes long enough to run into the 5-second client sync loop while no block requests are active, the two paths collide and the node stalls permanently, with no block advancement until it is restarted.
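To make the race window concrete, here is a minimal sketch of the two entry points. The struct, method bodies, and timings are illustrative stand-ins rather than the actual snarkOS code; only the function names and the guard (the advance_with_sync_blocks_lock mentioned in the review below) come from this PR's discussion.

```rust
use tokio::sync::Mutex;
use tokio::time::{sleep, Duration};

/// Illustrative stand-in for the client's sync state (not the actual snarkOS types).
struct ClientSync {
    /// Guard used by the BlockResponse path so only one advancement runs at a time.
    advance_with_sync_blocks_lock: Mutex<()>,
}

impl ClientSync {
    /// Stand-in for the shared routine that write-locks `process`
    /// (atomic_speculate / atomic_finalize) and can take more than 5 seconds.
    async fn try_advancing_with_block_responses(&self) {
        sleep(Duration::from_secs(6)).await; // simulate a slow atomic_speculate
    }

    /// Entry point 1: handling a BlockResponse message (guarded).
    async fn advance_with_sync_blocks(&self) {
        // Skip if another block response is already advancing.
        let Ok(_guard) = self.advance_with_sync_blocks_lock.try_lock() else { return };
        self.try_advancing_with_block_responses().await;
    }

    /// Entry point 2: the periodic client-sync tick, fired every 5 seconds.
    /// Before this PR it called the advancement routine with no guard,
    /// so it could overlap with entry point 1 above and contend on `process`.
    async fn try_block_sync(&self) {
        // block requests would be kicked off here (elided)
        self.try_advancing_with_block_responses().await;
    }
}
```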

Fix

We added a guard to try_block_sync and stopped running into this issue.
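Continuing the illustrative sketch above, the change amounts to reusing the same non-blocking guard on the periodic path; this is a sketch under the same assumptions, not the actual diff.

```rust
impl ClientSync {
    /// Periodic sync tick with the fix applied: if a BlockResponse is already
    /// advancing, skip this tick instead of contending for the `process` write lock.
    async fn try_block_sync_fixed(&self) {
        let Ok(_guard) = self.advance_with_sync_blocks_lock.try_lock() else {
            return; // another advancement is in flight; try again on the next tick
        };
        self.try_advancing_with_block_responses().await;
    }
}
```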

Block Request Starvation

Background

For a block request to be created for a given height, it must meet the following criteria (outlined in check_block_request; a sketch of the combined predicate follows the list):

  • Height must be greater than current canon height
  • A request for this height must not already exist
  • A response for this height must not already exist
  • A request timestamp for this height must not already exist
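Expressed as a single predicate, those four criteria look roughly like the following. The struct, field names, and value types are assumptions for illustration only; the criteria themselves are the ones outlined in check_block_request.

```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;
use std::time::Instant;

/// Illustrative sync-state maps; the real structures carry more information.
struct BlockRequests {
    canon_height: u32,
    /// height -> (block hash, previous hash, peers the request was sent to); hash types elided here
    requests: HashMap<u32, ((), (), HashSet<SocketAddr>)>,
    /// height -> block response received so far (payload elided)
    responses: HashMap<u32, ()>,
    /// height -> time the request was sent
    request_timestamps: HashMap<u32, Instant>,
}

impl BlockRequests {
    /// A new block request for `height` is only created when all four criteria hold.
    fn can_create_request(&self, height: u32) -> bool {
        height > self.canon_height
            && !self.requests.contains_key(&height)
            && !self.responses.contains_key(&height)
            && !self.request_timestamps.contains_key(&height)
    }
}
```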

Existing block requests, responses, and timestamps are removed under a handful of (issue-relevant) conditions, which leads to two problems.

If the last peer for a block request disconnects, the block request and its timestamp are removed, but any block response already received is left behind: new requests will not be created for that height (a response already exists), and the request will never be considered complete for continuing block advancement. This results in a permanent stall, with no block advancement until the node is restarted.

Even if that alone is fixed, the node will still reach a state where request timestamps are retained until timeout: an "incomplete request" is currently one whose peer set exists and is non-empty, or one whose peer set has been removed entirely (technically empty, and not actually incomplete). This results in repeated temporary stalls until the block request times out, because new block requests will not be created until the request_timestamp has timed out.

Fix

By flipping the unwrap_or(false) to unwrap_or(true) in requests.get(height).map(|(_, _, peer_ips)| peer_ips.is_empty()).unwrap_or(false), we treat a missing or empty peer_ips set as a "completed" peer request, allowing both block advancement to continue and genuinely incomplete requests to be removed.
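A before/after sketch of that one-line change; the helper function name and surrounding types are hypothetical, while the closure and the unwrap_or flip are quoted from the PR.

```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;

/// height -> (block hash, previous hash, peers yet to respond); hash types elided here.
type Requests = HashMap<u32, ((), (), HashSet<SocketAddr>)>;

/// Before: a request with no surviving entry (last peer disconnected, entry removed)
/// was treated as incomplete, so an orphaned response could never unblock advancement.
fn is_request_complete_before(requests: &Requests, height: u32) -> bool {
    requests.get(&height).map(|(_, _, peer_ips)| peer_ips.is_empty()).unwrap_or(false)
}

/// After: a missing (or emptied) entry counts as complete, so block advancement can
/// continue and the leftover request state can be removed instead of waiting for timeout.
fn is_request_complete_after(requests: &Requests, height: u32) -> bool {
    requests.get(&height).map(|(_, _, peer_ips)| peer_ips.is_empty()).unwrap_or(true)
}
```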

Test Plan

We have successfully deployed this and synced 7 clients (1 alone on a 3995WX, 2 on a shared 2x EPYC 9354, and 4 on standalone GCP e2-standard-16 instances with 16 vCPUs and 64 GB RAM) through canary blocks 28,000 through 35,000, which were particularly stressful on clients.

Last week we ran a 100-client sync test with constant stalling as low as block 8,000:
[screenshot]

Today we have two pairs of 100 clients syncing the troublesome blocks:
[screenshot]

The stall visible in this screenshot corresponds to discovering and fixing the temporary stall issue:
[screenshot]

Note that the servers running these clients are still properly syncing without stalls despite both being under spec and running 10 clients each...

Related PRs

@Meshiest changed the title from "fix(sync): fix client sync mutex deadlock and client block request stall" to "fix(sync): fix client sync rwlock deadlock and client block request stall" on Jun 19, 2024
@Meshiest changed the title from "fix(sync): fix client sync rwlock deadlock and client block request stall" to "fix(sync): Fix client sync RwLock deadlock and client block request stall" on Jun 20, 2024
@vicsn requested review from niklaslong and ljedrz on June 20, 2024 at 07:44
@ljedrz (Collaborator) left a comment


LGTM 👌.

There are likely other scenarios where we are triggering some of the logic too eagerly and should utilize try_lock in order to avoid calling it in rapid succession, though this case was probably the most problematic due to the dual guard setup.

@raychu86 (Contributor) left a comment


LGTM.

We originally only had an advance_with_sync_blocks_lock in advance_with_sync_blocks, but this should also be done on the periodic try_block_sync. Great catch.

@HarukaMa (Contributor) commented:

Can confirm as well that this fixes deadlock and stalls, great work!

@apruden2008 (Contributor) commented:

Thanks @Meshiest @damons for the great work here.

@zosorock this is good to merge pending CI passing

@zosorock merged commit fea092c into ProvableHQ:mainnet-staging on Jun 21, 2024
24 checks passed
@Meshiest deleted the fix-client-stall branch on June 21, 2024
@zosorock added the "bug" (Incorrect or unexpected behavior) label on Jun 22, 2024
@zosorock mentioned this pull request on Jun 23, 2024
Development

Successfully merging this pull request may close these issues:

  • [Bug] Sync module has racing condition on some checks
  • [Bug] snarkOS sometimes stops syncing