
fix(sync): Fix client sync RwLock deadlock and client block request stall #3321

Merged: 1 commit into ProvableHQ:mainnet-staging from fix-client-stall on Jun 21, 2024

Conversation

@Meshiest (Contributor) commented on Jun 19, 2024

Motivation

We have observed two different kinds of stalls while conducting client-sync acceptance tests on canarynet. These issues are most noticeable and reproducible when block response processing takes a long time. They can be classified as follows:

  • process.write deadlock during block response
  • block request starvation

process.write Deadlock

Background

There are two places during block sync where process is write-locked (atomic_speculate and atomic_finalize), and two places where it is read-locked (check_execution_internal and check_fee_internal). Both write locks are taken in series within the same try_advancing_with_block_responses call, via check_next_block and add_next_block.

This code is primarily run during a BlockResponse message, which invokes advance_with_sync_blocks. That path has a guard before calling try_advancing_with_block_responses to ensure no two block responses try to write in parallel (which would otherwise create a deadlock).

We have observed that this particular area in atomic_post_ratify can take quite a while (more than 5 seconds) while holding the write lock in atomic_speculate.

Client nodes have a loop, firing every 5 seconds, that kicks off block requests and also calls try_advancing_with_block_responses, but without any guard to check whether another advancement is already running from the aforementioned block responses.

If an atomic_speculate takes long enough to run into the 5-second client sync loop while no block requests are active, the two paths collide and the node stalls permanently, with no block advancement until it is restarted.
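To make the race window concrete, here is a minimal sketch of the two entry points. The struct, method bodies, and timings are illustrative stand-ins rather than the actual snarkOS code; only the function names and the guard (the advance_with_sync_blocks_lock mentioned in the review below) come from this PR's discussion.

```rust
use tokio::sync::Mutex;
use tokio::time::{sleep, Duration};

/// Illustrative stand-in for the client's sync state (not the actual snarkOS types).
struct ClientSync {
    /// Guard used by the BlockResponse path so only one advancement runs at a time.
    advance_with_sync_blocks_lock: Mutex<()>,
}

impl ClientSync {
    /// Stand-in for the shared routine that write-locks `process`
    /// (atomic_speculate / atomic_finalize) and can take more than 5 seconds.
    async fn try_advancing_with_block_responses(&self) {
        sleep(Duration::from_secs(6)).await; // simulate a slow atomic_speculate
    }

    /// Entry point 1: handling a BlockResponse message (guarded).
    async fn advance_with_sync_blocks(&self) {
        // Skip if another block response is already advancing.
        let Ok(_guard) = self.advance_with_sync_blocks_lock.try_lock() else { return };
        self.try_advancing_with_block_responses().await;
    }

    /// Entry point 2: the periodic client-sync tick, fired every 5 seconds.
    /// Before this PR it called the advancement routine with no guard,
    /// so it could overlap with entry point 1 above and contend on `process`.
    async fn try_block_sync(&self) {
        // block requests would be kicked off here (elided)
        self.try_advancing_with_block_responses().await;
    }
}
```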

Fix

We added a guard to try_block_sync and stopped running into this issue.
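Continuing the illustrative sketch above, the change amounts to reusing the same non-blocking guard on the periodic path; this is a sketch under the same assumptions, not the actual diff.

```rust
impl ClientSync {
    /// Periodic sync tick with the fix applied: if a BlockResponse is already
    /// advancing, skip this tick instead of contending for the `process` write lock.
    async fn try_block_sync_fixed(&self) {
        let Ok(_guard) = self.advance_with_sync_blocks_lock.try_lock() else {
            return; // another advancement is in flight; try again on the next tick
        };
        self.try_advancing_with_block_responses().await;
    }
}
```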

Block Request Starvation

Background

For a block request to be created for a given height, it must meet the following criteria (outlined in check_block_request; a sketch of the combined predicate follows the list):

  • Height must be greater than current canon height
  • A request for this height must not already exist
  • A response for this height must not already exist
  • A request timestamp for this height must not already exist
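Expressed as a single predicate, those four criteria look roughly like the following. The struct, field names, and value types are assumptions for illustration only; the criteria themselves are the ones outlined in check_block_request.

```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;
use std::time::Instant;

/// Illustrative sync-state maps; the real structures carry more information.
struct BlockRequests {
    canon_height: u32,
    /// height -> (block hash, previous hash, peers the request was sent to); hash types elided here
    requests: HashMap<u32, ((), (), HashSet<SocketAddr>)>,
    /// height -> block response received so far (payload elided)
    responses: HashMap<u32, ()>,
    /// height -> time the request was sent
    request_timestamps: HashMap<u32, Instant>,
}

impl BlockRequests {
    /// A new block request for `height` is only created when all four criteria hold.
    fn can_create_request(&self, height: u32) -> bool {
        height > self.canon_height
            && !self.requests.contains_key(&height)
            && !self.responses.contains_key(&height)
            && !self.request_timestamps.contains_key(&height)
    }
}
```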

Existing block requests, responses, and timestamps are removed under a handful of (issue-relevant) conditions, which leads to two problems.

If the last peer for a block request disconnects, the block request and its timestamp are removed, but any block response already received is left behind: new requests will not be created for that height (a response already exists), and the request will never be considered complete for continuing block advancement. This results in a permanent stall, with no block advancement until the node is restarted.

Even if that alone is fixed, the node will still reach a state where request timestamps are retained until timeout: an "incomplete request" is currently one whose peer set exists and is non-empty, or one whose peer set has been removed entirely (technically empty, and not actually incomplete). This results in repeated temporary stalls until the block request times out, because new block requests will not be created until the request_timestamp has timed out.

Fix

By flipping the unwrap_or(false) to unwrap_or(true) in requests.get(height).map(|(_, _, peer_ips)| peer_ips.is_empty()).unwrap_or(false), we treat a missing or empty peer_ips set as a "completed" peer request, allowing both block advancement to continue and genuinely incomplete requests to be removed.
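A before/after sketch of that one-line change; the helper function name and surrounding types are hypothetical, while the closure and the unwrap_or flip are quoted from the PR.

```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;

/// height -> (block hash, previous hash, peers yet to respond); hash types elided here.
type Requests = HashMap<u32, ((), (), HashSet<SocketAddr>)>;

/// Before: a request with no surviving entry (last peer disconnected, entry removed)
/// was treated as incomplete, so an orphaned response could never unblock advancement.
fn is_request_complete_before(requests: &Requests, height: u32) -> bool {
    requests.get(&height).map(|(_, _, peer_ips)| peer_ips.is_empty()).unwrap_or(false)
}

/// After: a missing (or emptied) entry counts as complete, so block advancement can
/// continue and the leftover request state can be removed instead of waiting for timeout.
fn is_request_complete_after(requests: &Requests, height: u32) -> bool {
    requests.get(&height).map(|(_, _, peer_ips)| peer_ips.is_empty()).unwrap_or(true)
}
```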

Test Plan

We have successfully deployed this and synced 7 clients (1 alone on a 3995WX, 2 on a shared 2x EPYC 9354, and 4 on standalone GCP e2-standard-16 instances with 16 vCPUs and 64 GB RAM) through canary blocks 28,000 through 35,000, which were particularly stressful on clients.

Last week we ran a 100-client sync test with constant stalling as low as block 8,000:
[screenshot]

Today we have two pairs of 100 clients syncing the troublesome blocks:
[screenshot]

The stall visible in this screenshot corresponds to discovering and fixing the temporary stall issue:
[screenshot]

Note that the servers running these clients are still properly syncing without stalls despite both being under spec and running 10 clients each...

Related PRs

@Meshiest changed the title from "fix(sync): fix client sync mutex deadlock and client block request stall" to "fix(sync): fix client sync rwlock deadlock and client block request stall" on Jun 19, 2024
@Meshiest changed the title from "fix(sync): fix client sync rwlock deadlock and client block request stall" to "fix(sync): Fix client sync RwLock deadlock and client block request stall" on Jun 20, 2024
@vicsn requested review from niklaslong and ljedrz on June 20, 2024 at 07:44
@ljedrz (Collaborator) left a comment


LGTM 👌.

There are likely other scenarios where we are triggering some of the logic too eagerly and should utilize try_lock in order to avoid calling it in rapid succession, though this case was probably the most problematic due to the dual guard setup.

@raychu86 (Contributor) left a comment


LGTM.

We originally only had an advance_with_sync_blocks_lock in advance_with_sync_blocks, but this should also be done on the periodic try_block_sync. Great catch.

@HarukaMa (Contributor) commented:

Can confirm as well that this fixes deadlock and stalls, great work!

@apruden2008 (Contributor) commented:

Thanks @Meshiest @damons for the great work here.

@zosorock this is good to merge pending CI passing

@zosorock merged commit fea092c into ProvableHQ:mainnet-staging on Jun 21, 2024
24 checks passed
@Meshiest deleted the fix-client-stall branch on June 21, 2024
@zosorock added the "bug" (Incorrect or unexpected behavior) label on Jun 22, 2024
@zosorock mentioned this pull request on Jun 23, 2024
Development

Successfully merging this pull request may close these issues:

  • [Bug] Sync module has racing condition on some checks
  • [Bug] snarkOS sometimes stops syncing