When a head chain sync starts, if there aren't any peers on the data column subnets, it does not progress and gets stuck until some external event triggers it to resume (e.g. a new peer being added).
This scenario is easily reproducible with PeerDAS, because we don't know a peer's `custody_group_count` until we receive a metadata response from it.
The sequence of events I observed:
1. We connect to multiple peers with an advanced sync state, so we start a new head chain sync (logs `New chain added to sync`, `New head chain started syncing`).
2. Immediately afterwards, sync decides there aren't any peers on its custody data column subnets, because we haven't obtained the peers' metadata yet and don't know their custody counts, so no batch is sent (log `Waiting for peers to be available on custody column subnets`).
3. Later, we obtain their metadata (log `Obtained peer's metadata`), but this doesn't re-trigger range sync, so we're stuck until finalized sync kicks in.
A few possible solutions (not mutually exclusive):
- Compute `peer_info.custody_subnets` when a peer connects, using the minimum custody requirement, since every peer must serve at least the minimum required column count. This way we're likely to have some peers on the data column subnets even before obtaining metadata.
- Update sync when we obtain a peer's metadata, and trigger a resume.
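The first option above can be sketched as a simple fallback rule. This is an illustrative sketch only, not Lighthouse's actual API: the function name and the minimum-custody constant here are hypothetical (the real minimum comes from the spec's custody requirement).

```rust
// Sketch of solution 1 (hypothetical names, not Lighthouse's real API):
// before a peer's metadata arrives, assume it custodies the protocol-minimum
// number of custody groups, since every peer must serve at least that many.

/// Protocol-minimum custody group count every peer must satisfy.
/// The value here is illustrative.
const MIN_CUSTODY_GROUP_COUNT: u64 = 4;

/// Effective custody group count for a peer: use the advertised count once
/// metadata is known, otherwise fall back to the protocol minimum.
fn effective_custody_group_count(metadata_custody_count: Option<u64>) -> u64 {
    metadata_custody_count.unwrap_or(MIN_CUSTODY_GROUP_COUNT)
}

fn main() {
    // A peer whose metadata we have not yet received still counts for the
    // minimum custody subnets, so sync can start requesting batches.
    assert_eq!(effective_custody_group_count(None), MIN_CUSTODY_GROUP_COUNT);
    // Once metadata arrives, the advertised count takes over.
    assert_eq!(effective_custody_group_count(Some(64)), 64);
    println!("ok");
}
```

With this rule, the "no peers on custody column subnets" check would pass for freshly connected peers on the minimum custody subnets, avoiding the stall described above.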
Additional Info
Range sync currently relies on the check `syncing_chain.good_peers_on_sampling_subnets` (in `beacon_node/network/src/sync/range_sync/chain.rs`) before requesting batches from peers:
```rust
if !self.good_peers_on_sampling_subnets(self.to_be_downloaded, network) {
    debug!(
        self.log,
        "Waiting for peers to be available on custody column subnets"
    );
    return None;
}
```
This is a workaround to avoid sending out excessive block requests, because block and data column requests are currently coupled. In the case where we request a batch and there are no peers on the required column subnets, the blocks request is sent but the data columns by range request is not; it fails with `RpcRequestSendError::NoCustodyPeers`. This triggers a retry, and the node ends up sending excessive blocks by range requests to peers without making progress. The longer-term solution is to decouple the ByRange requests (#6258).
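The second proposed fix (resuming sync when metadata arrives) can be sketched as an event handler on the stalled chain. All types and method names below are hypothetical stand-ins, not Lighthouse's real ones; the point is only the shape of the trigger.

```rust
// Hypothetical sketch of solution 2: when a connected peer's metadata (and
// hence its custody_group_count) finally arrives, re-check any chain that
// stalled waiting for custody-subnet peers and resume it.

struct SyncingChain {
    /// Set when the chain returned early from batch requesting because no
    /// peers were known on the required custody column subnets.
    stalled_on_custody_peers: bool,
    /// Number of peers now known to custody the required subnets.
    custody_peer_count: usize,
}

impl SyncingChain {
    /// Called when a peer's metadata response makes its custody count known.
    /// Returns true if the caller should re-run batch requesting.
    fn on_peer_metadata_received(&mut self) -> bool {
        self.custody_peer_count += 1;
        if self.stalled_on_custody_peers {
            self.stalled_on_custody_peers = false;
            return true;
        }
        false
    }
}

fn main() {
    let mut chain = SyncingChain {
        stalled_on_custody_peers: true,
        custody_peer_count: 0,
    };
    // A metadata response arriving should resume the stalled chain.
    assert!(chain.on_peer_metadata_received());
    assert!(!chain.stalled_on_custody_peers);
    println!("resumed");
}
```

In this shape, the `Obtained peer's metadata` event becomes an explicit sync trigger instead of a passive log line, so the chain no longer has to wait for finalized sync to rescue it.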