Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync error handling #133

Open
wants to merge 15 commits into
base: unstable
Choose a base branch
from
Open

Conversation

Zacholme7
Copy link
Member

Issue Addressed

#119

Proposed Changes

This PR introduces a way to handle rpc errors and signal if sync is stalled.

If a websocket goes down or an rpc endpoint is having issues,OPERATIONAL_STATUS is set to false. The rest of the application can be conditioned on this value to determine if the execution layer is having sync issues.

If there is an rpc error, there is nothing we can do until the endpoint is operational again. The simplest way to test this is just to continuously poll for a block number with exponential backoffs. There is no longer a set number of retries. It just keeps retrying until it is valid again.

@Zacholme7 Zacholme7 added enhancement New feature or request execution layer labels Feb 11, 2025
anchor/eth/src/sync.rs Outdated Show resolved Hide resolved
@Zacholme7 Zacholme7 marked this pull request as ready for review February 14, 2025 13:22
@dknopik
Copy link
Member

dknopik commented Feb 14, 2025

I retested it locally. Could not get it to crash now. Nice!

But I noticed that right now, this error condition does not properly apply any backoff:

// If we get here, the stream ended (likely due to disconnect)
error!("WebSocket stream ended, reconnecting...");

This got me thinking once more about the approach. How do you feel about, instead of having to apply troubleshoot_rpc at multiple appropriate locations throughout the file, just returning an error in the live and historical functions and doing the reconnect logic at the top level? This also handles cases where we want to fall back to historical sync logic after being offline for a considerable amount of time, and helps us assert that we really do not crash out anymore if something fails. There might be some disadvantages I am not thinking of right now though.

@Zacholme7
Copy link
Member Author

@dknopik Threw together a quick POC. I think I that approach works very nice. Take a look when you have a sec and if thats along the lines of what you were thinking ill clean it up and make sure it works.

.await
.expect("Failed to construct event syncer");
let mut event_syncer =
SsvEventSyncer::new(db.clone(), config, Arc::new(AtomicBool::new(false)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it would be bit cleaner to create the Arc within the new method.

@dknopik
Copy link
Member

dknopik commented Feb 14, 2025

Threw together a quick POC. I think I that approach works very nice. Take a look when you have a sec and if thats along the lines of what you were thinking ill clean it up and make sure it works.

yea, seems good! This is what I meant. :​)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request execution layer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants