Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Fallback RPC when the primary Sync data-source stalled #465

Open
wants to merge 5 commits into
base: dz/fallback-rpc
Choose a base branch
from

Conversation

DZakh
Copy link
Member

@DZakh DZakh commented Mar 6, 2025

  • Better tested
  • Multiple sync sources support
  • Improved logs
  • Fallback support
  • Increase polling interval for stalled chain

What's left:

  • Fix broken existing rpc fallback logic (when you provide multiple URLs)
  • Add wait for new block tests when the request throws
  • Include sanitized RPC URL to the source name to be able to distinguish sources in the logs

@DZakh DZakh requested a review from JonoPrest March 6, 2025 14:38
Comment on lines +12 to +23
let fetchNext: (
t,
~fetchState: FetchState.t,
~currentBlockHeight: int,
~executeQuery: (FetchState.query, ~source: Source.t) => promise<unit>,
~waitForNewBlock: (~currentBlockHeight: int) => promise<int>,
~onNewBlock: (~currentBlockHeight: int) => unit,
~maxPerChainQueueSize: int,
~stateId: int,
) => promise<unit>

let waitForNewBlock: (t, ~currentBlockHeight: int) => promise<int>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to have them as two separate testable functions

~stalledPollingInterval: int=?,
) => t

let getActiveSource: t => Source.t
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JonoPrest As you suggested, I stopped exposing the mutable field 👍

Comment on lines +198 to +207
logger->Logging.childError(
`No new blocks detected within ${(sourceManager.newBlockFallbackStallTimeout / 1000)
->Int.toString}s. Polling will continue at a reduced rate. For better reliability, refer to our RPC fallback guide: https://docs.envio.dev/docs/HyperIndex/rpc-sync`,
)
| _ =>
logger->Logging.childWarn(
`No new blocks detected within ${(sourceManager.newBlockFallbackStallTimeout / 1000)
->Int.toString}s. Continuing polling with fallback RPC sources from the configuration.`,
)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only show warning/error after 20s. Before this, all the other logs are traces.

| [] =>
logger->Logging.childError(
`No new blocks detected within ${(sourceManager.newBlockFallbackStallTimeout / 1000)
->Int.toString}s. Polling will continue at a reduced rate. For better reliability, refer to our RPC fallback guide: https://docs.envio.dev/docs/HyperIndex/rpc-sync`,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to update this one, when there's a docs page 😁

Comment on lines +231 to +237
// Show a higher level log if we displayed a warning/error after newBlockFallbackStallTimeout
let log = status.contents === Stalled ? Logging.childInfo : Logging.childTrace
logger->log({
"msg": `New blocks successfully found.`,
"source": source.name,
"newBlockHeight": newBlockHeight,
})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what we discussed for other places to show a success log after a warning.

Comment on lines +163 to +167
source.sourceFor === Sync ||
// Even if the active source is a fallback, still include
// it to the list. So we don't wait for a timeout again
// if all main sync sources are still not valid
source === sourceManager.activeSource
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might cause problems for returning from fallback to HyperSync, but it may not be a big problem when we are at the head and fallback has smaller latency.

Comment on lines +137 to +140
~delayMilliseconds=Pervasives.max(
retryIntervalMillis.contents * backOffMultiplicative,
60_000,
),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a safeguard to not exceed a minute :)

Comment on lines +120 to +126
let delayMilliseconds = if status.contents === Stalled {
retryIntervalMillis := stalledPollingInterval // Reset possible backOff
stalledPollingInterval
} else {
retryIntervalMillis := initalRetryIntervalMillis // Reset possible backOff
source.pollingInterval
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the most beautiful-looking code, but I well tested it 👍

Comment on lines +131 to +134
logger->Logging.childTrace({
"msg": `Height retrieval from ${source.name} source failed. Retrying in ${retryIntervalMillis.contents->Int.toString}ms.`,
"source": source.name,
"error": exn->ErrorHandling.prettifyExn,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a warning anymore

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant