[#1351] Recover from block stream issues
- changelog updated
- block stream errors are now handled as a special case: the retry logic is triggered, but at most 3 times, in case the service is truly down
- the failure is not passed to the clients, so ideally both the false-positive errors and the delay in the sync time are reduced
LukasKorba committed Jan 25, 2024
1 parent cf6a1e7 commit ef040ff
Showing 3 changed files with 26 additions and 11 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,9 @@ and this library adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### [#1346] Troubleshooting synchronization
 We focused on the performance of the synchronization and found the root cause in progress reporting. A simple change significantly reduced the synchronization time by reporting progress less frequently. This affects the UX a bit because the sync percentage is updated only every 500 scanned blocks instead of every 100. A proper solution is going to be handled in #1353.
 
+### [#1351] Recover from block stream issues
+Async block stream gRPC calls sometimes fail with error 14 (`unavailable`), most of the time represented as `Transport became inactive` or `NIOHTTP2.StreamClosed`. Unless the service is truly down, these errors are usually false positives. The SDK was already able to recover from this error when the next sync was triggered, but that takes 10-30s to happen. This delay is unnecessary, so we made 2 changes: when these errors are caught, the next sync is triggered immediately (at most 3 times), and the error state is not passed to the clients.
+
 # 2.0.5 - 2023-12-15
 
 ## Added
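The policy described in the `[#1351]` entry can be illustrated with a small, self-contained simulation. This is a minimal sketch under stated assumptions, not the SDK's code: `SyncError`, `SyncReport` and `runSync` are hypothetical names, looping again stands in for `resetContext()`, and the default cap of 3 mirrors `ZcashSDK.blockStreamRetries`.

```swift
/// Hypothetical stand-in for the SDK's errors (not ZcashLightClientKit types).
enum SyncError: Error, Equatable {
    case blockStreamFailed  // plays the role of ZcashError.serviceBlockStreamFailed
    case other              // any other sync failure
}

/// What the client observes once the simulated sync loop ends.
struct SyncReport: Equatable {
    let attempts: Int              // sync passes started, including immediate restarts
    let reportedError: SyncError?  // error surfaced to the client, nil if none
}

/// Runs sync passes until one succeeds or an error is surfaced.
/// `pass` receives the 1-based attempt number and returns nil on success.
func runSync(maxBlockStreamRetries: Int = 3, pass: (Int) -> SyncError?) -> SyncReport {
    var blockStreamRetryAttempts = 0
    var attempt = 1
    // Keep restarting while the pass fails with a retryable block stream error.
    while let error = pass(attempt) {
        guard error == .blockStreamFailed, blockStreamRetryAttempts < maxBlockStreamRetries else {
            // Other errors, or an exhausted retry budget: surface the error and stop.
            return SyncReport(attempts: attempt, reportedError: error)
        }
        // Likely a false positive: restart immediately without notifying the client.
        blockStreamRetryAttempts += 1
        attempt += 1
    }
    // The latest pass succeeded: nothing is reported to the client.
    return SyncReport(attempts: attempt, reportedError: nil)
}
```

With a single transient failure the client never sees an error; with a service that is truly down, the error surfaces on the fourth consecutive failure (one initial pass plus three immediate restarts), and any other error surfaces on the first pass.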
29 changes: 18 additions & 11 deletions Sources/ZcashLightClientKit/Block/CompactBlockProcessor.swift
@@ -40,6 +40,7 @@ actor CompactBlockProcessor {
     private let fileManager: ZcashFileManager
 
     private var retryAttempts: Int = 0
+    private var blockStreamRetryAttempts: Int = 0
     private var backoffTimer: Timer?
     private var consecutiveChainValidationErrors: Int = 0
 
@@ -263,6 +264,7 @@ extension CompactBlockProcessor {
     func start(retry: Bool = false) async {
         if retry {
             self.retryAttempts = 0
+            self.blockStreamRetryAttempts = 0
             self.backoffTimer?.invalidate()
             self.backoffTimer = nil
         }
@@ -289,6 +291,7 @@ extension CompactBlockProcessor {
         self.backoffTimer = nil
         await stopAllActions()
         retryAttempts = 0
+        blockStreamRetryAttempts = 0
     }
 
     func latestHeight() async throws -> BlockHeight {
@@ -530,7 +533,17 @@ extension CompactBlockProcessor {
                 await stopAllActions()
                 logger.error("Sync failed with error: \(error)")
 
-                if Task.isCancelled {
+                // catching the block stream error
+                if case ZcashError.serviceBlockStreamFailed = error, self.blockStreamRetryAttempts < ZcashSDK.blockStreamRetries {
+                    // This may be a false-positive communication error that is usually resolved by a retry.
+                    // We will try to reset the sync and continue, but this will be done at most `ZcashSDK.blockStreamRetries` times.
+                    logger.error("ZcashError.serviceBlockStreamFailed, retry is available, starting the sync all over again.")
+
+                    self.blockStreamRetryAttempts += 1
+
+                    // Start sync all over again
+                    await resetContext()
+                } else if Task.isCancelled {
                     logger.info("Processing cancelled.")
                     do {
                         if try await syncTaskWasCancelled() {
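The new branch relies on Swift's enum-case pattern inside `if case`, combined with an extra boolean clause. The sketch below isolates that construct; `DemoError`, `DemoServiceError` and `shouldRetryImmediately` are hypothetical names that only mimic the shape of `ZcashError.serviceBlockStreamFailed`, which carries a `LightWalletServiceError` payload.

```swift
/// Hypothetical payload type, standing in for LightWalletServiceError.
enum DemoServiceError: Error {
    case unavailable(code: Int)
}

/// Hypothetical stand-in for ZcashError; only the case shape matters here.
enum DemoError: Error {
    case serviceBlockStreamFailed(DemoServiceError)
    case databaseFailed
}

/// True when `error` is a block stream failure and the retry budget is not used up yet.
func shouldRetryImmediately(_ error: any Error, attempts: Int, maxAttempts: Int) -> Bool {
    // Matching an enum case against `any Error` performs a conditional cast to DemoError.
    // Leaving out the associated value matches the case regardless of its payload.
    // The clause after the comma is evaluated only if the pattern matches.
    if case DemoError.serviceBlockStreamFailed = error, attempts < maxAttempts {
        return true
    }
    return false
}
```

Because both conditions sit in one `if`, an exhausted budget falls through to the `else if Task.isCancelled` and `else` branches exactly like any other error.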
@@ -545,13 +558,8 @@ extension CompactBlockProcessor {
                         break
                     }
                 } else {
-                    if await handleSyncFailure(action: action, error: error) {
-                        // Start sync all over again
-                        await resetContext()
-                    } else {
-                        // end the sync loop
-                        break
-                    }
+                    await handleSyncFailure(action: action, error: error)
+                    break
                 }
             }
         }
@@ -567,15 +575,13 @@ extension CompactBlockProcessor {
         return try await handleAfterSyncHooks()
     }
 
-    private func handleSyncFailure(action: Action, error: Error) async -> Bool {
+    private func handleSyncFailure(action: Action, error: Error) async {
         if action.removeBlocksCacheWhenFailed {
             await ifTaskIsNotCanceledClearCompactBlockCache()
         }
 
         logger.error("Sync failed with error: \(error)")
         await failure(error)
-
-        return false
     }
 
     // swiftlint:disable:next cyclomatic_complexity
@@ -642,6 +648,7 @@ extension CompactBlockProcessor {
             latestBlockHeightWhenSyncing > 0 && latestBlockHeightWhenSyncing < latestBlockHeight
 
         retryAttempts = 0
+        blockStreamRetryAttempts = 0
         consecutiveChainValidationErrors = 0
 
         let lastScannedHeight = await latestBlocksDataProvider.maxScannedHeight
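Across the hunks above, `blockStreamRetryAttempts` behaves like a small retry budget: it is incremented in the `catch` block and reset in three places (in `start(retry: true)`, in the path that calls `stopAllActions()`, and alongside `consecutiveChainValidationErrors`, which appears to be the post-sync path). A minimal sketch of that lifecycle, assuming a hypothetical `RetryBudget` helper that is not part of ZcashLightClientKit:

```swift
/// Hypothetical helper (not part of ZcashLightClientKit) modelling the
/// `blockStreamRetryAttempts` counter as a fixed budget of immediate restarts.
struct RetryBudget {
    let limit: Int
    private(set) var used = 0

    init(limit: Int) {
        self.limit = limit
    }

    /// Consumes one retry if any remain; returns false once the budget is exhausted.
    mutating func consume() -> Bool {
        guard used < limit else { return false }
        used += 1
        return true
    }

    /// Refills the budget, mirroring the places where the commit sets the counter back to 0.
    mutating func reset() {
        used = 0
    }
}
```

Keeping this budget separate from `retryAttempts` means the cap of 3 applies only to block stream failures; every other error still goes through `handleSyncFailure`.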
5 changes: 5 additions & 0 deletions Sources/ZcashLightClientKit/Constants/ZcashSDK.swift
@@ -105,6 +105,11 @@ public enum ZcashSDK {
     // TODO: [#1304] smart retry logic, https://github.com/zcash/ZcashLightClientKit/issues/1304
     public static let defaultRetries = Int.max
 
+    /// The communication errors are represented as `serviceBlockStreamFailed` wrapping a `LightWalletServiceError` with status 14 (`unavailable`).
+    /// These cases are usually false positives and another try will continue the work; in case the service is truly down, we
+    /// cap the number of retries by this value.
+    public static let blockStreamRetries = 3
+
     /// The default maximum amount of time to wait during retry backoff intervals. Failed loops will never wait longer than
     /// this before retrying.
     public static let defaultMaxBackOffInterval: TimeInterval = 600