Commit
Allow task failure on disk imports (#1002)
* Allow task failure on disk imports

Previously, when `oxide` initiated many concurrent uploads, network congestion could cause packet loss and timeouts for some worker tasks. While we added the undocumented `--parallelism` flag as a temporary workaround, this required users to manually diagnose network issues and adjust settings themselves.

This commit attempts to make disk imports more reliable by allowing the job to continue when a subset of upload tasks have failed. When a worker encounters a network error, it reports the failed chunk's file offset back to the main task. The main task will retry these failed chunks after completing the initial upload attempt. This creates natural backpressure - as network congestion increases and more tasks fail, the number of concurrent uploads automatically decreases to a sustainable level.

This requires a switch to a single `mpmc` channel shared between all worker tasks to avoid losing any buffered jobs when a task errors out.

While we're at it, clean up disk imports with:
* Replace synchronous file i/o with tokio fs

* Remove redundant atomic progress variable

The `watch` channel gives us a thread-safe way to communicate changes between tasks. Just use the `send_modify` method to increment the value and get rid of the unneeded `AtomicU64`.

* Dedupe error messages

Worker tasks are likely to encounter the same error when uploading chunks. Rather than listing the same error N times, dedupe them and show only unique errors.

* Make upload_thread_ct NonZeroUsize

It is invalid to set upload_thread_ct to zero. Update the API to enforce this constraint.

* Rename upload_thread_ct to upload_task_ct

The `upload_thread_ct` variable name is inaccurate, as we're creating tokio tasks to perform uploads, not OS threads. Correct the name to `upload_task_ct`.

* Update copyright year

---------

Co-authored-by: Adam Leventhal <[email protected]>
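The retry behavior described in the commit message can be illustrated with a small sketch. This is not the code from the commit: it is a sequential, standard-library-only simplification, whereas the commit describes tokio worker tasks fed from a shared `mpmc` channel. The names `upload_with_retry`, `max_retry_passes`, and the `upload` closure are hypothetical and stand in for the real upload call.

```rust
use std::collections::VecDeque;

/// Hypothetical outcome of uploading one chunk; `Err` carries an error message.
type UploadResult = Result<(), String>;

/// Attempt every chunk offset once, remember the offsets that failed, and
/// retry only those in later passes. Returns the offsets that still fail
/// after the final pass.
///
/// Assumption: `upload` stands in for the real network call, and the
/// passes run sequentially rather than across concurrent tasks.
fn upload_with_retry<F>(
    offsets: &[u64],
    max_retry_passes: usize,
    mut upload: F,
) -> Result<(), Vec<u64>>
where
    F: FnMut(u64) -> UploadResult,
{
    let mut pending: VecDeque<u64> = offsets.iter().copied().collect();

    // Pass 0 is the initial attempt; each later pass retries earlier failures.
    for _pass in 0..=max_retry_passes {
        let mut failed = VecDeque::new();
        while let Some(offset) = pending.pop_front() {
            if upload(offset).is_err() {
                // Report the failed chunk's file offset instead of aborting.
                failed.push_back(offset);
            }
        }
        if failed.is_empty() {
            return Ok(());
        }
        pending = failed;
    }

    Err(pending.into_iter().collect())
}
```

Because each retry pass only contains the chunks that failed previously, later passes carry less work than the first one, which loosely mirrors the backpressure effect the message describes.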
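Two of the smaller cleanups in the message, the `NonZeroUsize` task count and the deduplicated error list, might look roughly like the sketch below. The helper names `parse_task_count` and `unique_errors` and the error text are assumptions made for illustration; only `NonZeroUsize` and the `upload_task_ct` name come from the commit message.

```rust
use std::collections::BTreeSet;
use std::num::NonZeroUsize;

/// Convert a raw task count into a `NonZeroUsize`, rejecting zero up front
/// so that an invalid `upload_task_ct` cannot be constructed.
/// (Hypothetical helper; the error message is illustrative.)
fn parse_task_count(raw: usize) -> Result<NonZeroUsize, String> {
    NonZeroUsize::new(raw)
        .ok_or_else(|| "upload_task_ct must be greater than zero".to_string())
}

/// Keep only the first occurrence of each error message, preserving the
/// order in which the errors were first seen.
fn unique_errors(errors: &[String]) -> Vec<String> {
    let mut seen = BTreeSet::new();
    let mut unique = Vec::new();
    for e in errors {
        // `insert` returns false when the message was already recorded.
        if seen.insert(e.as_str()) {
            unique.push(e.clone());
        }
    }
    unique
}
```

Encoding the constraint in the type moves the zero check to the point where the value is created, so code that consumes the count does not need to re-validate it.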