Implement optional concurrent "Range" requests (refs #86) #102
base: master
Conversation
I've realised that I have left the returned …
If multiple goroutines are writing to the output in parallel, will the output file have gaps in between partially written chunks? If so, how would resuming a partial download work?
@ananthb I think it is possible in the current implementation for the resume to not be correct, in a situation where a later range concurrently finishes sooner than an earlier range and then the transfer is stopped. The reason is that as each concurrent range completes writing, it atomically adds to the total bytes written. So if ranges 10, 20, and 40 finish but 30 does not, it would report 30 total bytes written, yet there would be a gap and the file itself would look like it had …
@ananthb I've just pushed 8b4f8d2 to address the support for resume with ranged requests. It will now truncate the file to the end of the lowest successful range offset before a failure, to avoid any gaps. This means you may lose progress on some chunks that concurrently finished just after the failed range.
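For clarity, the truncation idea might look roughly like the sketch below. The helper name and the chunk bookkeeping are illustrative only, not the actual code in 8b4f8d2; it assumes fixed-size ranges tracked by their starting offsets.

```go
package sketch

import "os"

// truncateToContiguous cuts the file back to the end of the lowest
// contiguous run of completed fixed-size ranges, so a later resume never
// sees a gap. Chunks that finished beyond the first gap are discarded.
func truncateToContiguous(f *os.File, chunkSize int64, completed map[int64]bool) error {
	var safe int64
	for completed[safe] { // walk ranges in order until the first gap
		safe += chunkSize
	}
	return f.Truncate(safe)
}
```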
Yep that makes sense @justinfx.
What happens if grab crashes before it can truncate the file? If you only wrote completed chunks to the file, then you wouldn't need to truncate it in the first place.
@ananthb yea if the process crashed before it could truncate, it would leave a file that could be corrupt and then used for resume. I definitely don't think we should buffer in memory because that could have surprising resource usage implications on large files.
Yeah in-memory buffers alone won't be enough to cover, say, large chunk sizes or many parallel chunks. I've been toying with the idea of an in-memory buffer that spills over onto disk. My basic idea to make resume work is that the output file should always be consistent and not have any "holes". Basically, write only completed chunks, in order, to the output file. I could use anonymous buffers to hold in-progress chunks too.
The current implementation splits a large download into chunks by the number of parallel workers, so I wonder if your idea could manage to avoid large memory usage. You might have to buffer a lot before a gap closes.
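As a rough sketch of the "write only completed chunks, in order" idea, and of where the buffering cost shows up, here is one possible shape. All names are hypothetical; this is not grab's or chonker's code.

```go
package sketch

import "io"

// inOrderWriter illustrates the idea: chunks may complete out of order, but
// only a contiguous prefix is ever written to the output file.
type inOrderWriter struct {
	w       io.Writer      // the output file
	next    int            // index of the next chunk allowed to hit the file
	pending map[int][]byte // completed but not-yet-writable chunks
}

// Complete records a finished chunk and flushes as many in-order chunks as
// the buffer now allows.
func (iw *inOrderWriter) Complete(idx int, data []byte) error {
	iw.pending[idx] = data
	for buf, ok := iw.pending[iw.next]; ok; buf, ok = iw.pending[iw.next] {
		if _, err := iw.w.Write(buf); err != nil {
			return err
		}
		delete(iw.pending, iw.next)
		iw.next++
	}
	return nil
}
```

The pending map is exactly where memory can grow while a gap stays open, which is the concern raised above.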
@justinfx I wrote a library called chonker that does Range GETs transparently. |
@ananthb nice one. It's been a while and I have moved on from this issue, having made use of it in a fork with this feature.
This is an implementation of an optional feature to have a Request download the payload in multiple chunks, using a "Range" request, if supported by the server.

API updates
The Request struct gains a new field called RangeRequestMax, which when set to > 0 controls how many chunks to download in parallel using a "Range" request, instead of a single request reading the full body.
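For illustration, a minimal usage sketch of the proposed field, assuming the upstream v3 import path and the existing grab request/response API; only RangeRequestMax is new here, and the URL and destination are placeholders.

```go
package main

import (
	"fmt"
	"log"

	"github.com/cavaliergopher/grab/v3"
)

func main() {
	req, err := grab.NewRequest("/tmp/archive.tar.gz", "https://example.com/archive.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	// The new field from this PR: download in up to 4 parallel "Range"
	// chunks when the server supports it; leaving it 0 keeps the existing
	// single-request behaviour.
	req.RangeRequestMax = 4

	resp := grab.DefaultClient.Do(req)
	if err := resp.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("saved to", resp.Filename)
}
```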
Implementation details

High level steps:
- the ranged download path is used when the server supports it and RangeRequestMax > 0
- the copy phase uses an alternate transfer implementation, called transferRanges
- copyFile works the same as before

Given the way the state machine works, it seemed easier to launch the concurrent range requests during the copy phase, instead of in the synchronous getRequest state and having to monitor a list of requests. A new transferer interface has been introduced, to have a second implementation called transferRanges.
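A plausible shape for such an interface, inferred only from the methods mentioned in this description (Copy, N, BPS) rather than copied from the diff:

```go
package sketch

// transferer is a sketch of the interface described above; the exact method
// set in the PR may differ.
type transferer interface {
	Copy() (int64, error) // perform the (possibly ranged) copy of the body
	N() int64             // total bytes transferred so far
	BPS() float64         // current transfer rate in bytes per second
}
```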
transferRanges implementation

This alternate implementation handles launching the concurrent range requests, doing the copy of the data, and tracking the metrics.
It seemed easier to just pass the HEAD Response to the private constructor, as most of the needed details are present on that struct.

transferRanges.Copy() will start a number of Range requests, each with its own offset-limit, in goroutines, per the value of RangeRequestMax passed in. Each goroutine writes directly to the open file writer using WriteAt (with the underlying pwrite() syscall) to write chunks of data at offsets to the same file descriptor in parallel. Metrics are atomically updated to keep N() and BPS() working.
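The paragraph above translates roughly into the following pattern. This is a standalone sketch of concurrent range fetches writing through WriteAt (via io.NewOffsetWriter, Go 1.20+), not the PR's actual transferRanges code; the function and variable names are illustrative.

```go
package sketch

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"sync"
	"sync/atomic"
)

// fetchRanges downloads a payload of known size as `chunks` concurrent
// "Range" requests. Each worker writes at its own offset, so all workers
// share one *os.File without locking, and a single atomic counter tracks
// the total bytes written (the rough equivalent of the N() bookkeeping).
func fetchRanges(url string, f *os.File, size, chunks int64) error {
	var written int64
	var wg sync.WaitGroup
	errs := make(chan error, chunks)
	chunkSize := (size + chunks - 1) / chunks

	for i := int64(0); i < chunks; i++ {
		offset := i * chunkSize
		if offset >= size {
			break
		}
		limit := offset + chunkSize - 1
		if limit >= size {
			limit = size - 1
		}
		wg.Add(1)
		go func(offset, limit int64) {
			defer wg.Done()
			req, err := http.NewRequest("GET", url, nil)
			if err != nil {
				errs <- err
				return
			}
			req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, limit))
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				errs <- err
				return
			}
			defer resp.Body.Close()
			// io.NewOffsetWriter adapts WriteAt into sequential writes
			// starting at this chunk's offset.
			n, err := io.Copy(io.NewOffsetWriter(f, offset), resp.Body)
			atomic.AddInt64(&written, n)
			if err != nil {
				errs <- err
			}
		}(offset, limit)
	}
	wg.Wait()
	close(errs)
	return <-errs // nil if every range completed
}
```

Because each worker writes to a disjoint byte range, no mutex is needed around the file; this is the property the WriteAt/pwrite approach relies on.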
Other details
I did also update the default setting in the Client when it comes to setting "TCP_NODELAY". In the Go standard lib, this is set to true on the net.TCPConn to target RPC workloads. But it would make more sense, in theory, not to disable Nagle's Algorithm by default in a library dedicated to downloading files over HTTP. It can still be controlled by the user supplying a custom HTTPClient.
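As a rough illustration of that last point, a custom HTTPClient with its own dialer can set TCP_NODELAY explicitly. The helper name and dialer wiring below are hypothetical; only grab.NewClient and the HTTPClient field come from the library.

```go
package sketch

import (
	"context"
	"net"
	"net/http"

	"github.com/cavaliergopher/grab/v3"
)

// newClientWithNagle shows one way a caller can keep control of TCP_NODELAY
// regardless of the library default: supply a custom HTTPClient whose dialer
// re-enables Nagle's algorithm on each connection.
func newClientWithNagle() *grab.Client {
	client := grab.NewClient()
	client.HTTPClient = &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				conn, err := (&net.Dialer{}).DialContext(ctx, network, addr)
				if err != nil {
					return nil, err
				}
				if tc, ok := conn.(*net.TCPConn); ok {
					// Go's default is SetNoDelay(true), i.e. Nagle disabled;
					// false re-enables Nagle for bulk transfers.
					_ = tc.SetNoDelay(false)
				}
				return conn, nil
			},
		},
	}
	return client
}
```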