After coming back to it to add some features, I'm not happy with the LineReaderAt interface and to some extent the use of io.ReaderAt. This is mostly for text patches, since io.ReaderAt is actually an ideal interface for the needs of binary patches.
Things I don't like:
It's hard to know if you are at the end of the input or not. You have to read a minimal amount of data at what you think is the end offset and see if you get more data or an io.EOF.
It's hard to know how large the input is. As above, you have to read at what you think the length is and see if you get more data or an io.EOF.
The implementation of LineReaderAt wrapping an io.ReaderAt feels complicated, but maybe this is inevitable when you need to build a line index dynamically
It's hard to control the memory usage when reading lines because you can set a number of lines, but have no control over the size of each line.
Any solution needs to solve the following constraints:
Support random access to lines. Strict apply could work without this, but it's required for fuzzy apply, where you slowly backtrack through the file to find a match.
Is a standard library type or can be created from a standard library type, the more widely implemented the better.
Allows end users some control over performance and memory usage for special cases.
Things I've considered:
io.ReaderAt and LineReaderAt: this works well for binary applies (it's the minimal method needed to implement them), but has the problems outlined above for text applies.
io.ReadSeeker: this enables the same features as io.ReaderAt (and is implemented by the same standard library types) but the position tracking and Read function make some things (like copying) easier. Since I don't plan to support concurrent use of the same source, I'm not sure if there's a major difference between using Read and Seek versus using ReadAt.
[]byte: this is simple and supports random access, but doesn't allow much flexibility. The whole source must be in memory and the apply functions will compute the line index as needed even if there was a more efficient way to get it. On the other hand, it reduces the need for internal buffers, so the number of allocations is probably lower. For what it's worth, git takes this approach and reads the full source file into memory for applies.
In my usage so far, everything is already in memory for other reasons, so the []byte might be the simplest. Or maybe io.ReaderAt is the correct interface and I just need a better abstraction on top of it for line operations.
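For comparison, a rough sketch of what the []byte option could look like: the line index is a slice of start offsets, and random access to a line is a slice expression. The names lineIndex and line are made up for this example and do not come from the library.

```go
package main

import "bytes"

// lineIndex returns the byte offset at which each line of src starts.
// A final line without a trailing newline still counts as a line.
func lineIndex(src []byte) []int {
	var index []int
	for pos := 0; pos < len(src); {
		index = append(index, pos)
		nl := bytes.IndexByte(src[pos:], '\n')
		if nl < 0 {
			break // last line has no trailing newline
		}
		pos += nl + 1
	}
	return index
}

// line returns line n (zero-based) of src, including its newline if present.
// It returns a sub-slice of src, so no copy or internal buffer is needed.
func line(src []byte, index []int, n int) []byte {
	end := len(src)
	if n+1 < len(index) {
		end = index[n+1]
	}
	return src[index[n]:end]
}
```

Backtracking for a fuzzy apply is then just decrementing n, at the cost of holding the whole source in memory.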
In light of #30, I've been thinking about this more and am now planning something like this:
1. All apply functions take an io.Reader, which may implement additional interfaces.
2. If the reader has a Bytes() []byte function (like bytes.Buffer and a planned gitdiff.BytesReader type), use it to get a byte slice and then perform all operations using that slice.
3. If the reader implements io.ReaderAt, use that (possibly in combination with Read), similar to the current approach.
4. If neither (2) nor (3) is true, read the entire content into a []byte and then proceed as in (2).
As part of this, I'll drop LineReaderAt and replace it (at least, conceptually) with:
type ReaderAtLine interface {
    ReadAtLine(p []byte, line int) (n int, index []int, err error)
}
It reads up to len(p) bytes starting at the given line and returns the number of bytes read and a slice giving the (relative) index of each line in p. I think this will work because we should always know how many bytes to expect using the context and old lines of the patch fragment. The first element in index is always 0 and the last element is the first byte after the last line. If this is greater than len(p), it means the last line in p is partial.
I'm not sure if this will be a public type yet. But there will probably be an internal variant that returns the []byte instead of taking it as an argument. This is so that we can read directly from the input data in cases (2) and (4) without making copies.
Of the original constraints, I'm now interpreting
Allows end users some control over performance and memory usage for special cases.
to mean "can apply a change to a large file without holding the whole file in memory." I think offering control beyond that (e.g. over buffer sizes and allocations) will add too much complexity for limited use.