Reconsider source inputs for apply functions #10

bluekeyes · 2020-05-03T20:23:58Z

After coming back to it to add some features, I'm not happy with the LineReaderAt interface and to some extent the use of io.ReaderAt. This is mostly for text patches, since io.ReaderAt is actually an ideal interface for the needs of binary patches.

Things I don't like:

It's hard to know if you are at the end of the input or not. You have to read a minimal amount of data at what you think is the end offset and see if you get more data or an io.EOF.
It's hard to know how large the input is. As above, you have to read at what you think the length is and see if you get more data or an io.EOF.
The implementation of LineReaderAt wrapping an io.ReaderAt feels complicated, but maybe this is inevitable when you need to build a line index dynamically
It's hard to control the memory usage when reading lines because you can set a number of lines, but have no control over the size of each line.

Any solution needs to solve the following constraints:

Support random access to lines. Strict apply could work without this, but it's required for fuzzy apply, where you slowly backtrack through the file to find a match.
Is a standard library type or can be created from a standard library type, the more widely implemented the better.
Allows end users some control over performance and memory usage for special cases.

Things I've considered:

io.ReaderAt and LineReaderAt: this works well for binary applies (it's the minimal method needed to implement them), but has the problems outlined above for text applies.
io.ReadSeeker: this enables the same features as io.ReaderAt (and is implemented by the same standard library types) but the position tracking and Read function make some things (like copying) easier. Since I don't plan to support concurrent use of the same source, I'm not sure if there's a major difference between using Read and Seek versus using ReadAt.
[]byte: this is simple and supports random access, but doesn't allow much flexibility. The whole source must be in memory and the apply functions will compute the line index as needed even if there was a more efficient way to get it. On the other hand, it reduces the need for internal buffers, so the number of allocations is probably lower. For what it's worth, git takes this approach and reads the full source file into memory for applies.

In my usage so far, everything is already in memory for other reasons, so the []byte might be the simplest. Or maybe io.ReaderAt is the correct interface and I just need a better abstraction on top of it for line operations.

The text was updated successfully, but these errors were encountered:

bluekeyes · 2022-01-24T06:47:37Z

In light of #30, I've been thinking about this more and am now planning something like this:

All apply functions take an io.Reader which may implement additional interfaces
If the reader has a Bytes() []byte function (like bytes.Buffer and a planned gitdiff.BytesReader type), use it to get a byte slice and then perform all operations using that slice.
If the reader implements io.ReaderAt, use that (possibly in combination with Read), similar to the current approach
If neither (2) nor (3) are true, read the entire content into a []byte and then proceed as in (2)

As part of this, I'll drop LineReaderAt and replace it (at least, conceptually) with:

type ReaderAtLine interface {
  ReadAtLine(p []byte, line int) (n int, index []int, err error) 
}

It reads up to len(p) bytes starting at the given line and returns the number of bytes read and a slice giving the (relative) index of each line in p. I think this will work because we should always know how many bytes to expect using the context and old lines of the patch fragment. The first element in index is always 0 and the last element is the first byte after the last line. If this is greater than len(p), it means the last line in p is partial.

I'm not sure if this will be a public type yet. But there will probably be an internal variant that returns the []byte instead of taking it as an argument. This is so that we can read directly from the input data in cases (2) and (4) without making copies.

Of the original constraints, I'm now interpreting

Allows end users some control over performance and memory usage for special cases.

to mean "can apply a change to a large file without holding the whole file in memory." I think offering control beyond that (e.g. over buffer sizes and allocations) will add too much complexity for limited use.

bluekeyes added the applying Issues related to applying patches label May 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reconsider source inputs for apply functions #10

Reconsider source inputs for apply functions #10

bluekeyes commented May 3, 2020

bluekeyes commented Jan 24, 2022 •

edited

Loading

Reconsider source inputs for apply functions #10

Reconsider source inputs for apply functions #10

Comments

bluekeyes commented May 3, 2020

bluekeyes commented Jan 24, 2022 • edited Loading

bluekeyes commented Jan 24, 2022 •

edited

Loading