Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconsider source inputs for apply functions #10

Open
bluekeyes opened this issue May 3, 2020 · 1 comment
Open

Reconsider source inputs for apply functions #10

bluekeyes opened this issue May 3, 2020 · 1 comment
Labels
applying Issues related to applying patches

Comments

@bluekeyes
Copy link
Owner

After coming back to it to add some features, I'm not happy with the LineReaderAt interface and to some extent the use of io.ReaderAt. This is mostly for text patches, since io.ReaderAt is actually an ideal interface for the needs of binary patches.

Things I don't like:

  • It's hard to know if you are at the end of the input or not. You have to read a minimal amount of data at what you think is the end offset and see if you get more data or an io.EOF.
  • It's hard to know how large the input is. As above, you have to read at what you think the length is and see if you get more data or an io.EOF.
  • The implementation of LineReaderAt wrapping an io.ReaderAt feels complicated, but maybe this is inevitable when you need to build a line index dynamically
  • It's hard to control the memory usage when reading lines because you can set a number of lines, but have no control over the size of each line.

Any solution needs to solve the following constraints:

  • Support random access to lines. Strict apply could work without this, but it's required for fuzzy apply, where you slowly backtrack through the file to find a match.
  • Is a standard library type or can be created from a standard library type, the more widely implemented the better.
  • Allows end users some control over performance and memory usage for special cases.

Things I've considered:

  • io.ReaderAt and LineReaderAt: this works well for binary applies (it's the minimal method needed to implement them), but has the problems outlined above for text applies.

  • io.ReadSeeker: this enables the same features as io.ReaderAt (and is implemented by the same standard library types) but the position tracking and Read function make some things (like copying) easier. Since I don't plan to support concurrent use of the same source, I'm not sure if there's a major difference between using Read and Seek versus using ReadAt.

  • []byte: this is simple and supports random access, but doesn't allow much flexibility. The whole source must be in memory and the apply functions will compute the line index as needed even if there was a more efficient way to get it. On the other hand, it reduces the need for internal buffers, so the number of allocations is probably lower. For what it's worth, git takes this approach and reads the full source file into memory for applies.

In my usage so far, everything is already in memory for other reasons, so the []byte might be the simplest. Or maybe io.ReaderAt is the correct interface and I just need a better abstraction on top of it for line operations.

@bluekeyes bluekeyes added the applying Issues related to applying patches label May 3, 2020
@bluekeyes
Copy link
Owner Author

bluekeyes commented Jan 24, 2022

In light of #30, I've been thinking about this more and am now planning something like this:

  1. All apply functions take an io.Reader which may implement additional interfaces
  2. If the reader has a Bytes() []byte function (like bytes.Buffer and a planned gitdiff.BytesReader type), use it to get a byte slice and then perform all operations using that slice.
  3. If the reader implements io.ReaderAt, use that (possibly in combination with Read), similar to the current approach
  4. If neither (2) nor (3) are true, read the entire content into a []byte and then proceed as in (2)

As part of this, I'll drop LineReaderAt and replace it (at least, conceptually) with:

type ReaderAtLine interface {
  ReadAtLine(p []byte, line int) (n int, index []int, err error) 
}

It reads up to len(p) bytes starting at the given line and returns the number of bytes read and a slice giving the (relative) index of each line in p. I think this will work because we should always know how many bytes to expect using the context and old lines of the patch fragment. The first element in index is always 0 and the last element is the first byte after the last line. If this is greater than len(p), it means the last line in p is partial.

I'm not sure if this will be a public type yet. But there will probably be an internal variant that returns the []byte instead of taking it as an argument. This is so that we can read directly from the input data in cases (2) and (4) without making copies.

Of the original constraints, I'm now interpreting

Allows end users some control over performance and memory usage for special cases.

to mean "can apply a change to a large file without holding the whole file in memory." I think offering control beyond that (e.g. over buffer sizes and allocations) will add too much complexity for limited use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
applying Issues related to applying patches
Projects
None yet
Development

No branches or pull requests

1 participant