RandomAccessStream + ReaderInfo #131

Open
kwhopper opened this issue May 4, 2018 · 1 comment

Comments

@kwhopper
Collaborator

kwhopper commented May 4, 2018

Would like some review and comments on #130.

This library has had numerous issues because it depends on byte arrays for internal processing. Segments are generally read into a byte buffer so that tags and values can be parsed out, and any sub-IFDs or nested segments are read into further byte arrays. As a result, the global or relative position of those bytes has been hard to track. We have tried adding things like Offset to JpegSegment, but the meaning of that value can get lost. This makes offset-based processing, like saving out thumbnails, difficult, and makes it virtually impossible to get the location accuracy needed for writing metadata back.

One possible solution occurred to me while looking through the ExifTool source code. While it does use a ton of byte arrays, the properties of the current operation are stored in an object that tells downstream processors where to start and what the limits are. The new ReaderInfo class brings that general idea to this library. It has these properties:

  • Always knows its own global StartPosition inside the stream or array
  • Tracks its own Length
  • Supports index- and sequential-based processing by tracking LocalPosition relative to StartPosition
  • GlobalPosition is always known by adding StartPosition to LocalPosition
  • Can be cloned with the current properties, relative to the current position, etc. Even if the new instance has the exact same properties, it is independent from that point forward
  • The need for byte buffers downstream is greatly reduced if this object is passed along
  • Doesn't know or care how the bytes are read; it defers that to a RandomAccessStream
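
To make the position bookkeeping in the list above concrete, here is a minimal sketch in Python. It is not the C# code from #130: the method names (`get_bytes`, `read`, `clone`) and their signatures are assumptions made for illustration, and a plain `bytes` object stands in for the RandomAccessStream.

```python
from dataclasses import dataclass


@dataclass
class ReaderInfo:
    """Illustrative stand-in only; names and signatures are hypothetical."""
    source: bytes            # assumption: plain bytes instead of a RandomAccessStream
    start_position: int      # global offset where this region begins
    length: int              # number of bytes this reader may touch
    local_position: int = 0  # cursor relative to start_position

    @property
    def global_position(self) -> int:
        # Always derivable: StartPosition + LocalPosition.
        return self.start_position + self.local_position

    def get_bytes(self, index: int, count: int) -> bytes:
        """Index-based read relative to start_position; the cursor does not move."""
        if index < 0 or count < 0 or index + count > self.length:
            raise IndexError("read outside this reader's region")
        begin = self.start_position + index
        return self.source[begin:begin + count]

    def read(self, count: int) -> bytes:
        """Sequential read from the cursor; advances local_position."""
        data = self.get_bytes(self.local_position, count)
        self.local_position += count
        return data

    def clone(self, offset: int = 0, length=None) -> "ReaderInfo":
        """New, independent reader starting `offset` bytes past the current position."""
        rel = self.local_position + offset
        if length is None:
            length = self.length - rel
        if rel < 0 or length < 0 or rel + length > self.length:
            raise IndexError("clone would fall outside this reader's region")
        return ReaderInfo(self.source, self.start_position + rel, length)
```

In this sketch a segment parser would read its header sequentially, then hand `clone(...)` to a sub-IFD processor. The clone carries its own global start, so an offset found downstream can still be mapped back to a position in the file without copying bytes into a new array.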

The last one brings up the other piece that makes this possible. RandomAccessStream is intended to replace all other readers in the current library. It is largely a buffered, indexed, capturing reader implementation, but it takes into account whether the stream it wraps supports seeking. It reads in chunks of a fixed length (to be made configurable in the future). If the stream is seekable, the reader seeks to index n * chunk length and reads that chunk. If the stream isn't seekable and a chunk beyond the first is requested, all chunks needed to reach the desired chunk are read first (desirable for forward-only network streams, etc.).
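
The chunking behaviour described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the RandomAccessStream implementation in #130: the class name `ChunkedReader`, the `bytes_read` counter, the default chunk length, and the `ForwardOnly` helper are all hypothetical.

```python
import io


class ChunkedReader:
    """Hypothetical sketch of a buffered, indexed reader that caches chunks."""

    def __init__(self, stream, chunk_length=2048):
        self._stream = stream
        self._chunk_length = chunk_length
        self._chunks = {}                  # chunk index -> captured bytes
        self._seekable = stream.seekable()
        self._next_sequential = 0          # next chunk to pull from a forward-only stream
        self.bytes_read = 0                # bytes actually pulled from the stream

    def _read_chunk(self, n):
        # Keep reading until the chunk is full or the stream ends.
        parts, remaining = [], self._chunk_length
        while remaining > 0:
            data = self._stream.read(remaining)
            if not data:
                break
            parts.append(data)
            remaining -= len(data)
        chunk = b"".join(parts)
        self._chunks[n] = chunk
        self.bytes_read += len(chunk)

    def _load_chunk(self, n):
        if n not in self._chunks:
            if self._seekable:
                # Seekable: jump straight to n * chunk_length.
                self._stream.seek(n * self._chunk_length)
                self._read_chunk(n)
            else:
                # Forward-only: read every chunk up to and including n.
                while self._next_sequential <= n:
                    self._read_chunk(self._next_sequential)
                    self._next_sequential += 1
        return self._chunks[n]

    def get_bytes(self, index, count):
        """Return `count` bytes starting at global `index`, spanning chunks if needed."""
        out = bytearray()
        while count > 0:
            n, offset = divmod(index, self._chunk_length)
            piece = self._load_chunk(n)[offset:offset + count]
            if not piece:
                raise EOFError("requested range extends past end of stream")
            out += piece
            index += len(piece)
            count -= len(piece)
        return bytes(out)


class ForwardOnly:
    """Demo helper (assumption): a non-seekable wrapper, like a network stream."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def seekable(self):
        return False

    def read(self, n):
        return self._inner.read(n)
```

With a seekable source, asking for a few bytes deep in the file pulls in only the chunk that contains them. With `ForwardOnly`, the same request has to read every chunk before it, but each chunk is captured so later indexed reads never need to go backwards in the stream.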

There are probably other things to say, but I'll skip to some stats. I ran the current metadata-extractor-dotnet FileProcessor on the images library and got back this:

Processed 1,359 files (read 252,143,934 of 2,061,394,286 bytes) with 228 exceptions and 594 file errors
Completed in 11,023.44 ms

The new code returned this:

Processed 1,359 files (read 88,054,811 of 2,061,394,286 bytes) with 214 exceptions and 593 file errors
Completed in 5,191.71 ms

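Bytes read and processing time are much improved: about 65% fewer bytes read (88,054,811 vs. 252,143,934) and about 53% less time (5,191.71 ms vs. 11,023.44 ms). Please let me know what you think. Again, the code is in PR #130.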

@kwhopper
Collaborator Author

kwhopper commented May 4, 2018

Related to:
#35, #36, #62, #91, #122, #125
and possibly others
