RandomAccessStream + ReaderInfo #131

kwhopper · 2018-05-04T15:35:54Z

Would like some review and comments on #130.

This library has had numerous issues stemming from its dependence on byte arrays for internal processing. Segments are generally read into a byte buffer so that tags and values can be parsed out, and any sub-IFDs or nested segments are read into further byte arrays. As a result, the global and relative positions of those bytes have been hard to track. We have tried adding things like Offset to JpegSegment, but the meaning of that value can get lost. This makes offset-based processing, such as saving out thumbnails, difficult, and makes it virtually impossible to get the location accuracy needed for writing metadata back.

One possible solution occurred to me while looking through the ExifTool source code. While it does use a ton of byte arrays, the properties of the current operation are stored in an object that tells downstream processors where to start and what the limits are. The new ReaderInfo class brings that general idea to this library. It has these properties:

- Always knows its own global StartPosition inside the stream or array
- Tracks its own Length
- Supports index- and sequential-based processing by tracking LocalPosition relative to StartPosition
- GlobalPosition is always known by adding StartPosition to LocalPosition
- Can be cloned with the current properties, relative to the current position, etc. Even if the new instance has the exact same properties, it is independent from that point forward
- The need for byte buffers downstream is greatly reduced if this object is passed along
- Doesn't know or care how the bytes are read; it defers that to a RandomAccessStream
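
To make the properties above concrete, here is a minimal sketch in Python rather than the library's C#. It is not the code from PR #130: the method names (get_byte, read_bytes, clone) and the use of a plain bytes object in place of a RandomAccessStream are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReaderInfo:
    """Hypothetical window over a byte source; not the PR's actual API."""
    data: bytes              # stands in for a RandomAccessStream (assumption)
    start_position: int      # global offset where this window begins
    length: int              # number of bytes visible through this window
    local_position: int = 0  # cursor, relative to start_position

    @property
    def global_position(self) -> int:
        # Always derivable: StartPosition + LocalPosition.
        return self.start_position + self.local_position

    def get_byte(self, index: int) -> int:
        """Index-based read; the cursor does not move."""
        if not 0 <= index < self.length:
            raise IndexError("index outside this window")
        return self.data[self.start_position + index]

    def read_bytes(self, count: int) -> bytes:
        """Sequential read; advances the cursor by count."""
        if count < 0 or self.local_position + count > self.length:
            raise IndexError("read would leave this window")
        begin = self.global_position
        self.local_position += count
        return self.data[begin:begin + count]

    def clone(self, offset: int = 0, length: Optional[int] = None) -> "ReaderInfo":
        """New, independent window starting offset bytes past the cursor."""
        new_start = self.global_position + offset
        end = self.start_position + self.length
        if length is None:
            length = end - new_start  # default: rest of the parent window
        if new_start < self.start_position or length < 0 or new_start + length > end:
            raise IndexError("clone would leave the parent window")
        return ReaderInfo(self.data, new_start, length)


segment = ReaderInfo(data=bytes(range(32)), start_position=10, length=16)
segment.read_bytes(4)                    # parent cursor: local 4, global 14
sub = segment.clone(offset=2, length=6)  # child window starts at global 16
sub.read_bytes(2)                        # moves only the child's cursor
```

In this sketch the child window reports global offsets directly, so a downstream processor never needs a copied byte array or a separately tracked Offset value to know where it is in the file.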

The last one brings up the other piece that makes this possible. RandomAccessStream is intended to replace all other readers in the current library. It is largely a buffered, indexed, capturing reader implementation, but it takes into account whether the stream it wraps supports seeking. It reads in chunks of a fixed length (to be made configurable in the future). If the stream is seekable, the reader seeks to index n * chunk length and reads that chunk. If the stream isn't seekable and a chunk beyond the first is requested, all chunks needed to reach the requested chunk are read first (desirable for forward-only network streams, etc.).
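
The chunking behaviour described above can be sketched roughly as follows, again in Python and not taken from PR #130. The class name ChunkedReader, the get_bytes method, the bytes_read counter and the 1024-byte chunk length are all hypothetical, and the sketch assumes the wrapped stream returns a full chunk from read() except at end of stream.

```python
import io


class ChunkedReader:
    """Hypothetical sketch of a buffered, indexed, capturing reader."""

    def __init__(self, stream, chunk_length=4096):
        self._stream = stream
        self._chunk_length = chunk_length
        self._chunks = {}          # chunk index -> captured bytes
        self._next_sequential = 0  # next unread chunk on a forward-only stream
        self.bytes_read = 0        # total bytes pulled from the wrapped stream

    def _store(self, n):
        chunk = self._stream.read(self._chunk_length)
        self._chunks[n] = chunk
        self.bytes_read += len(chunk)

    def _load_chunk(self, n):
        if n not in self._chunks:
            if self._stream.seekable():
                # Seekable: jump straight to n * chunk length.
                self._stream.seek(n * self._chunk_length)
                self._store(n)
            else:
                # Forward-only: read every chunk up to and including n.
                while self._next_sequential <= n:
                    self._store(self._next_sequential)
                    self._next_sequential += 1
        return self._chunks[n]

    def get_bytes(self, index, count):
        """Return count bytes starting at global index, crossing chunks if needed."""
        out = bytearray()
        while count > 0:
            n, offset = divmod(index, self._chunk_length)
            piece = self._load_chunk(n)[offset:offset + count]
            if not piece:
                raise EOFError("requested range extends past end of stream")
            out += piece
            index += len(piece)
            count -= len(piece)
        return bytes(out)


class ForwardOnlyStream(io.BytesIO):
    """Test double that reports itself as non-seekable."""

    def seekable(self):
        return False


data = bytes(range(256)) * 64  # 16,384 bytes of sample data
seekable = ChunkedReader(io.BytesIO(data), chunk_length=1024)
forward = ChunkedReader(ForwardOnlyStream(data), chunk_length=1024)
seekable.get_bytes(10000, 4)  # touches only the chunk holding index 10000
forward.get_bytes(10000, 4)   # must first capture every earlier chunk
```

Under these assumptions the seekable reader pulls a single 1,024-byte chunk for that request, while the forward-only reader pulls ten chunks. Either way, captured chunks stay available for later indexed reads.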

There is probably more to say, but I'll skip to some stats. I ran the current metadata-extractor-dotnet FileProcessor on the images library and got back this:

Processed 1,359 files (read 252,143,934 of 2,061,394,286 bytes) with 228 exceptions and 594 file errors
Completed in 11,023.44 ms

The new code returned this:

Processed 1,359 files (read 88,054,811 of 2,061,394,286 bytes) with 214 exceptions and 593 file errors
Completed in 5,191.71 ms

Bytes read and processing time are much improved: bytes read dropped by about 65% and processing time by about 53%. Please let me know what you think. Again, the code is in PR #130.

kwhopper · 2018-05-04T15:37:51Z

Related to:
#35, #36, #62, #91, #122, #125
and possibly others

kwhopper mentioned this issue Sep 9, 2018

Initial port of RandomAccessStream/ReaderInfo drewnoakes/metadata-extractor#361

Open

kwhopper mentioned this issue May 8, 2023

Fix Exif thumbnail offset #333

Merged
