This library has had numerous issues because of a dependence on byte arrays for internal processing. Segments are generally read into a byte buffer for parsing out tags and values, and any sub-IFDs or segments are read into other byte arrays. Because of this, the global/relative position of those bytes has been hard to track. We have tried adding things like Offset to JpegSegment, but the meaning of that value can get lost. This makes it difficult to do offset-based processing, like saving out thumbnails, and makes it virtually impossible to have the location accuracy needed for writing metadata back.
One possible solution occurred to me while looking through the ExifTool source code. While it does use a ton of byte arrays, the properties of the current operation are stored in an object that tells processors downstream where to start and what their bounds are. The new ReaderInfo class brings that general idea to this library. It has these properties (a rough sketch follows the list):
Always knows its own global StartPosition inside the stream or array
Tracks its own Length
Supports both indexed and sequential processing by tracking a LocalPosition relative to StartPosition
GlobalPosition is always known by adding StartPosition to LocalPosition
Can be cloned (with the current properties, relative to the current position, etc.); even if the new instance has the exact same properties, it is independent from that point forward
The need for byte buffers downstream is greatly reduced if this object is passed along
Doesn't know or care how the bytes are actually read, deferring instead to a RandomAccessStream
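To make that concrete, here is a rough, hypothetical sketch of the idea. The member names and the GetByte call on RandomAccessStream are illustrative only, not the exact API in PR #130:

```csharp
// Hypothetical sketch of the ReaderInfo idea -- names are illustrative,
// not the exact API in PR #130.
public class ReaderInfo
{
    private readonly RandomAccessStream _stream;

    public ReaderInfo(RandomAccessStream stream, long startPosition, long length)
    {
        _stream = stream;
        StartPosition = startPosition;
        Length = length;
    }

    // Global offset of this reader's window within the underlying stream.
    public long StartPosition { get; }

    // Length of the window this reader is allowed to see.
    public long Length { get; }

    // Sequential read position, relative to StartPosition.
    public long LocalPosition { get; private set; }

    // The absolute position is always recoverable.
    public long GlobalPosition => StartPosition + LocalPosition;

    // Indexed access: read relative to StartPosition without moving LocalPosition.
    // Assumes RandomAccessStream exposes a GetByte(position) entry point.
    public byte GetByte(long index) => _stream.GetByte(StartPosition + index);

    // Sequential access: read at LocalPosition and advance.
    public byte GetByte()
    {
        var value = _stream.GetByte(GlobalPosition);
        LocalPosition++;
        return value;
    }

    // Clone a child window relative to the current position. The clone shares
    // the stream but tracks its own positions independently from here on.
    public ReaderInfo Clone(long offset, long length) =>
        new ReaderInfo(_stream, GlobalPosition + offset, length);
}
```

A sub-IFD processor handed one of these needs no byte buffer of its own: it reads through its window, and its clone's StartPosition already encodes the global offset needed later for thumbnail extraction or write-back.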
That last property brings up the other piece that makes this possible. RandomAccessStream is intended to replace all of the other readers in the current library. It's largely a buffered, indexed, capturing reader implementation, but it takes into account whether the stream it wraps supports seeking. It reads in chunks of a fixed (eventually configurable) length. If the stream is seekable, the reader seeks to the n * chunkLength index and reads. If the stream isn't seekable and a chunk beyond the first is requested, all chunks necessary to reach the desired chunk are read first (desirable for forward-only network streams, etc.).
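A minimal sketch of that chunking strategy, assuming a simple GetByte(position) entry point and a fixed chunk length; the real implementation in PR #130 will differ in its details:

```csharp
using System.Collections.Generic;
using System.IO;

// Sketch of the chunked, capturing read strategy described above.
public class RandomAccessStream
{
    private const int ChunkLength = 4096; // fixed here; configurable in the future
    private readonly Stream _stream;
    private readonly Dictionary<long, byte[]> _chunks = new Dictionary<long, byte[]>();
    private readonly bool _canSeek;
    private long _nextUnbufferedChunk; // next chunk index for forward-only streams

    public RandomAccessStream(Stream stream)
    {
        _stream = stream;
        _canSeek = stream.CanSeek;
    }

    public byte GetByte(long position)
    {
        long chunkIndex = position / ChunkLength;
        EnsureChunk(chunkIndex);
        return _chunks[chunkIndex][(int)(position % ChunkLength)];
    }

    private void EnsureChunk(long chunkIndex)
    {
        if (_chunks.ContainsKey(chunkIndex))
            return; // already captured; no re-read needed

        if (_canSeek)
        {
            // Seekable: jump straight to n * ChunkLength and read one chunk.
            _stream.Seek(chunkIndex * ChunkLength, SeekOrigin.Begin);
            _chunks[chunkIndex] = ReadChunk();
        }
        else
        {
            // Forward-only (e.g. network streams): read and capture every chunk
            // up to and including the requested one.
            while (_nextUnbufferedChunk <= chunkIndex)
                _chunks[_nextUnbufferedChunk++] = ReadChunk();
        }
    }

    private byte[] ReadChunk()
    {
        var buffer = new byte[ChunkLength];
        int total = 0;
        while (total < ChunkLength)
        {
            int read = _stream.Read(buffer, total, ChunkLength - total);
            if (read == 0) break; // end of stream; final chunk may be short
            total += read;
        }
        return buffer;
    }
}
```

Because only the chunks that are actually touched get read (in the seekable case), a file whose metadata lives near the front never pulls in the bulk of its image data, which is where the "bytes read" savings below come from.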
There's probably more to say, but I'll skip to some stats. I ran the current metadata-extractor-dotnet FileProcessor over the images library and got back this:
```
Processed 1,359 files (read 252,143,934 of 2,061,394,286 bytes) with 228 exceptions and 594 file errors
Completed in 11,023.44 ms
```
The new code returned this:
```
Processed 1,359 files (read 88,054,811 of 2,061,394,286 bytes) with 214 exceptions and 593 file errors
Completed in 5,191.71 ms
```
Bytes read and processing time are much improved: roughly 65% fewer bytes read, and less than half the processing time. Please let me know what you think. Again, the code is in PR #130.
I'd like some review and comments on #130.