Allow for streaming operation #122
Hey James, Thanks for the report and the research. You're right, reading the length is a problem. I think this is the cause of #111 too. The line you highlight is only there to work around rare and invalid images. It seems wrong to pay this cost for valid images. However, there are several usages of the length throughout the code. Running it across the sample image library, we see the following tally for times that reading the length was required:

1. This is actually a duplicate of L184, which is why it's never hit. I'll look at tidying that up a bit.

The cost of ignoring these validation checks could be an out-of-bounds error, an out-of-memory error, or just garbage being produced in the output. It's the inability to filter garbage that makes disabling validation a problem, to my mind. These numbers are just for TIFF too (Exif, TIFF, raw files, ...) and there are other usages of the length elsewhere in the library.

Unless I'm missing a creative solution, it just isn't possible to reliably process TIFF data in a streaming fashion. It would be possible to lay out TIFF data so that it were possible; in practice, however, the TIFF standard makes no such guarantees about layout. In your case, can you work around the issue by providing the length via your Stream wrapper?

As an aside: the library's
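To see why reading the length is the costly part: on a forward-only source, the only way to learn the total length is to drain and buffer the remainder of the stream. A simplified illustration of that behaviour (my sketch, not the library's code):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ForwardOnlyLength
{
    // Simplified illustration: on a non-seekable stream, discovering the
    // total length means reading -- and keeping -- every remaining byte,
    // since the data must still be servable to the parser afterwards.
    public static long DrainToFindLength(Stream stream, List<byte[]> retainedChunks)
    {
        long total = 0;
        var chunk = new byte[2048];
        int read;
        while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
        {
            var copy = new byte[read];
            Array.Copy(chunk, copy, read);
            retainedChunks.Add(copy); // buffered so later reads can be served
            total += read;
        }
        return total;
    }
}
```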
Hey Drew, thanks for the thoughtful and prompt reply! If the Length calls ultimately resulted in querying the Stream object's Length property, this might be a non-issue -- I could either use Content-Length, or just return an impossibly large number to ensure those checks pass. However, my existing implementation of Length throws a NotImplementedException, and I'm not seeing that get thrown -- perhaps MetadataExtractor is swallowing that exception somewhere and falling back to seeking to the end of the stream to determine its length?

Regarding processing TIFF data in streaming mode, I agree that it's not possible in the general case -- TIFF's IFD offsets can point anywhere. However, I submit that it's possible on most TIFF files you see in practice (annoyingly not all, but I'll return to that). I've implemented a streaming preview extractor with a minimal TIFF IFD parser that works like so:
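A sketch of such a forward-only IFD walk, reconstructed from the description in this thread rather than taken from the actual extractor. It assumes little-endian ("II") data and that IFD offsets only point forward, which, as noted below, holds for many camera files but is not guaranteed by the TIFF spec:

```csharp
using System;
using System.IO;

static class StreamingTiffWalk
{
    public static void WalkIfds(Stream stream)
    {
        var reader = new BinaryReader(stream);
        long position;

        // Header: byte order (2), magic 42 (2), offset of IFD 0 (4).
        var byteOrder = reader.ReadBytes(2);
        if (byteOrder[0] != (byte)'I' || byteOrder[1] != (byte)'I')
            throw new NotSupportedException("Sketch handles little-endian TIFF only.");
        if (reader.ReadUInt16() != 42)
            throw new InvalidDataException("Not a TIFF file.");
        uint nextIfdOffset = reader.ReadUInt32();
        position = 8;

        while (nextIfdOffset != 0)
        {
            // Streaming only works while offsets move forwards.
            if (nextIfdOffset < position)
                throw new InvalidDataException("IFD behind read position; cannot stream.");

            SkipForward(stream, nextIfdOffset - position);
            position = nextIfdOffset;

            // Each IFD: entry count (2), then count x 12-byte entries,
            // then the offset of the next IFD (4). A real preview extractor
            // would decode each entry (tag, type, count, value/offset)
            // instead of skipping them.
            ushort entryCount = reader.ReadUInt16();
            SkipForward(stream, entryCount * 12L);
            position += 2 + entryCount * 12L;

            nextIfdOffset = reader.ReadUInt32();
            position += 4;
        }
    }

    private static void SkipForward(Stream stream, long count)
    {
        // Read-and-discard, so no backwards seek is ever needed.
        var buffer = new byte[4096];
        while (count > 0)
        {
            int read = stream.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
            if (read <= 0) throw new EndOfStreamException();
            count -= read;
        }
    }
}
```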
In practice, most camera makers seem to opt to put the metadata at the start of the file so that you can do things like display thumbnails without having to load the entire raw file. However, that's a layer of complexity that I don't think MetadataExtractor needs to worry about. The Stream interface defines seeking, reading and length operations, which can all be implemented on top of such a buffering construct. It seems like the root issue might just be that Length on IndexedCapturingReader does not forward to the Stream instance it wraps. Perhaps something like this on IndexedCapturingReader would work:

```csharp
public override long Length
{
    get
    {
        try
        {
            // Prefer the wrapped stream's own length when it can report one.
            return _stream.Length;
        }
        catch (NotSupportedException)
        {
            // Fall back to the old behaviour: buffer to the end of the
            // stream so that the total length becomes known.
            IsValidIndex(int.MaxValue, 1);
            Debug.Assert(_isStreamFinished);
            return _streamLength;
        }
    }
}
```

What do you think?
…echanism if the stream doesn't support the Length operation. Addresses drewnoakes#122 by allowing for streamed parsing of metadata.
I tried this out and it seems to work. Just sent a PR with exactly that change. I tried to run the test suite, but got stuck building it with the following message:
Thanks for the PR. Left one comment to consider. As for the build failure: what is the output of the following command on your machine?

```
$ dotnet --info
.NET Command Line Tools (2.1.4)

Product Information:
 Version:            2.1.4
 Commit SHA-1 hash:  5e8add2190

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.16299
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\2.1.4\

Microsoft .NET Core Shared Framework Host

  Version : 2.0.5
  Build   : 17373eb129b3b05aa18ece963f8795d65ef8ea54
```
Argh, for some reason rebuilding as part of the solution I was using for testing now fails with the same message.
For reference, my dotnet binary seems to have come from Microsoft's dotnet-host package:
For reference, when I tried to build the test program, I did so using
I note that I'm on an older version of dotnet than you, so it's possible this is all a symptom of my updateophobia. I'll apply all updates and see if that changes things.
This link has install instructions for Microsoft's official SDK package for Ubuntu 17.10. I believe
Got it. Looks as though many of my sources were disabled in my last OS update. I've re-enabled everything (including the repos listed on that page). Doing a system upgrade now and will then install the dotnet-sdk package. Will report back once that's all done.
Ok, so the
However, when I run msbuild, I now get a different set of errors (633 of them, in fact), which suggest the compiler can't find the standard libraries. E.g.:
There are similar errors for Linq and Dictionary among other fairly common types. I'm invoking msbuild like so, inspired by Build.ps1:
Not sure what's going on.
Ok, so the mystery of msbuild failing is one I'll need to investigate further. However, I can now build MetadataExtractor.Tools.FileProcessor from my IDE (Rider). Running the tests before and after this change shows that nothing new is broken. I've updated my branch with the requested change.
…echanism if the stream doesn't support the Length operation. Addresses #122 by allowing for streamed parsing of metadata.
I wish I could be more help with the build problem, but it's not something I have a tonne of familiarity with. If you do discover the problem/fix, please post it here for others too. Cheers again for the PR. I think this should make a difference to those using
Randomly restarting things after updating eventually got me to a place where I could build, which is not terribly satisfying. However, I no longer have any issues here, so I'm going to go ahead and close this.
I may have been too hasty in closing this issue. Leaving the build issues aside (I can no longer repro them), it seems that there are some other ways that the library ends up reading the entire stream. It seems as though it has been engineered not to assume that streams are seekable (or perhaps to assume that seeking may be costly). Perhaps the best solution would be to allow the caller to pass in an instance of IndexedReader directly, rather than having one constructed when you pass in a Stream object? That way the caller could make the trade-offs appropriate for their problem. However, that would require a fairly substantial interface addition: a new method on all of the *MetadataReader classes. So the two options I see for resolving this are:
I'll understand if this isn't something you want built into your library. It's kind of a must-have feature for me, so I'd end up maintaining a fork in that case (which is totally fine). I do think this is a valuable addition though, so I'm definitely interested in figuring out a way to build it into your tree if you're open to it. Thoughts?
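To make the first option concrete, the shape being proposed might look roughly like this. This is a hypothetical sketch; MetadataExtractor exposes no such overload today, and the interface name is mine:

```csharp
using System.Collections.Generic;
using System.IO;
using MetadataExtractor.IO;
using Directory = MetadataExtractor.Directory;

// Hypothetical shape only -- not an API the library provides. The idea is
// that each *MetadataReader gains an entry point accepting a
// caller-constructed IndexedReader, so the caller (not the library) decides
// how buffering and seeking trade off for its data source.
public interface IMetadataReaderProposal
{
    // Existing style: the library constructs an IndexedCapturingReader internally.
    IReadOnlyList<Directory> ReadMetadata(Stream stream);

    // Proposed addition: the caller supplies its own reader.
    IReadOnlyList<Directory> ReadMetadata(IndexedReader reader);
}
```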
While it won't be immediately helpful in this case, I've been working from time to time on a version of MetadataExtractor that has a single reader for all cases. It assumes all Streams are seekable from the usage side, but if the underlying Stream isn't then it will use buffering mitigation inside the reader, similar to the indexed reader. It also allows for other interesting features if we want them, such as easily recording the exact index of a segment or tag.

Maybe it will work, maybe not - but it's a simpler reading model for sure and might be a basis for further revision to address this issue. In your case, if your stream wrapper inherits from Stream and reports that it's seekable, then you would be able to pass it as the source of data and no extra buffering would be needed.

The buffering part of that is not complete yet, although the seekable reader code is working for the basic Jpeg, Exif, and Tiff reader classes already in the library. I'll post it at some point as a brand new branch to get some feedback. I won't even try to move everything into it until the buffering portion is working.
Allows verifying how far through the FileStream we seeked during extraction. Used in researching #122
@j4m3z0r you're right, there's more we can do here. Sounds like @kwhopper has some promising inroads underway here too. Two issues come to mind:
On the first point, I ran some experiments, modifying the reader to request fewer bytes. The results look good. For 1,357 files (2,058,362,410 bytes):

- Before: 251,282,085 bytes read
- After: 42,560,067 bytes read

That's a big difference, though there are some large raw files in there that might make up the bulk of it; I haven't broken it down by file type. Still, it can't hurt if you're trying to reduce the total amount read from the stream.

What was interesting too was that performance actually went down for the entire scan by about 20%. Before:
After:
These runs were against files on a reasonably snappy SSD, so other sources (network/slower disks) might see an improvement. I think the reduction in performance can be attributed to reduced buffering: loading data into memory in chunks before using it can't hurt. However, I haven't profiled this, so it's pure speculation really. With this change there's definitely less allocation/pressure on the GC. I think I'll commit and push them.

The current situation with different reader classes was inherited from the original implementation in Java. Java has a different model of streams, and I was also a much less experienced developer 15 years ago. There's room for improvement in both language implementations. The .NET version could likely take better advantage of the SDK's APIs. And on that topic, there are some very interesting changes coming to .NET that'll allow some nice abstractions over memory (Span<T> and friends).

I'll push up the change that causes fewer bytes to be read. Test it out if you get a chance.
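The buffering trade-off in miniature (an illustration of the general point, not the library's code):

```csharp
// Illustration only: rounding a read request up to a whole chunk trades
// extra bytes read for fewer, cheaper read calls -- one plausible reason
// the less-buffered version scanned slower despite reading less data.
static int RoundUpToChunk(int requestedBytes, int chunkSize = 2048)
{
    return ((requestedBytes + chunkSize - 1) / chunkSize) * chunkSize;
}
```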
Ok, that didn't work at all. I was focussing on the wrong thing. I will have to think about this another time. I've rolled back master for now.
So I just put together a PR that implements this and works for what I need, though it is not without some compromises. See here: I've done it so that Streams that don't support Length or Seeking fall back to (an approximation of) the original algorithm of pre-fetching the whole file. On my system, running the FileProcessor test suite per the contribution instructions takes about 2000ms now, whereas it was 1300ms before. I spent some time trying to recover the lost performance, but I wasn't able to get it to be as fast as the original version. Some interesting observations:
I also ended up removing _isStreamFinished, and there were a few other tweaks. I don't think I made it any less correct, and the test suite output is unchanged. Anyway, the impact for me is pretty huge: I can now get RAW file metadata with a few hundred KB of network traffic as opposed to tens of MB. Take a look and see what you think.
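The fallback described above implies a capability probe along these lines (my own sketch, not the PR's code; the NotImplementedException case matches the network wrapper discussed earlier in this thread):

```csharp
using System;
using System.IO;

static class StreamCapabilities
{
    // True when the stream can serve random access directly. Some wrappers
    // report CanSeek yet still throw from Length, so both are probed before
    // choosing the random-access code path.
    public static bool SupportsRandomAccess(Stream stream)
    {
        if (!stream.CanSeek)
            return false;

        try
        {
            long _ = stream.Length;
            return true;
        }
        catch (Exception e) when (e is NotSupportedException || e is NotImplementedException)
        {
            return false; // caller should fall back to buffer-everything mode
        }
    }
}
```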
I'm a bit confused by GitHub's messages about what can and can't be merged, so I'm unsure if the HEAD of my branch is the appropriate one to look at. To allay any confusion, this is the pertinent changeset:
Don't worry too much about the conflicts. I can sort that out manually. I'm guessing it happened because your PR #123 was from your master branch, but when I merged it I rebased it. So then your master and the upstream master were out of sync before you started the next PR. But if you're going to spend time on anything, it'd be more useful to spend it on the unit tests.
I just started taking another look at this, but I'm running into trouble building the project, and I'm getting the same issue on a clean checkout of master. Here's the error I'm seeing:
I'll keep poking at this, but if this is a known issue with an easy fix, I'd love to hear about it. This is on Visual Studio for Mac, which updated itself just this morning. Cleaning the solution and rebuilding doesn't seem to change anything.
Ok, I found a workaround: delete the reference to the XmpCore.StrongName package. I also had to change the version of the test project to .NET Core 2.0. I'll omit both these changes once I get to putting together a PR.
…s repeats the IndexedCapturingReaderTests, for each of the permutations of seekable and length supported.
Ok, I just pushed some changes to the PR I made:
I've actually not used xunit before, so I hope I got that right -- let me know if not. The integration with VS is pretty neat! :)
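For reference, permutation tests of this sort could take roughly the following shape in xunit. This is a sketch, not the actual test code from the PR; it assumes IndexedCapturingReader's Stream-only constructor and IndexedReader.GetByte(int), both of which the library provides:

```csharp
using System;
using System.IO;
using MetadataExtractor.IO;
using Xunit;

// Sketch: a decorator that selectively disables Seek/Length, so one test body
// can cover all four capability permutations. The PR's real tests may differ.
public sealed class RestrictedStream : Stream
{
    private readonly Stream _inner;
    private readonly bool _canSeek;
    private readonly bool _hasLength;

    public RestrictedStream(Stream inner, bool canSeek, bool hasLength)
    {
        _inner = inner;
        _canSeek = canSeek;
        _hasLength = hasLength;
    }

    public override bool CanRead => true;
    public override bool CanSeek => _canSeek;
    public override bool CanWrite => false;
    public override long Length => _hasLength ? _inner.Length : throw new NotSupportedException();
    public override long Position
    {
        get => _inner.Position;
        set => _inner.Position = _canSeek ? value : throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count) => _inner.Read(buffer, offset, count);
    public override long Seek(long offset, SeekOrigin origin) =>
        _canSeek ? _inner.Seek(offset, origin) : throw new NotSupportedException();
    public override void Flush() { }
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}

public class IndexedCapturingReaderPermutationTests
{
    [Theory]
    [InlineData(true, true)]
    [InlineData(true, false)]
    [InlineData(false, true)]
    [InlineData(false, false)]
    public void GetByteWorksUnderAllCapabilityPermutations(bool canSeek, bool hasLength)
    {
        var data = new byte[] { 1, 2, 3, 4 };
        using (var stream = new RestrictedStream(new MemoryStream(data), canSeek, hasLength))
        {
            var reader = new IndexedCapturingReader(stream);
            Assert.Equal((byte)3, reader.GetByte(2)); // index 2 holds the value 3
        }
    }
}
```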
Hi there, I'm trying to use MetadataExtractor on an input that is being streamed over the network. I've implemented a System.IO.Stream subclass that reads data from the network and buffers data that has already been read to allow seeking to work.
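Such a wrapper might look something like the following. This is an illustrative sketch under my own assumptions, not the reporter's implementation; in particular it reports an unknown Length, which is the behaviour discussed in the comments above:

```csharp
using System;
using System.IO;

// Buffers everything read so far from a non-seekable source, so callers can
// seek backwards into already-seen data. Forward seeks past the buffered
// region would need to read-and-discard (omitted for brevity).
public sealed class CachingReadStream : Stream
{
    private readonly Stream _source;               // e.g. a network response stream
    private readonly MemoryStream _cache = new MemoryStream();
    private long _position;

    public CachingReadStream(Stream source) => _source = source;

    public override bool CanRead => true;
    public override bool CanSeek => true;          // seeking is served from the cache
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException(); // unknown up front
    public override long Position { get => _position; set => _position = value; }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Pull from the source until the cache covers the requested range.
        while (_cache.Length < _position + count)
        {
            var chunk = new byte[8192];
            int read = _source.Read(chunk, 0, chunk.Length);
            if (read == 0) break;                  // source exhausted
            _cache.Seek(0, SeekOrigin.End);
            _cache.Write(chunk, 0, read);
        }

        _cache.Seek(_position, SeekOrigin.Begin);
        int copied = _cache.Read(buffer, offset, count);
        _position += copied;
        return copied;
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        switch (origin)
        {
            case SeekOrigin.Begin:   _position = offset;  break;
            case SeekOrigin.Current: _position += offset; break;
            default: throw new NotSupportedException("End-relative seeks need a known length.");
        }
        return _position;
    }

    public override void Flush() { }
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```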
Unfortunately, MetadataExtractor seems to read in the entire file before processing any of the metadata. It looks like it might be this line that is the culprit:
https://github.com/drewnoakes/metadata-extractor-dotnet/blob/master/MetadataExtractor/Formats/Tiff/TiffReader.cs#L67
I note that there is already a comment there about this exact issue with a TODO on it. Is there some way to fix this without breaking anything? Alternatively, how about adding an option to allow for stream-based processing and just skip that check when it's set?
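Paraphrasing the check at that line (not the exact source): the reader's total length is consulted to validate the first IFD offset, and on a non-seekable stream the only way to know the total length is to read to the end.

```csharp
// Paraphrase of the check at the linked line -- not the exact source.
// Touching reader.Length is what forces a non-seekable stream to be read
// to the end: an index-capturing reader can only learn the total length
// by buffering every remaining byte.
if (firstIfdOffset >= reader.Length - 1)
{
    // A malformed file put the first IFD offset past the end of the data;
    // fall back to the position just after the 8-byte TIFF header.
    firstIfdOffset = 2 + 2 + 4;
}
```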