Use mimetype detection as backup when extension not present #121
base: master
Also, the reason I took all the test files and put them in a |
I took a crack at this in #89 and @frbapolkosnik attempted something similar in #99. I certainly like the idea of doing something like this a lot. When I was experimenting with #89, I was really disappointed at how poorly the mimetype guessing worked in practice (not unlike what you discovered). In your case, you mentioned that "attachments that are not saved with the proper filename extension". That is particularly troubling because it would mean that I wonder if it would make sense to open up a separate API endpoint like |
Somehow I completely missed both of those PRs, oops! Obviously it's your call, but the thing I like about textract's design is that it solves the 80% case and deals with triaging for the user. What do you think about the method on which I settled, in which we use the extension if it exists and otherwise fall back to mimetype detection? Using this method in my personal project, I was able to detect 100% of the Word / PDF documents I was downloading with Scrapy that had no extensions attached. So I think overall users who have extensionless files will still see a qualitative improvement in the results that they get. |
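A minimal sketch of the extension-first, detection-as-fallback method described above. The names `choose_parser`, `EXTENSION_PARSERS`, `MIMETYPE_PARSERS`, and `sniff_signature` are hypothetical and are not textract's actual API. The byte-signature sniffer is only a stand-in for a real detector such as python-magic, and the mimetype-to-parser mapping is an assumption.

```python
import os

# Hypothetical lookup tables; textract's real parser registry differs.
EXTENSION_PARSERS = {".pdf": "pdf", ".docx": "docx", ".doc": "doc", ".txt": "txt"}
MIMETYPE_PARSERS = {
    "application/pdf": "pdf",
    "application/msword": "doc",   # assumption: OLE2 containers are Word docs
    "application/zip": "docx",     # assumption: ZIP containers are docx files
}

# A few well-known leading byte signatures (PDF, ZIP, OLE2 compound file).
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/msword"),
]


def sniff_signature(path):
    """Guess a mimetype from the first bytes of the file, or return None."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic_bytes, mimetype in SIGNATURES:
        if head.startswith(magic_bytes):
            return mimetype
    return None


def choose_parser(path, sniff=sniff_signature, default="txt"):
    """Use the extension if present; otherwise fall back to content sniffing."""
    ext = os.path.splitext(path)[1].lower()
    if ext:
        return EXTENSION_PARSERS.get(ext, default)
    return MIMETYPE_PARSERS.get(sniff(path), default)
```

Passing the detector in as `sniff` keeps the fallback swappable, so a weak guesser can be replaced without touching the extension path.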
@akoumjian I am spectacularly embarrassed by how stale I let this get. Eek... With #138 in progress, it would be great to get this resolved as well if we can. These two PRs certainly complement one another. #138 gives people the ability to specify a parsing method whereas this PR gives us the ability to provide educated guesses based on the mimetype as a fallback method when the extension isn't available (I like this approach a lot!). I have a couple of big picture comments that would be great to resolve to get this merged in:
|
One more thing... based on the conflicting files, I think the merge will go a lot more smoothly if you rebase this off of master. |
I'll try to take a look at it this weekend based on your feedback.
…On Fri, Mar 24, 2017 at 4:09 AM Dean Malmgren ***@***.***> wrote:
One more thing... based on the conflicting files, I think the merge will
go a lot more smoothly if you rebase this off of master.
|
hey @akoumjian, i just thought i'd let you know that #138 was merged into master in v1.6.0. excited to see this come together for the next version of textract :) |
The following code works nicely on my end. I'll try to include this in the textract code.
from . import exceptions
|
Some background: My use case for textract is with web scraping. Often, I will download attachments that are not saved with the proper filename extension. When this happens, textract currently defaults to the txt parser and normally it fails because these are pdfs, word docs, etc.
By adding support for python-magic, which wraps libmagic (the library behind the Unix file command), I am able to successfully guess roughly half of these files. Still not where I want it, but it works. You'll see the tests are not yet passing because of this. However, I wanted to open the PR to discuss whether this is a good addition or not.
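For reference, content-based detection with python-magic might look like the sketch below. `magic.from_file(path, mime=True)` is python-magic's call for getting a MIME type. The wrapper name `guess_mimetype` and the choice to return None when the library is missing or detection fails are assumptions for illustration, not part of this PR.

```python
def guess_mimetype(path):
    """Return a MIME type guessed from the file's contents, or None."""
    try:
        # python-magic; importing it also requires the libmagic shared library.
        import magic
    except ImportError:
        return None
    try:
        return magic.from_file(path, mime=True)
    except Exception:
        # Deliberately broad (an assumption of this sketch): any detection
        # failure just means "no guess", so callers can fall back to a default.
        return None
```

Returning None rather than raising lets a caller treat "could not detect" the same as "no detector installed" and fall through to the default parser.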
I've looked into other file type detection possibilities, and was surprised to discover that there aren't really any good heuristic based file detection libraries out there.
If you think this is a good addition, I'd love suggestions on how to properly test and handle the fact that only about half of the filetypes are detected correctly. I could write very explicit tests per file that simply accept what is currently working and what isn't.
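One way the "explicit tests per file" idea could be sketched is a table that records, for each test file, the expected type and whether detection currently gets it right. The file names, the `KNOWN_RESULTS` table, and `find_changes` are all hypothetical and do not reflect textract's actual test suite or results.

```python
# Hypothetical table: file name -> (expected type, detected correctly today?)
KNOWN_RESULTS = {
    "raw_text": ("txt", True),
    "standardized_pdf": ("pdf", True),
    "legacy_doc": ("doc", False),  # records a known miss instead of hiding it
}


def find_changes(detect, known_results=KNOWN_RESULTS):
    """Return file names whose detection outcome differs from the recorded one.

    `detect` is any callable mapping a file name to a guessed type. A file is
    reported both when a working case breaks and when a known miss starts
    working, so the table has to be updated whenever behaviour changes.
    """
    changed = []
    for name, (expected, works_today) in sorted(known_results.items()):
        detected_ok = detect(name) == expected
        if detected_ok != works_today:
            changed.append(name)
    return changed
```

A test would then assert that `find_changes(detector)` is empty, which passes with the current partial accuracy and still catches regressions.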