fallback python-based .rtf extraction #88
Comments
@deanmalmgren Have a look at https://github.com/brendonh/pyth
Nice find, @pombredanne; thanks for pointing this out! If anyone is interested in taking a crack at this, the pdf parser has a pretty good example for how to have multiple
@deanmalmgren I'll try to get this done in a few.
Awesome, thanks @jpadilla!
For one thing, pyth seems to have issues with charsets on the existing RTF files in textract which will make it harder to test. Might be related to brendonh/pyth#30
@jpadilla indeed, and I ran pyth through the test files from unrtf and submitted a ticket there: brendonh/pyth#34
This is currently using the unrtf command line tool, but it would be nice to have a pure python extraction method as a fallback.
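For anyone picking this up, here is a rough sketch of what a "command line tool first, pure python second" fallback could look like. It is not textract's actual code: the function names (run_command_extractor, extract_rtf) and the python_fallback parameter are hypothetical, made up for illustration. The only outside detail assumed is that unrtf accepts --text to emit plain text.

```python
import shutil
import subprocess


def run_command_extractor(command, path):
    """Run an unrtf-style command line tool and return its text output."""
    # Assumption: the tool accepts --text for plain-text output, as unrtf does.
    result = subprocess.run(
        [command, "--text", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_rtf(path, python_fallback, command="unrtf"):
    """Prefer the command line tool; otherwise use a pure python extractor.

    python_fallback is any callable taking a file path and returning text
    (hypothetical hook; a pyth-based function could be plugged in here).
    """
    if shutil.which(command) is not None:
        try:
            return run_command_extractor(command, path)
        except (OSError, subprocess.CalledProcessError):
            # The tool exists but failed on this file; try the fallback.
            pass
    return python_fallback(path)
```

The fallback is passed in as a callable so the choice of library stays open. Given the charset problems with pyth mentioned in the comments, the pure python path would need its own tests against the existing RTF files before it could be relied on.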