fallback python-based .rtf extraction #88

Open
deanmalmgren opened this issue Jun 23, 2015 · 6 comments

Comments

@deanmalmgren
Owner

This is currently using the unrtf command line tool, but it would be nice to have a pure python extraction method as a fallback.

@pombredanne

@deanmalmgren Have a look at https://github.com/brendonh/pyth

@deanmalmgren
Owner Author

Nice find, @pombredanne; thanks for pointing this out!

If anyone is interested in taking a crack at this, the pdf parser is a pretty good example of how to support multiple methods for extracting text from documents.
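
To make the fallback idea concrete, here is a minimal sketch of trying several extraction methods in order. It is not textract's actual parser code: the names `extract_with_fallback`, `extract_with_unrtf`, `extract_with_python`, and `ExtractionError` are hypothetical, and the pure Python method is left as a placeholder because no such extractor exists in this thread yet. The sketch assumes Python 3.7+ and that `unrtf --text` writes plain text to standard output.

```python
import shutil
import subprocess


class ExtractionError(Exception):
    """Raised when no extraction method succeeds (hypothetical name)."""


def extract_with_unrtf(path):
    """Extract text by shelling out to the unrtf command line tool."""
    if shutil.which("unrtf") is None:
        raise ExtractionError("unrtf is not installed")
    result = subprocess.run(
        ["unrtf", "--text", path],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_with_python(path):
    """Placeholder for a pure Python extractor, e.g. one built on pyth."""
    raise NotImplementedError("pure Python RTF extraction not written yet")


def extract_with_fallback(path, methods):
    """Try each method in order and return the first successful result."""
    errors = []
    for method in methods:
        try:
            return method(path)
        except Exception as exc:
            # Remember why this method failed, then try the next one.
            errors.append(f"{method.__name__}: {exc}")
    raise ExtractionError("; ".join(errors))
```

A caller would pass the methods in order of preference, for example `extract_with_fallback("doc.rtf", [extract_with_unrtf, extract_with_python])`, so the command line tool is used when available and the pure Python method is only tried when it fails.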

@jpadilla

@deanmalmgren I'll try to get this done in a few.

@deanmalmgren
Owner Author

Awesome, thanks @jpadilla!

@jpadilla

For one thing, pyth seems to have issues with charsets on the existing RTF files in textract, which will make it harder to test. This might be related to brendonh/pyth#30

@pombredanne

@jpadilla indeed, and I ran pyth through the test files from unrtf and submitted a ticket there: brendonh/pyth#34
Perhaps this is an opportunity to help make @brendonh's pyth better?
