Extract HOCR from searchable PDF #117
Comments
Quote/Cytat - Stefan Weil <[email protected]> (Tue, 1 Aug 2017,
15:49:41):
Nor do I know such tools. The tool `pdftohtml` can extract XML from
PDF, and there is an [issue for
ocr-fileformats](UB-Mannheim/ocr-fileformat#57) to
convert that XML to hOCR, so that combined tools would do the job.
Or you can convert PDF to DjVu and export hOCR from DjVu with Jakub
Wilk's tools
https://jwilk.net/software/pdf2djvu
https://jwilk.net/software/ocrodjvu
Regards
Janusz
…--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
[email protected], [email protected], http://fleksem.klf.uw.edu.pl/~jsbien/
|
Thanks for your comment! As for ocrodjvu,
it looks like an OCR tool, which doesn't fit my needs. But thank you. |
ocrodjvu is distributed with djvu2hocr, which may be what you need. |
Many thanks @jsbien, pdf2djvu + djvu2hocr works like a charm! |
@jsbien Thank you for your kind reply! Sorry, I didn't see ocrodjvu on its official site... I guess it should work! |
Quote/Cytat - thwfqecj <[email protected]> (Wed 18 Oct 2017
10:58:13 AM CEST):
@jsbien Thank you for your kind reply! Sorry, I didn't see
ocrodjvu on its official site... I guess it should work!
What do you mean by its official site? For me it is
https://jwilk.net/software/ocrodjvu
|
Hey, pdf2djvu -o test.djvu test.pdf |
For a searchable PDF the second step should be skipped, otherwise instead of the original text you get the result of OCR. |
Wow. Thanks a lot. This is really important information. |
I would recommend using the Python package pdftotree to get the hOCR automatically; it's so easy. Get requirements:
Pip the package:
import pdftotree
hocr_result = pdftotree.parse('path/to/your.pdf')
Enjoy. |
I can now say it doesn't work for either a PDF or a DjVu with searchable text coming from GScan2PDF. To get the HOCR from the searchable DjVu just apply djvu2hocr on the djvu and skip ocrodjvu |
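For anyone landing here later, the pdf2djvu + djvu2hocr route described in this thread could be scripted roughly like this. It is only a sketch: it assumes `pdf2djvu` and `djvu2hocr` are installed and on `PATH`, and the helper names `build_commands` and `pdf_to_hocr` are made up for illustration. ocrodjvu is deliberately left out so the original text layer is kept instead of being replaced by fresh OCR.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, djvu_path):
    """Return the two command lines for the PDF -> DjVu -> hOCR route."""
    return [
        # Step 1: convert the searchable PDF to DjVu.
        ["pdf2djvu", "-o", str(djvu_path), str(pdf_path)],
        # Step 2: dump the DjVu text layer as hOCR (djvu2hocr writes to stdout).
        # No ocrodjvu step in between, so the existing text is not re-OCRed.
        ["djvu2hocr", str(djvu_path)],
    ]


def pdf_to_hocr(pdf_path, hocr_path):
    """Run both tools and save the hOCR output to hocr_path."""
    pdf_path = Path(pdf_path)
    djvu_path = pdf_path.with_suffix(".djvu")  # intermediate file next to the PDF
    to_djvu, to_hocr = build_commands(pdf_path, djvu_path)
    subprocess.run(to_djvu, check=True)
    with open(hocr_path, "wb") as out:
        subprocess.run(to_hocr, check=True, stdout=out)
    return Path(hocr_path)
```

Calling `pdf_to_hocr("test.pdf", "test.hocr")` should leave `test.djvu` and `test.hocr` next to the input, provided both tools run without errors.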
Thank you so much for your great work!
But I wonder whether it is possible to extract hOCR from a searchable PDF, I mean, a PDF that already has the hOCR text embedded. I haven't found any tool that does that for me...