Extract HOCR from searchable PDF #117

thwfqecj · 2017-08-01T12:14:04Z

Thank you so much for your great work!

But I wonder whether it is possible to extract hOCR from a searchable PDF, I mean a PDF that is already combined with hOCR text. I haven't found any tool that does that for me...

stweil · 2017-08-01T13:49:40Z

Nor do I know such tools. The tool pdftohtml can extract XML from PDF, and there is an issue for ocr-fileformats to convert that XML to hOCR, so that combined tools would do the job.

jsbien · 2017-08-01T14:14:05Z

Quote - Stefan Weil <[email protected]> (Tue, 1 Aug 2017, 15:49:41):

Nor do I know such tools. The tool `pdftohtml` can extract XML from PDF, and there is an [issue for ocr-fileformats](UB-Mannheim/ocr-fileformat#57) to convert that XML to hOCR, so that combined tools would do the job.

Or you can convert PDF to DjVu and export hOCR from DjVu with Jakub Wilk's tools:

https://jwilk.net/software/pdf2djvu
https://jwilk.net/software/ocrodjvu

Regards,
Janusz

…

-- Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) [email protected], [email protected], http://fleksem.klf.uw.edu.pl/~jsbien/

thwfqecj · 2017-09-05T09:00:35Z

@stweil @jsbien

Thanks for your comment!
I'm trying pdftohtml. What I actually want is to make my searchable PDF slimmer. It seems I have to convert every page to a single HTML file and then back to PDF... But it works!

As for ocrodjvu, its description says:

"ocrodjvu is a wrapper for OCR systems, that allows you to perform OCR on DjVu files."

It looks like an OCR tool, which does not fit my needs. But thank you.

jsbien · 2017-09-05T09:07:25Z

ocrodjvu is distributed with djvu2hocr, which may be what you need.

giancarlobi · 2017-09-12T11:52:00Z

Many thanks @jsbien, pdf2djvu + djvu2hocr works like a charm!

thwfqecj · 2017-10-18T08:58:12Z

@jsbien Thank you for your kind reply! Sorry, I didn't see ocrodjvu on its official site... I guess it should work!

jsbien · 2017-10-18T09:45:08Z

Quote - thwfqecj <[email protected]> (Wed 18 Oct 2017 10:58:13 AM CEST):

@jsbien Thank you for your kind reply! Sorry, I didn't see ocrodjvu on its official site... I guess it should work!

What do you mean by its official site? For me it is https://jwilk.net/software/ocrodjvu

JensHumrich · 2019-01-31T15:19:36Z

Hey,
I stumbled upon this old thread. I can confirm that the solution works...

pdf2djvu -o test.djvu test.pdf
python2 /mnt/mem/temp/ocrodjvu/ocrodjvu test.djvu -o ocrfile
python2 /mnt/mem/temp/ocrodjvu/djvu2hocr ocrfile > output.hocr

jsbien · 2019-01-31T15:43:14Z

For a searchable PDF the second step should be skipped, otherwise instead of the original text you get the result of OCR.
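A minimal Python sketch of the two-step variant described above, with the OCR step left out. It assumes that `pdf2djvu` and `djvu2hocr` are installed and on the `PATH`, and that `djvu2hocr` writes hOCR to standard output as in the commands quoted earlier. The helper names `build_commands` and `pdf_to_hocr` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, djvu_path):
    """Return the two argv lists: PDF -> DjVu, then DjVu -> hOCR on stdout."""
    return [
        ["pdf2djvu", "-o", str(djvu_path), str(pdf_path)],
        ["djvu2hocr", str(djvu_path)],
    ]


def pdf_to_hocr(pdf_path, hocr_path):
    """Export the existing text layer of a searchable PDF as hOCR.

    Assumes pdf2djvu and djvu2hocr are available on PATH.
    """
    pdf_path = Path(pdf_path)
    # Intermediate DjVu file is written next to the input PDF.
    djvu_path = pdf_path.with_suffix(".djvu")
    convert_cmd, export_cmd = build_commands(pdf_path, djvu_path)

    # Step 1: convert the PDF (including its hidden text) to DjVu.
    subprocess.run(convert_cmd, check=True)

    # No ocrodjvu step here: running OCR would replace the original text.
    # Step 2: dump the DjVu text layer as hOCR into the output file.
    with open(hocr_path, "wb") as out:
        subprocess.run(export_cmd, check=True, stdout=out)
```

Calling `pdf_to_hocr("test.pdf", "output.hocr")` would then correspond to the first and third commands from the earlier comment.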

JensHumrich · 2019-01-31T17:05:50Z

Wow. Thanks a lot. This is really important information.

mattdeeperinsights · 2021-10-12T16:05:56Z

I would recommend using Python package pdftotree to get the hocr automatically, it's so easy.

Requirements:

- The latest Java (8+), if you don't already have it
- The latest ImageMagick

Install the package with pip3 install pdftotree, and then it's as simple as this:

import pdftotree
hocr_result = pdftotree.parse('path/to/your.pdf')

Enjoy.

rmast · 2022-01-08T13:32:05Z

Hey, I stumbled upon this old thread. I can confirm that the solution works...

pdf2djvu -o test.djvu test.pdf
python2 /mnt/mem/temp/ocrodjvu/ocrodjvu test.djvu -o ocrfile
python2 /mnt/mem/temp/ocrodjvu/djvu2hocr ocrfile > output.hocr

I can now say that this doesn't work for either a PDF or a DjVu with searchable text coming from GScan2PDF.
The resulting ocrfile is still large and contains the pages that djvu2hocr reports, but the resulting output.hocr contains nothing. The contents of ocrfile look like a DjVu and, when renamed to .djvu, can be opened in a DjVu viewer. They show no hidden text.

To get the hOCR from the searchable DjVu, just apply djvu2hocr to the DjVu and skip ocrodjvu.
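To tell quickly whether an exported file is empty in the sense described above, one option is to count the word elements in it. The sketch below uses only the Python standard library and assumes that words are marked with the hOCR class `ocrx_word`; whether a given djvu2hocr version uses exactly that class is worth checking against its actual output. The name `count_hocr_words` is hypothetical.

```python
from html.parser import HTMLParser


class WordCounter(HTMLParser):
    """Counts elements whose class attribute includes 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if "ocrx_word" in classes:
            self.words += 1


def count_hocr_words(hocr_text):
    """Return the number of word elements found in an hOCR string."""
    parser = WordCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

A count of zero for output.hocr would suggest that the text layer was lost somewhere in the pipeline, for example by running the OCR step on a file that already had hidden text.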

stweil added enhancement question labels Aug 1, 2017

rmast mentioned this issue Jan 8, 2022

Just some other errors with the current version. I can't get the current version to work with a hocr-file coming from pdftree to get out the current searchable text from a PDF internetarchive/archive-pdf-tools#37

Closed
