corrupted data when generating a searchable pdf with hocr-pdf #186

Open
pprw opened this issue Jul 3, 2024 · 11 comments

@pprw

pprw commented Jul 3, 2024

I am trying to generate a searchable pdf from a jpeg file and a hocr file with the help of hocr-pdf.

I have both files in the same folder. Running hocr-pdf . > out.pdf generates a PDF, but I cannot search inside it.

The PDF reader (Evince) reports "some font thing failed" when displaying the file, although I can see the image.

When I extract the text from the PDF:

$ pdf2txt out.pdf -o out.txt
WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data

out.txt contains (excerpt):

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)
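Output like the excerpt above, made up of (cid:N) placeholders, is what pdfminer emits for glyphs it cannot map back to Unicode. A minimal sketch for measuring how much of an extracted text file consists of such placeholders is shown here; the helper name `cid_ratio` is hypothetical and not part of pdfminer or hocr-tools.

```python
import re

# pdfminer writes "(cid:N)" for glyphs without a usable Unicode mapping.
CID_RE = re.compile(r"\(cid:\d+\)")


def cid_ratio(text):
    """Return the share of extracted glyphs that are (cid:N) placeholders.

    Whitespace is ignored. Returns 0.0 for empty input.
    """
    placeholders = len(CID_RE.findall(text))
    remaining = CID_RE.sub("", text)
    real_chars = sum(1 for ch in remaining if not ch.isspace())
    total = placeholders + real_chars
    return placeholders / total if total else 0.0


if __name__ == "__main__":
    # Assumes out.txt was produced by: pdf2txt out.pdf -o out.txt
    with open("out.txt", encoding="utf-8") as handle:
        print(f"placeholder ratio: {cid_ratio(handle.read()):.2f}")
```

A ratio close to 1.0 suggests that the text layer exists but its font mapping is unusable, which is a different failure from a PDF with no text layer at all.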

My hOCR file is generated by kraken.

The kraken documentation says:

hOCR output is slightly different from hOCR files produced by ocropus. Each ocr_line span contains not only the bounding box of the line but also character boxes (x_bboxes attribute) indicating the coordinates of each character. In each line alternating sequences of alphanumeric and non-alphanumeric (in the unicode sense) characters are put into ocrx_word spans. Both have bounding boxes as attributes and the recognition confidence for each character in the x_conf attribute.

Paragraph detection has been removed as it was deemed to be unduly dependent on certain typographic features which may not be valid for your input.

So I also tried with an ALTO file (also generated by kraken), which I converted to hOCR with ocr-fileformat. The result was the same.
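To rule out the hOCR input itself, the word spans described in the kraken documentation quoted above can be listed with a short script. This is only a sketch: the sample fragment, its coordinates, and the helper name `iter_words` are made up for illustration, and a real hOCR file is usually XHTML with a namespace, which this example does not handle.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment; text and coordinates are invented for the example.
SAMPLE = """<div class="ocr_page">
<span class="ocr_line" title="bbox 10 10 200 40">
<span class="ocrx_word" title="bbox 10 10 90 40">formé</span>
<span class="ocrx_word" title="bbox 100 10 200 40">texte</span>
</span>
</div>"""

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def iter_words(hocr_fragment):
    """Yield (text, (x0, y0, x1, y1)) for every ocrx_word span with a bbox."""
    root = ET.fromstring(hocr_fragment)
    for element in root.iter("span"):
        if element.get("class") != "ocrx_word":
            continue
        match = BBOX_RE.search(element.get("title", ""))
        if match is None:
            continue  # skip words without a bounding box
        yield "".join(element.itertext()), tuple(int(g) for g in match.groups())


words = list(iter_words(SAMPLE))
for text, bbox in words:
    print(text, bbox)
```

If the words and boxes printed for the real file look sensible, the problem is more likely in the PDF generation step than in the OCR output.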

@stefan6419846

Which version of reportlab are you using? As far as I am aware, reportlab>=4.1.0 breaks hocr-pdf.

@pprw
Author

pprw commented Jul 5, 2024

Thanks for the information.

I was using reportlab 4.2.2. I downgraded to 4.0.9.

Now I no longer get this warning:
WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data

However, I still cannot search inside the PDF, and pdf2txt creates a file filled with:

[image: screenshot of the pdf2txt output]

@misters2008

pprw, I am having the same issue, with these symbols instead of normal text.
Have you been able to fix it?

@stefan6419846

Does it work with pdftotext file.pdf -? At least in my testing, hocr-pdf generates a PDF file with a valid text layer when using the hocr-tools master branch (the release has unfixed issues on Python 3.10) together with reportlab==4.0.9.

@pprw
Author

pprw commented Oct 9, 2024

Sorry for the late reply.

pdftotext file.pdf - does not display anything.

I installed reportlab 4.0.9 and the master version of hocr-tools:

pipx install reportlab==4.0.9 --include-deps --force
pipx install git+https://github.com/ocropus/hocr-tools.git@master --force

I commented out lines 30 and 116 of the hocr-pdf file because of an error about the bidi library:

line 30: from bidi.algorithm import get_display           
line 116:  rawtext = get_display(rawtext)

I opened a separate issue about this: #188.

So maybe the two problems are related. I am trying to fix the bidi error and will then see whether anything changes.

@stefan6419846

This is most likely the same issue as in #188 (comment), i.e. you are not using pipx correctly. hocr-tools currently does not pin reportlab to a compatible version, thus

pipx install git+https://github.com/ocropus/hocr-tools.git@master --force

should show in its output that you are in fact installing and using the latest reportlab version for hocr-tools, not version 4.0.9.
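Because pipx gives each installed application its own virtual environment, the reportlab version that matters is the one visible to the interpreter that actually runs hocr-pdf. A minimal sketch of such a check follows; the 4.1.0 threshold is taken from the report earlier in this thread, and the function names are hypothetical.

```python
import re
from importlib import metadata

# Threshold reported in this thread: reportlab>=4.1.0 breaks hocr-pdf.
FIRST_REPORTED_BAD = (4, 1, 0)


def parse_version(version):
    """Turn '4.2.5' into (4, 2, 5), padding missing parts with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def reportlab_looks_incompatible(version):
    """True if the version is at or above the threshold reported in the thread."""
    return parse_version(version) >= FIRST_REPORTED_BAD


def installed_version(package="reportlab"):
    """Return the installed version in this environment, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("reportlab is not installed in this environment")
    else:
        print(found, "incompatible" if reportlab_looks_incompatible(found) else "ok")
```

The script has to be run with the Python interpreter of the environment that contains hocr-tools; running it with the system interpreter would report a different environment.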

@pprw
Author

pprw commented Oct 15, 2024

Thanks for the comment.

I reinstalled hocr-tools without pipx, with everything in the same virtual environment:

$ python3 -m venv $HOME/.venvs/hocr
$ source $HOME/.venvs/hocr/bin/activate
$ pip install hocr-tools
Collecting hocr-tools
  Using cached hocr_tools-1.1.1-py3-none-any.whl
Collecting Pillow
  Downloading pillow-10.4.0-cp311-cp311-manylinux_2_28_x86_64.whl (4.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 28.2 MB/s eta 0:00:00
Collecting lxml
  Using cached lxml-5.3.0-cp311-cp311-manylinux_2_28_x86_64.whl (5.0 MB)
Collecting reportlab
  Using cached reportlab-4.2.5-py3-none-any.whl (1.9 MB)
Collecting chardet
  Using cached chardet-5.2.0-py3-none-any.whl (199 kB)
Installing collected packages: Pillow, lxml, chardet, reportlab, hocr-tools
Successfully installed Pillow-10.4.0 chardet-5.2.0 hocr-tools-1.1.1 lxml-5.3.0 reportlab-4.2.5

hocr-pdf . > output.pdf produces no error, but the file is still not readable:

$ pdftotext output.pdf -
Syntax Error (2441217): Illegal character <24> in hex string
Syntax Error (2441218): Illegal character <22> in hex string
Syntax Error (2441220): Illegal character <47> in hex string
Syntax Error (2441221): Illegal character <69> in hex string
Syntax Error (2441222): Illegal character <68> in hex string
Syntax Error (2441224): Illegal character <6b> in hex string
Syntax Error (2441225): Illegal character <5c> in hex string
Syntax Error (2441226): Illegal character <4b> in hex string
Syntax Error (2441227): Illegal character <3f> in hex string
Syntax Error (2441229): Illegal character <71> in hex string
Syntax Error (2441231): Illegal character <56> in hex string
Syntax Error (2441232): Illegal character <27> in hex string
Syntax Error (2441233): Illegal character <40> in hex string
Syntax Error (2441234): Illegal character <4d> in hex string
Syntax Error (2441236): Illegal character <2c> in hex string
Syntax Error (2441237): Illegal character <2e> in hex string
Syntax Error (2441238): Illegal character <51> in hex string
Syntax Error (2441240): Illegal character <5f> in hex string
Syntax Error (2441241): Illegal character <24> in hex string
Syntax Error (2441242): Illegal character <58> in hex string
Syntax Error (2441243): Illegal character <3a> in hex string
Syntax Error (2441244): Illegal character <3f> in hex string
Syntax Error (2441245): Illegal character <6b> in hex string
Syntax Error (2441246): Illegal character <23> in hex string
Syntax Error (2441247): Illegal character <2f> in hex string
Syntax Error (2441248): Illegal character <6d> in hex string
Syntax Error (2441249): Illegal character <73> in hex string
Syntax Error (2441250): Illegal character <6d> in hex string
Syntax Error (2441251): Illegal character <6d> in hex string
Syntax Error (2441252): Illegal character <51> in hex string
Syntax Error (2441253): Illegal character <2f> in hex string
Syntax Error (2441255): Illegal character <54> in hex string
Syntax Error (2441256): Illegal character <24> in hex string
Syntax Error (2441257): Illegal character <48> in hex string
Syntax Error (2441261): Illegal character <5b> in hex string
Syntax Error (2441262): Illegal character <70> in hex string
Syntax Error (2441263): Illegal character <2f> in hex string
Syntax Error (2441264): Illegal character <68> in hex string
Syntax Error (2441265): Illegal character <71> in hex string
Syntax Error (2441266): Illegal character <59> in hex string
Syntax Error (2441267): Illegal character <2c> in hex string
Syntax Error (2441268): Illegal character <3c> in hex string
Syntax Error (2441269): Illegal character <5f> in hex string
Syntax Error (2441270): Illegal character <57> in hex string
Syntax Error (2441273): Illegal character <50> in hex string
Syntax Error (2441275): Illegal character <69> in hex string
Syntax Error (2441276): Illegal character <40> in hex string
Syntax Error (2441278): Illegal character <4c> in hex string
Syntax Error (2441280): Illegal character <70> in hex string
Syntax Error (2441281): Illegal character <5d> in hex string
Syntax Error (2441282): Illegal character <4a> in hex string
Syntax Error (2441283): Illegal character <23> in hex string
Syntax Error (2441284): Illegal character <59> in hex string
Syntax Error (2441285): Illegal character <56> in hex string
Syntax Error (2441287): Illegal character <71> in hex string
Syntax Error (2441288): Illegal character <5e> in hex string
Syntax Error (2441290): Illegal character <4c> in hex string
Syntax Error (2441291): Illegal character <28> in hex string
Syntax Error (2441292): Illegal character <24> in hex string
Syntax Error (2441293): Illegal character <2f> in hex string
Syntax Error (2441294): Illegal character <55> in hex string
Syntax Error (2441217): Illegal character <24> in hex string
Syntax Error (2441218): Illegal character <22> in hex string
Syntax Error (2441220): Illegal character <47> in hex string
Syntax Error (2441221): Illegal character <69> in hex string
Syntax Error (2441222): Illegal character <68> in hex string
Syntax Error (2441224): Illegal character <6b> in hex string
Syntax Error (2441225): Illegal character <5c> in hex string
Syntax Error (2441226): Illegal character <4b> in hex string
Syntax Error (2441227): Illegal character <3f> in hex string
Syntax Error (2441229): Illegal character <71> in hex string
Syntax Error (2441231): Illegal character <56> in hex string
Syntax Error (2441232): Illegal character <27> in hex string
Syntax Error (2441233): Illegal character <40> in hex string
Syntax Error (2441234): Illegal character <4d> in hex string
Syntax Error (2441236): Illegal character <2c> in hex string
Syntax Error (2441237): Illegal character <2e> in hex string
Syntax Error (2441238): Illegal character <51> in hex string
Syntax Error (2441240): Illegal character <5f> in hex string
Syntax Error (2441241): Illegal character <24> in hex string
Syntax Error (2441242): Illegal character <58> in hex string
Syntax Error (2441243): Illegal character <3a> in hex string
Syntax Error (2441244): Illegal character <3f> in hex string
Syntax Error (2441245): Illegal character <6b> in hex string
Syntax Error (2441246): Illegal character <23> in hex string
Syntax Error (2441247): Illegal character <2f> in hex string
Syntax Error (2441248): Illegal character <6d> in hex string
Syntax Error (2441249): Illegal character <73> in hex string
Syntax Error (2441250): Illegal character <6d> in hex string
Syntax Error (2441251): Illegal character <6d> in hex string
Syntax Error (2441252): Illegal character <51> in hex string
Syntax Error (2441253): Illegal character <2f> in hex string
Syntax Error (2441255): Illegal character <54> in hex string
Syntax Error (2441256): Illegal character <24> in hex string
Syntax Error (2441257): Illegal character <48> in hex string
Syntax Error (2441261): Illegal character <5b> in hex string
Syntax Error (2441262): Illegal character <70> in hex string
Syntax Error (2441263): Illegal character <2f> in hex string
Syntax Error (2441264): Illegal character <68> in hex string
Syntax Error (2441265): Illegal character <71> in hex string
Syntax Error (2441266): Illegal character <59> in hex string
Syntax Error (2441267): Illegal character <2c> in hex string
Syntax Error (2441268): Illegal character <3c> in hex string
Syntax Error (2441269): Illegal character <5f> in hex string
Syntax Error (2441270): Illegal character <57> in hex string
Syntax Error (2441273): Illegal character <50> in hex string
Syntax Error (2441275): Illegal character <69> in hex string
Syntax Error (2441276): Illegal character <40> in hex string
Syntax Error (2441278): Illegal character <4c> in hex string
Syntax Error (2441280): Illegal character <70> in hex string
Syntax Error (2441281): Illegal character <5d> in hex string
Syntax Error (2441282): Illegal character <4a> in hex string
Syntax Error (2441283): Illegal character <23> in hex string
Syntax Error (2441284): Illegal character <59> in hex string
Syntax Error (2441285): Illegal character <56> in hex string
Syntax Error (2441287): Illegal character <71> in hex string
Syntax Error (2441288): Illegal character <5e> in hex string
Syntax Error (2441290): Illegal character <4c> in hex string
Syntax Error (2441291): Illegal character <28> in hex string
Syntax Error (2441292): Illegal character <24> in hex string
Syntax Error (2441293): Illegal character <2f> in hex string
Syntax Error (2441294): Illegal character <55> in hex string
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Invalid XRef entry 0
Syntax Error (2437693): Missing 'endstream' or incorrect stream length
Syntax Error (2436161): Bad FCHECK in flate stream
Syntax Error: Embedded font file may be invalid
Syntax Error (2436088): Missing 'endstream' or incorrect stream length
Syntax Error (2435010): Bad FCHECK in flate stream

@stefan6419846

That is because you are using reportlab==4.2.5. Please force reportlab==4.0.9.

@pprw
Author

pprw commented Oct 15, 2024

Sorry, I noticed that just after commenting.

With
pip install reportlab==4.0.9 --force

I get a PDF with a readable text layer.

pdf2txt still complains about corrupted data:

$ pdf2txt output.pdf -o out.txt
WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data

pdftotext output.pdf -

displays the text

Evince (the PDF reader) still reports "some font thing failed" many times when reading the PDF, but search works.

@stefan6419846

I have not validated other tools further, but you might want to have a look at https://github.com/stefan6419846/hocr-tools, which both fixes the compatibility with recent reportlab versions and includes #178, which might fix some of these aspects.

@pprw
Author

pprw commented Oct 15, 2024

I think my problem is related to accent support. The recognized text is in French, and I cannot search for accented letters in the output PDF created by hocr-pdf.

For example, "formé" in the hOCR file becomes "formeÄ" in the output PDF.

I will try your fork
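One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the hOCR text stores "é" in decomposed form (the letter "e" followed by a combining acute accent), and the combining character is then mishandled in the PDF text layer. The sketch below only shows how to inspect a string and normalize it to the composed NFC form with the Python standard library; whether this changes the hocr-pdf output is untested here, and the `describe` helper is hypothetical.

```python
import unicodedata


def describe(text):
    """List each code point with its Unicode name, to spot combining marks."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in text]


# "formé" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "forme\u0301"

# NFC merges the base letter and the combining mark into a single code point.
composed = unicodedata.normalize("NFC", decomposed)

for line in describe(decomposed):
    print("decomposed:", line)
for line in describe(composed):
    print("composed:  ", line)
```

If the words in the hOCR file turn out to contain combining marks, normalizing the file to NFC before running hocr-pdf would be a cheap experiment.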
