corrupted data when generating a searchable pdf with hocr-pdf #186
Comments
Which version of |
pprw, I am having the same issue: these symbols appear instead of normal text. |
Does it work with |
Sorry for the late reply.
I installed reportlab .0.9 and the master version of hocr-tools.
I commented out lines 30 and 116 of the hocr-pdf file because of an error about the bidi library.
I opened a separate issue about this: #188. So maybe the two are related. I am trying to fix the bidi error and will then see whether anything changes. |
This is most likely the same issue as in #188 (comment), i.e. you are not using
should indicate that you are indeed installing/using the latest |
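One way to check which versions are actually installed in the active environment is sketched below. It is only a rough diagnostic, not something posted in this thread: the helper name `installed_version` is hypothetical, and the distribution names `hocr-tools`, `reportlab` and `python-bidi` are assumed to match how the packages were installed.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The interpreter path shows which environment is being inspected
    # (for example a pipx venv versus the system Python).
    print("interpreter:", sys.executable)
    # Distribution names are assumptions; adjust them to your setup.
    for name in ("hocr-tools", "reportlab", "python-bidi"):
        print(name, "->", installed_version(name))
```

Running this with the same interpreter that runs hocr-pdf should show whether the expected versions are the ones being picked up.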
Thanks for the comment. I reinstalled hocr-tools without using pipx, in the same environment
|
Because you are using |
Sorry, I noticed that just after commenting. With I have a PDF with a readable text layout. pdf2txt still complains about corrupted data
displays the text. Evince (PDF reader) complains a lot with "some font thing failed" when reading the PDF, but search works. |
I have not validated other tools further, but you might want to have a look at https://github.com/stefan6419846/hocr-tools which fixes both the compatibility with recent reportlab versions and includes #178 which might fix some of these aspects. |
I think my problem is related to accent support. The recognized text is in French, and I cannot search for accented letters in the output PDF created by hocr-pdf. For example, "formé" in the hOCR file is "formeÄ" in the output PDF. I will try your fork. |
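One possible cause worth ruling out (a guess, not confirmed anywhere in this thread) is that the hOCR text stores "é" in decomposed form, as "e" followed by a combining acute accent, which some PDF fonts or encodings do not handle. The sketch below uses only the standard `unicodedata` module to inspect a string and normalize it to composed (NFC) form. The sample string and the helper names `describe` and `to_nfc` are hypothetical.

```python
import unicodedata


def describe(text):
    """List the code point and Unicode name of every character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in text]


def to_nfc(text):
    """Return text with base letters and combining marks composed (NFC)."""
    return unicodedata.normalize("NFC", text)


# Hypothetical sample: "formé" written as "e" + COMBINING ACUTE ACCENT.
decomposed = "forme\u0301"
composed = to_nfc(decomposed)

print(len(decomposed), describe(decomposed)[-1])
print(len(composed), describe(composed)[-1])
```

If the words in the hOCR file turn out to be decomposed, normalizing them before running hocr-pdf might be worth a try; whether that fixes the search problem here is untested.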
I am trying to generate a searchable PDF from a JPEG file and an hOCR file with the help of hocr-pdf.
I have both files in the same folder.
hocr-pdf . > out.pdf
generates a PDF, but I cannot search inside it. The PDF reader (Evince) says "some font thing failed" when displaying the file (I can see the image).
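Before looking at fonts, it may help to confirm that every image in the folder has an hOCR file next to it. The sketch below assumes that hocr-pdf pairs files by base name, looking for `*.jpg` images and a sibling file with the `.hocr` extension; that is my reading of the script, not something stated in this thread. The helper name `find_pairs` is hypothetical.

```python
import glob
import os


def find_pairs(directory):
    """Return (paired, missing): JPEG paths with and without a sibling .hocr file."""
    paired, missing = [], []
    for image in sorted(glob.glob(os.path.join(directory, "*.jpg"))):
        # Assumed naming convention: page.jpg goes with page.hocr.
        hocr = os.path.splitext(image)[0] + ".hocr"
        (paired if os.path.exists(hocr) else missing).append(image)
    return paired, missing


if __name__ == "__main__":
    paired, missing = find_pairs(".")
    print("images with hOCR:", paired)
    print("images without hOCR:", missing)
```

An image listed as "without hOCR" would end up in the PDF with no text layer at all, which is a different symptom from the garbled text reported here.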
When I extract the text from the pdf
and out.txt contains (excerpt)
My hocr file is generated by kraken.
I read in the kraken documentation
So I also tried with an ALTO file (also generated by kraken), which I converted to hOCR format with the help of ocr-fileformat. Same result.