hocr-pdf: change encoding from latin1 to utf-8 #171
Comments
@kba I have mitigated the issue by using the version on GitHub instead of the one on PyPI. The GitHub version of hocr-tools appears to be more up to date, and the PyPI release needs to be updated. You can see the result in the output.pdf file. I do have some questions about adding spacing in hOCR files, so that spaces are preserved when text is copied and pasted (Ctrl+C, Ctrl+V) from a PDF compiled with hocr-pdf. I am attaching the PDF for reference: output_2.pdf
Question? @kba
How do I add word spacing for UTF-8 languages like Arabic in the hOCR format?
Here is the current preview of the old output using the hOCR format
This is the output coming straight from Tesseract OCR without anything applied. In the above example, I had to add manual spaces in the hOCR tags.
Manual hOCR Spaces
That is, I add the spaces manually by writing a space character into the hOCR markup.
Result of Manual Spacing
The result seems to be what is desired and expected:
وأماثانيا فلأنه يخرج منه من زنى مثلا ثممبٌدك فإنه
Default Tesseract hOCR Output
whereas the default result from Tesseract is something like
So is adding spaces needed, or is this just the case for UTF-8 languages?
That is, do I also have to add spaces for Latin scripts, or does Tesseract already handle this correctly for Latin scripts such as English? Does it work fine in your testing? Or is this a bug that needs to be tackled in Tesseract's hOCR file creation for Arabic and for other languages that show the same problem when compiled with hocr-pdf?
Contributing to Wikipedia
I just contributed to the hOCR page on Wikipedia, adding some of the latest information about making a searchable PDF file. I think we also need to document the proper syntax of the hOCR format, since at the moment you have to get your hands dirty to find it.
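On the word-spacing question above, the following is a minimal sketch of joining the words of one hOCR line with explicit spaces, so that the extracted text keeps its word boundaries. It is not how hocr-pdf or Tesseract work internally. The hOCR fragment is hand-written for illustration (its `bbox` values are made up), and `line_text` is a hypothetical helper name. It assumes the fragment is well-formed XML and that words are marked with the `ocrx_word` class.

```python
import xml.etree.ElementTree as ET

# Hand-written hOCR fragment for illustration only; the bbox values are invented.
HOCR_LINE = (
    "<span class='ocr_line' title='bbox 10 10 300 40'>"
    "<span class='ocrx_word' title='bbox 10 10 90 40'>فلأنه</span>"
    "<span class='ocrx_word' title='bbox 100 10 200 40'>يخرج</span>"
    "<span class='ocrx_word' title='bbox 210 10 300 40'>منه</span>"
    "</span>"
)


def line_text(line_elem):
    """Join the text of every ocrx_word in a line with single spaces.

    Hypothetical helper: words are taken in document (logical) order,
    which is also the order used for right-to-left scripts.
    """
    words = [
        "".join(word.itertext()).strip()
        for word in line_elem.iter("span")
        if word.get("class") == "ocrx_word"
    ]
    # Skip empty words so that no double spaces are produced.
    return " ".join(w for w in words if w)


line = ET.fromstring(HOCR_LINE)
joined = line_text(line)
print(joined)
```

Whether the space has to be written into the hOCR file itself or can be added by the tool that builds the PDF text layer depends on that tool; the sketch only shows the second option in isolation.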
Does someone have a detailed description of how to change the encoding to UTF-8? In my example, the text in the hOCR file is "Kötnerho...", but I always get "KÅtnerho..." in the PDF when using hocr2pdf. I also get "GauÄstrasse" instead of "Gaußstrasse".
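As general background for the question above, here is a small sketch of what a latin1/UTF-8 mismatch does to text. It is only an illustration of the class of problem: the exact substitution reported here ("Å") may come from a different step, such as the font encoding used in the PDF, and nothing in this sketch is taken from the hocr-pdf or hocr2pdf source. The sample string is the visible part of the word quoted in the report.

```python
# Visible prefix of the word from the report; the rest is elided there.
text = "Kötnerho"

# UTF-8 stores "ö" as two bytes (0xC3 0xB6).
utf8_bytes = text.encode("utf-8")

# Reading those bytes as latin1 turns each byte into its own character,
# which is the typical "mojibake" pattern.
misread = utf8_bytes.decode("latin-1")
print(misread)

# The damage is reversible as long as no bytes were dropped.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)

# latin1 cannot represent Arabic at all, so encoding fails outright.
try:
    "فلأنه".encode("latin-1")
    arabic_fits_latin1 = True
except UnicodeEncodeError:
    arabic_fits_latin1 = False
print(arabic_fits_latin1)
```

The last step shows why a latin1-only text layer cannot hold non-Latin scripts, which is the motivation given in the issue title for switching to UTF-8.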
Possibly, I have never used hocr-pdf with non-latin texts - what happens when you do?
Originally posted by @UBISOFT-1 in #170 (comment)