hocr-pdf: change encoding from latin1 to utf-8 #171

Open
kba opened this issue Dec 13, 2021 · 3 comments

Comments

@kba
Contributor

kba commented Dec 13, 2021

Also, what will happen if we change the encoding from 'latin-1' to 'utf-8'? Would that help if we are dealing with, say, Arabic typescript?

Possibly, I have never used hocr-pdf with non-latin texts - what happens when you do?

Originally posted by @UBISOFT-1 in #170 (comment)
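
For context on the question above, here is a minimal sketch of why the codec choice matters for Arabic text. It only demonstrates the two codecs themselves and makes no claim about where hocr-pdf applies its encoding internally; `can_encode` is a hypothetical helper name.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


arabic_word = "وأما"      # word taken from the hOCR sample in this thread
german_word = "Gaußstrasse"

# Latin-1 covers only U+0000..U+00FF, so Arabic letters cannot be encoded.
print(can_encode(arabic_word, "latin-1"))
# UTF-8 can encode any Unicode code point.
print(can_encode(arabic_word, "utf-8"))
# Western European letters such as ß fit in both encodings.
print(can_encode(german_word, "latin-1"))
```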

@UBISOFT-1

UBISOFT-1 commented Dec 13, 2021

@kba I have worked around the issue by using the version on GitHub instead of the one on PyPI. The GitHub version of hocr-tools appears to be more up to date, and the PyPI release needs to be updated.
Regarding UTF-8 content in languages such as Hebrew, Arabic, or Sindhi, which are written right to left (RTL), it seems to work fine after the R2L update.

You can see that in the following output.pdf file. I do have a question, though, about adding spacing in hOCR files: when the PDF is compiled with hocr-pdf and text is copied (Ctrl+C) and pasted (Ctrl+V) from the result, the spaces must be preserved.

I am attaching the PDF file for reference: output_2.pdf

@UBISOFT-1

UBISOFT-1 commented Dec 13, 2021

Question, @kba:

How do I add word spacing for UTF-8 languages like Arabic in the hOCR format?

Here is the current preview of the .hocr file used to compile the output_2.pdf file shown earlier in this thread. Should we use the ocrx_word tag? And how should the .hocr file be written for non-Latin characters so that text can be copied from the PDF and pasted into a text editor with paragraphs and other metadata preserved?

Old Output using hOCR format

This is the output coming straight from Tesseract OCR without anything applied. For the example above, I had to manually add spaces inside the hOCR tags, like this.

Manual hOCR Spaces

That is, I manually add a space on each side of the word, between the > and < of the span: >space word space<
<span class='ocrx_word' id='word_1_1' title='bbox 1073 213 1145 316; x_wconf 90'> وأما </span>

Result of Manual Spacing

The Result seems to be what is desired and expected.

وأماثانيا فلأنه يخرج منه من زنى مثلا ثممبٌدك فإنه
لا يتأتى منهغير الندمعلى مامضئ وأما العزم على عدم
العود فلا يتصورمنهقال وبهذا اغتر من.قال إن الندم
يكفي فيحد التوبة وليس كما قال لأنه لوندم ولميقلع
وعزم على العود لم يكن تائبااتفاقا قال وقال بعض النحققين
هي اختيارترك ذنب سبق حقيقة.أوتقديرالأجل الله قال
وهذا أُسَدّ العبارات ”وأجمعها لأن التائب لا يكون تاركا
للذنب الذي فرغ لأنهغير متمكن من عينه لاتركا ولافعلا
وإنماهومتمكنمنمثلهحقيقة وكذا من لم يقعمنهذنب
إنما يصحمنه اتقاء ما يمكن أن.يقع لا ترك مثل ما وقع فيكون
متقيا لا تائبا قال والباعث على هذا تنبيهإلهئ لمن أراد
سعادته لقبح الذنب وضرره لأنهسم مهلكيُمَرَتْعلى
الإنسان سعادة الدنيا والآخرة. ويحجبهعنمعرفة الله.تعالىفي
الدنيا وعنتقريبه في الآخرة
قال .:ومنتفقدنفسه وجدها مشحونة بهذا السم فإذا وفق
انبعثمنهخوف هجوم الهلاك عليه فيبادر بطلب ما يدفع
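
The manual padding described above could be automated with a small post-processing step. The following is a minimal sketch, not part of hocr-tools: `pad_word_spans` is a hypothetical helper, and it assumes each `ocrx_word` span contains plain text only (spans with nested tags such as `<strong>` are left untouched).

```python
import re

# Matches an ocrx_word span whose content is plain text (no nested tags).
WORD_SPAN = re.compile(
    r"(<span\b[^>]*\bclass=['\"]ocrx_word['\"][^>]*>)([^<]*)(</span>)"
)


def pad_word_spans(hocr: str) -> str:
    """Surround the text of every ocrx_word span with a single space."""
    def _pad(match: re.Match) -> str:
        opening, word, closing = match.groups()
        # strip() first so that running the function twice adds no extra spaces
        return f"{opening} {word.strip()} {closing}"

    return WORD_SPAN.sub(_pad, hocr)


default_span = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 1073 213 1145 316; x_wconf 90'>وأما</span>"
)
print(pad_word_spans(default_span))
```

A regular expression is used here only to keep the sketch dependency-free; for real hOCR files an HTML/XML parser would be the more robust choice.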

Default Tesseract hOCR Output

whereas the default result from Tesseract is something like
<span class='ocrx_word' id='word_1_1' title='bbox 1073 213 1145 316; x_wconf 90'>وأما</span>


1
وأماثانيا:فلأنهيخرجمنهمنزنىمثلاثممبٌدكفإنهلايتأتىمنهغيرالندمعلىمامضئ»وأماالعزمعلىعدمالعودفلايتصورمنه»قال:وبهذااغترمن.قال:إنالندميكفيفيحدالتوبة»وليسكماقال؛لأنهلوندمولميقلعوعزمعلىالعودلميكنتائبااتفاقا»قال:وقالبعضالنحققين:هياختيارتركذنبسبقحقيقة.أوتقديرالأجلاللهقال:وهذاأُسَدّالعبارات“وأجمعهالأنالتائبلايكونتاركاللذنبالذيفرغلأنهغيرمتمكنمنعينهلاتركاولافعلا»وإنماهومتمكنمنمثلهحقيقة»وكذامنلميقعمنهذنبإنمايصحمنهاتقاءمايمكنأن.يقعلاتركمثلماوقعفيكونمتقيالاتائبا»قال:والباعثعلىهذاتنبيهإلهئلمنأرادسعادتهلقبحالذنبوضرره؛لأنهسممهلكيُمَرَتْثعلىالإنسانسعادةالدنياوالآخرة.ويحجبهعنمعرفةالله.تعالىفيالدنيا»وعنتقريبهفيالآخرة.قال:.ومنتفقدنفسهوجدهامشحونةبهذاالسمفإذاوفقانبعثمنهخوفهجومالهلاكعليه»فيبادربطلبمايدفع

So Is Adding Spaces Needed, or Is This Just the Case for UTF-8 Languages?

That is, do I also have to add spaces for Latin scripts, or does Tesseract's output already work there, for example with English words? Does it work fine in your testing? Or is this a bug in Tesseract's hOCR file creation that needs to be fixed for Arabic and other languages with the same problem?

Contributing to Wikipedia

I just contributed to the hOCR page on Wikipedia, adding some of the latest information about making a searchable PDF file. I think we also need to document the proper syntax of the hOCR format, since at the moment you have to get your hands dirty to find it.

@MaxIhme

MaxIhme commented Sep 26, 2022

Does someone have a detailed way to change the encoding to UTF-8? In my example, the text in the hocr file is "Kötnerho...", but I always get "KÅtnerho..." in the PDF when using hocr2pdf. I also get "GauÄstrasse" instead of "Gaußstrasse".
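
One way to narrow such a problem down is to first confirm that the hOCR file itself is valid UTF-8, which rules out the input as the source of the garbling. This is a minimal diagnostic sketch only; `decodes_as_utf8` is a hypothetical helper, and it does not change how hocr2pdf or hocr-pdf write text into the PDF.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Bytes as they would appear in a file saved as UTF-8 ...
print(decodes_as_utf8("Gaußstrasse".encode("utf-8")))
# ... and in a file saved as Latin-1, where ß is the single byte 0xDF.
print(decodes_as_utf8("Gaußstrasse".encode("latin-1")))
```

For a file on disk, the bytes can be read in binary mode, for example with `open(path, "rb").read()`, where `path` stands for the hOCR file in question.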
