Line images using HOCR #160

Open
rraina97 opened this issue Jun 4, 2020 · 2 comments
rraina97 commented Jun 4, 2020

I want to create training data (line images and corresponding .txt files) for Arabic languages using Arabic documents. I used tesseract with hocr to create a hocr file and then used hocr-extract-images to get line data. But the hocr file created is of very low accuracy (maybe it depends on the already trained tesseract model). Is there any other method to create line images that can be used to train tesseract and thus improve its accuracy?

kba commented Jun 4, 2020

> But the hocr file created is of very low accuracy (maybe it depends on the already trained tesseract model)

IIUC, you are not happy with the line segmentation? In that case you should indeed investigate tesseract and the documentation. There are also other tools for line segmentation: ocropy, sbb_textline_detection, and several implementations in OCR-D.

Once you have an hOCR file with the right segmentation, we can support you with the right hocr-tools invocation.
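
For orientation, here is a rough sketch of what extracting line images and text from an hOCR file involves. It is not the hocr-tools implementation, only an illustration under some assumptions: lines are marked with the `ocr_line` class and carry a `bbox x0 y0 x1 y1` entry in their `title` attribute, and the file is well-formed XHTML (every opened tag is closed). The names `HocrLineParser`, `extract_lines`, and `write_training_pairs` are made up for this example, and the cropping step assumes Pillow is installed.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 36 92 580 122; baseline 0 -5"
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) for every element whose class list contains 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished lines as ((x0, y0, x1, y1), text)
        self._depth = 0     # nesting depth inside the current line; 0 means outside
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Already inside a line: just track nesting (assumes closed tags).
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocr_line" in classes and match:
            self._depth = 1
            self._bbox = tuple(int(v) for v in match.groups())
            self._chunks = []

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse the whitespace between word elements into single spaces.
            text = " ".join("".join(self._chunks).split())
            self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_lines(hocr_text):
    """Return a list of ((x0, y0, x1, y1), text) tuples, one per ocr_line."""
    parser = HocrLineParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.lines


def write_training_pairs(image_path, hocr_path, out_prefix):
    """Write <prefix>-NNN.png and <prefix>-NNN.txt for each line (needs Pillow)."""
    from PIL import Image  # third-party dependency, imported only when used

    page = Image.open(image_path)
    with open(hocr_path, encoding="utf-8") as handle:
        lines = extract_lines(handle.read())
    for index, (bbox, text) in enumerate(lines, start=1):
        page.crop(bbox).save(f"{out_prefix}-{index:03d}.png")
        with open(f"{out_prefix}-{index:03d}.txt", "w", encoding="utf-8") as out:
            out.write(text + "\n")
```

Note that the text written this way is still the OCR output, so for training data each .txt file has to be corrected by hand against its line image. Other line-level classes that may appear in hOCR (for example `ocr_caption` or `ocr_header`) are ignored in this sketch.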
