Line images using HOCR #160

rraina97 · 2020-06-04T06:22:35Z

I want to create training data (line images and corresponding .txt files) for Arabic languages using Arabic documents. I used tesseract with hocr to create a hocr file and then used hocr-extract-images to get line data. But the hocr file created is of very low accuracy (maybe it depends on the already trained tesseract model). Is there any other method to create line images that can be used to train tesseract and thus improve its accuracy?

zvezdochiot · 2020-06-04T10:13:44Z

@rraina97, go to https://github.com/tesseract-ocr/tesseract!

kba · 2020-06-04T12:10:38Z

But the hocr file created is of very low accuracy (maybe it depends on the already trained tesseract model)

IIUC you are not happy with the line segmentation? In that case you should indeed investigate tesseract and its documentation. There are also other tools for line segmentation, in ocropy, sbb_textline_detection, and several implementations in OCR-D.

Once you have an hOCR file with the right segmentation, we can support you with the right hocr-tools invocation.
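
For orientation, here is a minimal sketch of what a line-extraction step over an hOCR file could look like. It is not the hocr-extract-images implementation. It assumes each text line is an element with class ocr_line and a "bbox x0 y0 x1 y1" entry in its title attribute (other line-level classes are ignored here). The names LineCollector, extract_lines and write_training_pairs are hypothetical, the cropping step assumes Pillow is installed, and the .gt.txt suffix is an assumption about what the training setup expects. The recognized text is only a draft and would still need manual correction before training.

```python
import re
from html.parser import HTMLParser
from pathlib import Path

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class LineCollector(HTMLParser):
    """Collect (bbox, text) for every element whose class includes ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished lines as (bbox tuple, text)
        self._bbox = None   # bbox of the line currently open, if any
        self._tag = None    # tag name of the open line element
        self._depth = 0     # nesting depth of that tag inside the line
        self._chunks = []   # text fragments seen inside the open line

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is None:
            classes = (attrs.get("class") or "").split()
            match = BBOX_RE.search(attrs.get("title") or "")
            if "ocr_line" in classes and match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._tag, self._depth, self._chunks = tag, 1, []
        elif tag == self._tag:
            # Word spans usually share the line's tag name, so track nesting.
            self._depth += 1

    def handle_endtag(self, tag):
        if self._bbox is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace between words into single spaces.
                text = " ".join("".join(self._chunks).split())
                self.lines.append((self._bbox, text))
                self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)


def extract_lines(hocr_text):
    """Return a list of ((x0, y0, x1, y1), text) tuples, one per ocr_line."""
    parser = LineCollector()
    parser.feed(hocr_text)
    parser.close()
    return parser.lines


def write_training_pairs(image_path, hocr_text, out_dir):
    """Crop each line from the page image and write a draft text file next to it."""
    from PIL import Image  # assumption: Pillow is available

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    page = Image.open(image_path)
    for i, (bbox, text) in enumerate(extract_lines(hocr_text)):
        page.crop(bbox).save(out / f"line-{i:04d}.png")
        # Draft transcription only; correct it by hand before training.
        (out / f"line-{i:04d}.gt.txt").write_text(text + "\n", encoding="utf-8")
```

If the bounding boxes in the hOCR file are wrong, the crops will be wrong too, so this only helps once the segmentation itself is acceptable.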
