I want to create training data (line images and corresponding .txt files) for Arabic using Arabic documents. I used tesseract with hocr to create a hocr file and then used hocr-extract-images to get line data. But the hocr file created is of very low accuracy (maybe it depends on the already trained tesseract model). Is there any other method to create line images which can be used to train tesseract and thus improve its accuracy?
But the hocr file created is of very low accuracy (maybe it depends on the already trained tesseract model)
IIUC, you are not happy with the line segmentation? In that case you should indeed investigate tesseract and its documentation. There are also other tools for line segmentation: ocropy, sbb_textline_detection, and several implementations in OCR-D.
Once you have an hOCR file with the right segmentation, we can support you with the right hocr-tools invocation.
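In the meantime, to check whether the segmentation in an hOCR file is usable, it can help to list the line boxes and their text directly. The following is only a minimal sketch using the Python standard library, not how hocr-extract-images works internally. It assumes lines are marked with class `ocr_line` and carry a `bbox x0 y0 x1 y1` entry in their `title` attribute, and that the markup is well-formed XHTML. `LineCollector`, `extract_lines` and `SAMPLE` are hypothetical names made up for illustration.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineCollector(HTMLParser):
    """Collect (bbox, text) for every element with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0      # nesting depth inside the current line, 0 = outside
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            # Already inside a line: track nesting so we find its closing tag.
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._chunks = []
                self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace between word elements into single spaces.
                text = " ".join("".join(self._chunks).split())
                self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_lines(hocr_text):
    """Return a list of ((x0, y0, x1, y1), text) tuples, one per line."""
    parser = LineCollector()
    parser.feed(hocr_text)
    parser.close()
    return parser.lines


# Made-up hOCR fragment, for illustration only.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 800 600'>
 <span class='ocr_line' title='bbox 10 20 790 60; baseline 0 -5'>
  <span class='ocrx_word' title='bbox 10 20 200 60'>مرحبا</span>
  <span class='ocrx_word' title='bbox 210 20 400 60'>بالعالم</span>
 </span>
 <span class='ocr_line' title='bbox 10 80 790 120'>
  <span class='ocrx_word' title='bbox 10 80 300 120'>سطر</span>
 </span>
</div>
"""

if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE):
        print(bbox, text)
```

If the boxes look right but the text is wrong, the segmentation is fine and only the transcriptions need manual correction. Each box is in left, top, right, bottom order, so it could be passed to an image library's crop function (for example Pillow's `Image.crop`) to produce the line image.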