segmentation: tesseract5.3.0 vs ocrd/all:2022-08-15 #346
Comments
Hard to tell from a diff tool that I don't know and data I cannot see. It looks like two lines are duplicated in ocrd-tesserocr.
Binarization will have an impact, yes, both on segmentation and on recognition. (For recognition, we currently don't pass the raw images, because we don't know what the model "wants". The only way to ensure raw recognition is either to have no binarization in the workflow at all, or to remove the respective annotation in the fileGrp that is used as input for recognition, for example via […].)
Mind that ocrd-tesserocr-segment plus ocrd-tesserocr-recognize is not recommended, as it needlessly throws away internal information of the Tesseract layout analysis. (You can do segmentation and recognition in one pass with ocrd-tesserocr-recognize.)
Standalone Tesseract is another beast entirely. It always uses the raw image for recognition. Also, you can now choose some new adaptive binarization via |
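To illustrate what "removing the respective annotation" could look like: the following is only a rough sketch, not the command referred to above. It assumes the binarized derivative is annotated in the PAGE-XML as an AlternativeImage element whose comments attribute lists "binarized"; the function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def strip_binarized_images(page_xml: str) -> Tuple[str, int]:
    """Remove AlternativeImage entries marked as binarized from a PAGE-XML string.

    Returns the modified XML and the number of removed elements.
    Assumption: binarized derivatives carry "binarized" in their comments attribute.
    """
    root = ET.fromstring(page_xml)
    removed = 0
    # Snapshot the element list first so that removing children is safe.
    for parent in list(root.iter()):
        for child in list(parent):
            local_name = child.tag.rsplit("}", 1)[-1]  # drop the namespace part
            features = [f.strip() for f in (child.get("comments") or "").split(",")]
            if local_name == "AlternativeImage" and "binarized" in features:
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Hypothetical, heavily shortened PAGE-XML document for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="raw.tif">
    <AlternativeImage filename="bin.png" comments="binarized"/>
    <AlternativeImage filename="crop.png" comments="cropped"/>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    cleaned, count = strip_binarized_images(SAMPLE)
    print(f"removed {count} binarized AlternativeImage element(s)")
```

Note that an image that is both cropped and binarized would be dropped as well, so this only makes sense on a copy of the fileGrp that is used as recognition input.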
ok |
Thanks @jbarth-ubhd for checking thoroughly! BTW, if you want to try any of the better Calamari 2 models here and there (probably also here), you currently have to switch to Calamari 2 on the standalone CLI. (In an OCR-D workflow, this can be integrated by first exporting line images with |
Step 2 does not work (all pip modules installed without any conflict):
BTW the md5sum of all *.json is the same?
But the error message |
Note that there are more errors than in the first OCR comparison image here (#346 (comment)), but the base image is almost the same |
Ouch. With Python>=3.8 we are now being hit hard by Calamari-OCR/calamari#78. The solution is to convert the models from HDF5 to the SavedModel format, but for that you need a Python+TF version in which the HDF5 model still loads in the first place. As a workaround, you can try Python 3.7 or 3.6. |
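To find out which checkpoints are affected, one could check the on-disk format before attempting to load anything. A minimal sketch, assuming that legacy checkpoints are plain HDF5 files (recognizable by the 8-byte HDF5 signature at offset 0) and that converted models are TensorFlow SavedModel directories containing a saved_model.pb file; the function name is hypothetical, and HDF5 files with a user block in front of the signature are not detected.

```python
from pathlib import Path

# Signature at the start of an HDF5 file (without a preceding user block).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def checkpoint_format(path) -> str:
    """Return "hdf5", "savedmodel" or "unknown" for a model path."""
    p = Path(path)
    if p.is_dir():
        # A SavedModel is a directory with a saved_model.pb at its top level.
        return "savedmodel" if (p / "saved_model.pb").is_file() else "unknown"
    if p.is_file():
        with p.open("rb") as fh:
            if fh.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return "hdf5"
    return "unknown"
```

Anything reported as "hdf5" would still need the conversion (or the older Python) described above.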
Hard to tell. Tesseract LA is very buggy (I would even say fragile) and the legacy code has not been touched (maintained) for years... |
Quote from Stefan Weil: »We have clear evidence that it is extremely important to have line images for recognition which are similar to those used for training.« |
Yes, that's obviously true. But as a user you have no way of knowing what the model expects (raw or binarized, and which kind of binarization). There's no model metadata in Tesseract. (And in Calamari, it could be stored in the model metadata, but the trainer does not do that.) The model's publisher (in this case, @stweil) must document what the model was trained on (both what kind of material and in what digital form). The Tesseract models from Mannheim are usually documented on the tesstrain Wiki. Their Kraken models, however, point to the wiki pages of the respective GT repos. |
I'm just wondering a bit about the different recognition results using tesseract5.3.0 and OCR-D with ocrd-olena-binarize && ocrd-tesserocr-segment.
Original TIF: https://digi.ub.uni-heidelberg.de/diglitData/v/heidelberg1592_-_04manual.tif
Result using tesseract5.3.0 -l Fraktur_GT4Hist... (right column = ground truth):
and using tesserocr-segment and calamari-recognize (fraktur_historical1.0) with OCR-D:
and using tesserocr-segment and tesserocr-recognize (Fraktur_GT4Hist...) with OCR-D:
It seems that the OCR-D "tesserocr" segmentation differs somewhat from the standalone Tesseract segmentation (perhaps because of olena-binarize?), but I can't find a big change in line/region segmentation etc. in the Tesseract changelog of the last year.