frak models in ocrd resmgr #404
Which one of the two models is "better", and how did you compare them?
Comparison in the sense of "check whether the model files have the same content".
That's strange indeed. It's not to be expected from the vanilla tesstrain rules (even the fast variant just does ConvertToInt). And the concrete wordlist looks very awkward (contains 400k fullforms, nearly half of which are made of strange punctuation characters indicative of absent tokenisation, and the actual tokens are clearly scraped off the web, not historic at all). I would understand if the wordlist from ... @stweil, can you explain?
Because of the additional components the file is larger. Typically, models with an (ideally domain-specific) wordlist can achieve slightly higher recognition rates, but sometimes they can also lead to OCR results which differ from the printed text. And yes, this word list contains a lot of entries which should be removed. That is inherited from all standard Tesseract word lists.
No, the latter word list is about twice the size, also with texts from the web, but contains none of these strange words with punctuation (non-tokenised), and does contain ... Of course it would be preferable to have a standard dictionary for (say) 18th century German. We could export the fullforms from DTA lexdb, for example. (But this must be accompanied by a thorough evaluation.) Regardless, the word list in that model file looks exceptionally bad (much worse than the Tesseract word lists) and should be improved.
I have now distilled a list of full forms, capping at different frequencies, respectively:
I filtered by part-of-speech, removing punctuation, numbers and non-words (XY):

```sql
select trim(u,'"') from csv where f > 100 and p != "$(" and p != "$," and p != "$." and p != "FM.xy" and p != "CARD" and p != "XY";
```

Furthermore, I removed those entries which have not been properly tokenised (indicated by leading punctuation) or are merely numbers (but still do not get p=CARD):

```shell
grep -v -e '^[[:punct:]]' -e '^[[:digit:][:punct:]]*$'
```

The quality is very good! Maybe I'll also recompose the number and punc DAWGs for the additional historic patterns (e.g.

I will try to use this with frak2021, but also GT4HistOCR and others. I guess I'll do some recognition experiments and evaluation before publishing the modified models.
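The grep step above can be approximated in Python. This is only a rough sketch of the same filter, not the pipeline actually used: note that `string.punctuation` covers ASCII punctuation only, while POSIX `[[:punct:]]` is locale-dependent, and the sample words here are made up for illustration.

```python
import string

def keep(word: str) -> bool:
    """Mirror of: grep -v -e '^[[:punct:]]' -e '^[[:digit:][:punct:]]*$'"""
    if not word:
        return False
    # leading punctuation indicates a non-tokenised entry
    if word[0] in string.punctuation:
        return False
    # entries consisting only of digits and punctuation are mere numbers
    if all(c in string.digits + string.punctuation for c in word):
        return False
    return True

words = [",und", "Haus", "1806", "18.", "Weñ"]
print([w for w in words if keep(w)])  # ['Haus', 'Weñ']
```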
Done: see
In my tests frak2021 is much better than GT4HistOCR, so using it with GT4HistOCR might not be worth the effort.
Sure, that's why it's among the models I build the dict into – see the full list of assets. Some evaluation (which material, which model, whether dict or not, and which frequency cap is preferable) will follow.
Here is my small tool for checking the wordlists of .traineddata files: https://gist.github.com/jbarth-ubhd/8d5ceb4035bf2d89700117a311209f20
@bertsky: but frak2021_dta10+100 do not contain »ſ«:

```
AMBIGIOUS (EXCERPT): 1sten A/ AP. As. AZ. Basalt- Bauers- Besitz- Bietsch- c. cas. Centralbl. Chrysost. cl. Corn. dial. Diener. Ding- Dinge. Ebd. Eigen- eigentl. Eisen. euch. Eurip. fgm. FML. fundam. g1 Gebiets- Geitz- Generals- G.n GOtts Griseb. Haubt- haus- HErre hsg. inst. Jahrbb. Jungfrau- k. Kg. Kiefer- Lactant. lap. legit. Loose. Magdalenen- Mai- Mehl= Namen. nat. neu- NJmb Normal- O1 Pall. pan. Pfand- Pfl. proc. Reb- redet. Rev. Rhodigin. Rich. Roman- Sc. Schulen. Schweine- Sed. SEin SJndt Spargel- Spitz- Strom. Syllog. Trauben- Trav. Trias- Trift- VIEUSSENS. VVilliam Wach- W.-B. wohl- Wolf. XCVII. y2 Ztg. zwei-
```
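The missing »ſ« can be checked for with a few lines of Python. This is only a sketch, not part of the linked tool; the sample words are made up, and in practice one would read the extracted wordlist file line by line.

```python
# U+017F LATIN SMALL LETTER LONG S; a Fraktur wordlist
# containing no entry with this character is suspicious.
LONG_S = "\u017f"  # ſ

def long_s_entries(words):
    """Return all wordlist entries containing the long s."""
    return [w for w in words if LONG_S in w]

sample = ["Haus", "Weſen", "wenn", "ſein"]
print(long_s_entries(sample))  # ['Weſen', 'ſein']
```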
Indeed – something went wrong. Thanks @jbarth-ubhd, I'll investigate!
Ok, I found the problem. See new release.
What's with the >100%, BTW?
>100% is because I inspected only every 1/0.003th word, to keep the output compact, and multiplied the count – I'll have a look at this.
Just inspected frak2021_dta50.traineddata. Ambiguous:
a lot of spaces after words(?). And not NFC (the double counting was my bug). The spaces were not in the frak2021_dta10/100 I downloaded until Jan 30 11:55.
Now with much nicer output:

```
welchẽ␣␣␣  (not NFC)
Welchs␣␣␣␣
weñ␣␣␣␣␣␣  (not NFC)
Weñ␣␣␣␣␣␣  (not NFC)
wenigen␣␣␣
```
Wow, I should have checked. Thanks again for being thorough, @jbarth-ubhd – much appreciated! See the new release.
Do we really want that? (Even if DTA decided not to do it?)
If we want NFC? Don't know. I inserted it just because otherwise I won't notice this easily. I can remove this check.
I just checked: tesstrain does NFC on the input GT (via ...). It is also used in most CER measurement tools. I feel obliged to comply with this obvious convention in the OCR space.
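The normalization issue can be illustrated with Python's standard `unicodedata` module. A minimal sketch using the `weñ` example from above: the two forms render identically but compare unequal, which is exactly how double counting in a wordlist arises.

```python
import unicodedata

decomposed = "wen\u0303"  # 'wen' + U+0303 COMBINING TILDE (NFD style)
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))                # 4 3
print(composed == "we\u00f1")                        # True: precomposed ñ
print(unicodedata.is_normalized("NFC", decomposed))  # False
print(unicodedata.is_normalized("NFC", composed))    # True
```

(`unicodedata.is_normalized` requires Python 3.8 or later.)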
There we go
Tesseract also does NFC when generating lstmf files, but I'd like to change that because I want to be able to train models with decomposed umlauts and other characters with diacritics. And I already have a Tesseract branch which no longer requires box and lstmf files for the training.
Right, but you can choose,
And it's configured differently for various mother tongues. So my fixed NFC in the DTA LM was premature, is what you are saying, @stweil?
No, my comment was just meant as information for you.
With ..._dta50 some punctuation is missing, but there is almost no word diff ... I had expected the dictionary to have a greater impact.
So it looks like using a dictionary makes recognition of punctuation worse (unless the dictionary also contains the words with the punctuation)? That's not the kind of impact that is desired.
Me too. But the averages do go down overall (if just a little) in my experiments. I did not fiddle with
It would appear so. But there may be a general problem with re-integrating the punctuation DAWG. I am also still trying to modify it in a way to cover extra punctuation characters like
I've compared these frak models:
- ocrd: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata (from ocrd resmgr)
- ubma: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_fast/frak2021_1.069.traineddata (from https://ocr-bw.bib.uni-mannheim.de/faq/)
size & md5sum:
content after `combine_tessdata -u x.traineddata aa`:
ubma is with .lstm-word-dawg, ocrd is without.
The ocrd lstm is 3.3M, the ubma lstm is 432k.
Shouldn't ocrd use the ubma file for fraktur/gothic?