frak models in ocrd resmgr #404
Which one of the two models is "better", and how did you compare them?
Comparison in the sense of "check whether the model files have the same content".
That's strange indeed. It's not to be expected from the vanilla tesstrain rules (even the fast variant just does ConvertToInt). And the concrete wordlist looks very awkward (contains 400k fullforms, nearly half of which are made of strange punctuation characters indicative of absent tokenisation, and the actual tokens are clearly scraped off the web, not historic at all). I would understand if the wordlist from ... @stweil, can you explain?
Because of the additional components the file is larger. Typically, models with an (ideally domain-specific) wordlist can achieve slightly higher recognition rates, but sometimes they can also lead to OCR results which differ from the printed text. And yes, this word list contains a lot of entries which should be removed. That is inherited from all standard Tesseract word lists.
No, the latter word list is about twice the size, also with texts from the web, but contains none of these strange words with punctuation (non-tokenised), and does contain ... Of course it would be preferable to have a standard dictionary for (say) 18th century German. We could export the fullforms from DTA lexdb, for example. (But this must be accompanied by a thorough evaluation.) Regardless, the word list in that model file looks exceptionally bad (much worse than the Tesseract word lists) and should be improved.
I have now distilled a list of full forms, capping at different frequencies, respectively:
I filtered by part-of-speech, removing punctuation, numbers and non-words (XY):

```sql
select trim(u,'"') from csv where f > 100 and p != "$(" and p != "$," and p != "$." and p != "FM.xy" and p != "CARD" and p != "XY";
```

Furthermore, I removed those entries which have not been properly tokenised (indicated by leading punctuation) or are merely numbers (but still do not get p=CARD):

```shell
grep -v -e '^[[:punct:]]' -e '^[[:digit:][:punct:]]*$'
```

The quality is very good! Maybe I'll also recompose the number and punc DAWGs for the additional historic patterns (e.g.

I will try to use this with frak2021, but also GT4HistOCR and others. I guess I'll do some recognition experiments and evaluation before publishing the modified models.
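The grep step above can be approximated in Python. This is only a rough sketch of the same filter, not the pipeline actually used: note that `string.punctuation` covers ASCII punctuation only, while POSIX `[[:punct:]]` is locale-dependent, and the sample words here are made up for illustration.

```python
import string

def keep(word: str) -> bool:
    """Mirror of: grep -v -e '^[[:punct:]]' -e '^[[:digit:][:punct:]]*$'"""
    if not word:
        return False
    # leading punctuation indicates a non-tokenised entry
    if word[0] in string.punctuation:
        return False
    # entries consisting only of digits and punctuation are mere numbers
    if all(c in string.digits + string.punctuation for c in word):
        return False
    return True

words = [",und", "Haus", "1806", "18.", "Weñ"]
print([w for w in words if keep(w)])  # ['Haus', 'Weñ']
```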
Done: see
In my tests frak2021 is much better than GT4HistOCR, so using it with GT4HistOCR might not be worth the effort.
Sure, that's why it's among the models I build the dict into – see the full list of assets. Some evaluation (which material, which model, whether dict or not, and which frequency cap is preferable) will follow.
Here is my small tool for checking the wordlists of .traineddata files: https://gist.github.com/jbarth-ubhd/8d5ceb4035bf2d89700117a311209f20
@bertsky: but frak2021_dta10+100 do not contain »ſ«:

```
AMBIGIOUS (EXCERPT): 1sten A/ AP. As. AZ. Basalt- Bauers- Besitz- Bietsch- c. cas. Centralbl. Chrysost. cl. Corn. dial. Diener. Ding- Dinge. Ebd. Eigen- eigentl. Eisen. euch. Eurip. fgm. FML. fundam. g1 Gebiets- Geitz- Generals- G.n GOtts Griseb. Haubt- haus- HErre hsg. inst. Jahrbb. Jungfrau- k. Kg. Kiefer- Lactant. lap. legit. Loose. Magdalenen- Mai- Mehl= Namen. nat. neu- NJmb Normal- O1 Pall. pan. Pfand- Pfl. proc. Reb- redet. Rev. Rhodigin. Rich. Roman- Sc. Schulen. Schweine- Sed. SEin SJndt Spargel- Spitz- Strom. Syllog. Trauben- Trav. Trias- Trift- VIEUSSENS. VVilliam Wach- W.-B. wohl- Wolf. XCVII. y2 Ztg. zwei-
```
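The missing »ſ« can be checked for with a few lines of Python. This is only a sketch, not part of the linked tool; the sample words are made up, and in practice one would read the extracted wordlist file line by line.

```python
# U+017F LATIN SMALL LETTER LONG S; a Fraktur wordlist
# containing no entry with this character is suspicious.
LONG_S = "\u017f"  # ſ

def long_s_entries(words):
    """Return all wordlist entries containing the long s."""
    return [w for w in words if LONG_S in w]

sample = ["Haus", "Weſen", "wenn", "ſein"]
print(long_s_entries(sample))  # ['Weſen', 'ſein']
```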
Indeed – something went wrong. Thanks @jbarth-ubhd, I'll investigate!
Ok, I found the problem. See new release.
What's with the >100%, BTW?
>100% is because I inspected only every 1/0.003th word, to keep the output compact, and multiplied the count – I'll have a look at this.
Just inspected frak2021_dta50.traineddata. Ambiguous:
a lot of spaces after words(?). And not NFC (the double counting was my bug). The spaces were not in the frak2021_dta10/100 I downloaded until Jan 30 11:55.
Now with much nicer output:

```
welchẽ␣␣␣  (not NFC)
Welchs␣␣␣␣
weñ␣␣␣␣␣␣  (not NFC)
Weñ␣␣␣␣␣␣  (not NFC)
wenigen␣␣␣
```
Wow, I should have checked. Thanks again for being thorough, @jbarth-ubhd – much appreciated! See the new release.
Do we really want that? (Even if DTA decided not to do it?)
If we want NFC? Don't know. I inserted it just because otherwise I won't notice this easily. I can remove this check.
I just checked: tesstrain does NFC on the input GT (via ...). It is also used in most CER measurement tools. I feel obliged to comply with this obvious convention in the OCR space.
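The normalization issue can be illustrated with Python's standard `unicodedata` module. A minimal sketch using the `weñ` example from above: the two forms render identically but compare unequal, which is exactly how double counting in a wordlist arises.

```python
import unicodedata

decomposed = "wen\u0303"  # 'wen' + U+0303 COMBINING TILDE (NFD style)
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))                # 4 3
print(composed == "we\u00f1")                        # True: precomposed ñ
print(unicodedata.is_normalized("NFC", decomposed))  # False
print(unicodedata.is_normalized("NFC", composed))    # True
```

(`unicodedata.is_normalized` requires Python 3.8 or later.)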
There we go
Tesseract also does NFC when generating lstmf files, but I'd like to change that because I want to be able to train models with decomposed umlauts and other characters with diacritics. And I already have a Tesseract branch which no longer requires box and lstmf files for the training.
Right, but you can choose,
And it's configured differently for various mother tongues. So my fixed NFC in the DTA LM was premature, is what you are saying, @stweil?
No, my comment was just meant as information for you.
With ..._dta50 some punctuation is missing, but there is almost no word diff ... I had expected the dictionary to have a greater impact.
So it looks like using a dictionary makes recognition of punctuation worse (unless the dictionary also contains the words with the punctuation)? That's not the kind of impact that is desired.
Me too. But the averages do go down overall (if just a little) in my experiments. I did not fiddle with
It would appear so. But there may be a general problem with re-integrating the punctuation DAWG. I am also still trying to modify it in a way to cover extra punctuation characters like
I've compared these frak models:
- ocrd: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata (from ocrd resmgr)
- ubma: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_fast/frak2021_1.069.traineddata (from https://ocr-bw.bib.uni-mannheim.de/faq/)
size & md5sum:
content after `combine_tessdata -u x.traineddata aa`:
ubma is with .lstm-word-dawg, ocrd is without.
The ocrd lstm is 3.3M, the ubma lstm is 432k.
Shouldn't ocrd use the ubma file for fraktur/gothic?