Question on pretrained models and conversion to ALTO XML #19

Open
mcoonen opened this issue Jan 25, 2024 · 6 comments

@mcoonen

mcoonen commented Jan 25, 2024

Hi Rutger and other people of the Loghi-community,

Thank you for your great work on Loghi and the underlying set of tooling.

This post is not really an issue, but more of a question. We're taking our first steps on the transcription journey with Loghi at Maastricht University Library. I succeeded in transcribing some pages with the pretrained models that are linked from the readme.

However, the results are not satisfactory. The content I'm transcribing is handwritten and printed Dutch text from the 20th century. The output that Loghi generates feels a bit like old Dutch, with frequent use of 'ae', 'y' instead of 'ij', and 's' instead of 'z'. My first thought is that this might be due to the model used, which is, according to the readme, trained on "17th and 18th century handwritten dutch."

So, I'm looking for a better model to work with our 20th century content. Are there any pretrained models available for download, or do you recommend that I train my own model?

My second question is about converting the PageXML output to ALTO XML. I found this very useful set of Jupyter notebooks for working with the PageXML format, but at first glance it does not include code for converting to ALTO XML.
Any recommendations pointing me in the right direction are welcome.

Thank you in advance.
Best regards,
Maarten Coonen

@rvankoert
Collaborator

Yes, it seems very likely that both the writing style and the language are too far off.
Currently we don't have a public model for 20th century Dutch print/handwriting. We will probably make a few 20th century models available in January 2025, as an embargo will be lifted by then.

You could train your own model by fine-tuning the current public model, which should require very little ground truth on your side. Alternatively, you could build a model from scratch; in that case I suggest you take a look at MinionGeneratePageImages, which can be used to create synthetic ground truth: https://github.com/knaw-huc/loghi-tooling

OCR-D created something for converting page to alto:
https://github.com/kba/page-to-alto

We also have some internal tooling for conversion, but that still needs a bit of work before public release.

I'll keep this issue open for now, as someone else might have a 20th century Dutch model.

@coret

coret commented Jan 26, 2024

We will probably make a few 20th century models available in January 2025, as an embargo will be lifted by then.

I'm curious: are these models trained on CABR material (which also becomes public in 2025)?

@rvankoert
Copy link
Collaborator

@coret your hunch is correct.

@mcoonen
Author

mcoonen commented Jan 30, 2024

Thank you very much for your answers @rvankoert .

OCR-D created something for converting page to alto:
https://github.com/kba/page-to-alto

Thanks for the hint. I've checked it out and it works like a charm!
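For future readers, this is roughly how I invoked it; a minimal sketch, assuming the tool installs from PyPI under the name page-to-alto and writes the ALTO result to stdout (check the repository readme for the actual interface; the file names below are just placeholders):

    pip install page-to-alto
    # convert a single PageXML file produced by Loghi to ALTO XML
    page-to-alto page/example_0001.xml > alto/example_0001.xml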

We will probably make a few 20th century models available in January 2025, as an embargo will be lifted by then.

Great! I will be on the lookout for those. Also, thanks for keeping the issue open in case someone appears with a model trained on 20th century data. In the meantime I will try to train my own model.

You could train your own model by fine-tuning the current public model, which should require very little ground truth on your side.

Apologies for my ignorance, but how can I fine-tune the model? I found files like laypa/general/baseline/config.yaml and loghi-htr/generic-2023-02-15/file.txt, but I am not sure what to change in there, or whether these are even the files I'm looking for.
Are there any instructions or documentation that I can read?

@mcoonen
Author

mcoonen commented Jan 31, 2024

Hi,

Answering my own question here. I managed to create training data and train my own model by fine-tuning the public model.

Briefly, I took the following approach:

  1. Put the images in a folder.
  2. Run Loghi (./na-pipeline.sh) using a pretrained model to generate PageXML.
  3. Edit the generated PageXML files and correct any mistakes.
  4. Use these curated PageXML files and the images to create training data (snippets).
  5. Edit the na-pipeline-train.sh script: set HTRBASEMODEL to public-models/loghi-htr/generic-2023-02-15, set USEBASEMODEL=1, and set listdir to the path where the training data is stored (see the sketch after this list).
  6. Run the training script (./na-pipeline-train.sh).
  7. Use the resulting model by editing na-pipeline.sh and setting HTRLOGHIMODEL to the model created in step 6.
  8. Perform a subsequent Loghi run (./na-pipeline.sh) on a new set of images to be transcribed.
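For steps 5 and 7, the edits boil down to changing a few variables near the top of the two scripts. A minimal sketch, assuming the variables look like they did in my copy of the scripts (the paths are placeholders for your own setup):

    # na-pipeline-train.sh (step 5)
    HTRBASEMODEL=public-models/loghi-htr/generic-2023-02-15  # public model to fine-tune from
    USEBASEMODEL=1                                           # start from the base model instead of scratch
    listdir=/path/to/training-data                           # folder with the curated snippets

    # na-pipeline.sh (step 7)
    HTRLOGHIMODEL=/path/to/model-from-step-6                 # use the fine-tuned model for inference

After that, steps 6 and 8 are just running ./na-pipeline-train.sh and ./na-pipeline.sh as usual; check the readme for the exact invocation on your version.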

I found the instructions in the readme a bit puzzling and I see possibilities for improvement. @rvankoert, would you appreciate a Pull Request in which I rewrite and extend some parts of the readme to make things clearer?

@rvankoert
Collaborator

Sorry for the late response: yes, we would definitely appreciate help in making the readme better :)

rvankoert added a commit that referenced this issue Jan 7, 2025