Question on pretrained models and conversion to ALTO XML #19

Open
mcoonen opened this issue Jan 25, 2024 · 6 comments

@mcoonen

mcoonen commented Jan 25, 2024

Hi Rutger and other people of the Loghi-community,

Thank you for your great work on Loghi and the underlying set of tooling.

This post is not really an issue, but more of a question. We're taking our first steps on the transcription journey with Loghi at Maastricht University Library. I succeeded in transcribing some pages with the pretrained models that are linked from the readme.

However, the results are not satisfactory. The content I'm transcribing is handwritten and printed Dutch text from the 20th century. The output that Loghi generates feels a bit like old Dutch, with frequent use of 'ae', 'y' instead of 'ij', and 's' instead of 'z'. My first thought is that this might be due to the model used, which is, according to the readme, trained on "17th and 18th century handwritten dutch."

So, I'm looking for a better model to work with our 20th century content. Are there any pretrained models available for download, or do you recommend that I train my own model?

My second question is about converting the PageXML output to ALTO XML. I found this very useful set of Jupyter notebooks for working with the PageXML format, but at first glance it does not include code for converting to ALTO XML.
Any recommendations pointing me in the right direction are welcome.

Thank you in advance.
Best regards,
Maarten Coonen

@rvankoert
Collaborator

Yes, it seems very likely that both the writing style and the language are too far off.
Currently we don't have a public model for 20th century Dutch print/handwriting. We will probably make a few 20th century models available in January 2025, as an embargo will be lifted by then.

You could train your own model by fine-tuning the current public model, which should require very little ground truth on your side. Alternatively, you could build a model from scratch; in that case I suggest you take a look at MinionGeneratePageImages, which can be used to create synthetic ground truth: https://github.com/knaw-huc/loghi-tooling

OCR-D created something for converting page to alto:
https://github.com/kba/page-to-alto

We also have some internal tooling for conversion, but that still needs a bit of work before public release.

I'll keep this issue open for now, as someone else might have a 20th century Dutch model.

@coret

coret commented Jan 26, 2024

We will probably make a few 20th century models available in January 2025, as an embargo will be lifted by then.

I'm curious: are these models trained on CABR material (which also becomes public in 2025)?

@rvankoert
Copy link
Collaborator

@coret your hunch is correct.

@mcoonen
Author

mcoonen commented Jan 30, 2024

Thank you very much for your answers @rvankoert .

OCR-D created something for converting page to alto:
https://github.com/kba/page-to-alto

Thanks for the hint. I've checked it out and it works like a charm!
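For future readers, this is roughly how I invoked it; a minimal sketch, assuming the tool installs from PyPI under the name page-to-alto and writes the ALTO result to stdout (check the repository readme for the actual interface; the file names below are just placeholders):

    pip install page-to-alto
    # convert a single PageXML file produced by Loghi to ALTO XML
    page-to-alto page/example_0001.xml > alto/example_0001.xml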

We will probably make a few 20th century models available in January 2025, as an embargo will be lifted by then.

Great! I will be on the lookout for those. Also, thanks for keeping the issue open in case someone appears with a model trained on 20th century data. In the meantime I will try to train my own model.

You could train your own model by fine-tuning the current public model, which should require very little ground truth on your side.

Apologies for my ignorance, but how can I fine-tune the model? I found files like laypa/general/baseline/config.yaml and loghi-htr/generic-2023-02-15/file.txt, but I am not sure what to change in there, or whether these are even the files I'm looking for.
Are there any instructions or documentation that I can read?

@mcoonen
Author

mcoonen commented Jan 31, 2024

Hi,

Answering my own question here. I managed to create training data and train my own model by fine-tuning the public model.

Briefly, I took the following approach:

  1. Put the images in a folder.
  2. Run Loghi (./na-pipeline.sh) using a pretrained model to generate PageXML.
  3. Edit the generated PageXML files and correct any mistakes.
  4. Use these curated PageXML files and the images to create training data (snippets).
  5. Edit the na-pipeline-train.sh script: set HTRBASEMODEL to public-models/loghi-htr/generic-2023-02-15, set USEBASEMODEL=1, and set listdir to the path where the training data is stored (see the sketch after this list).
  6. Run the training script (./na-pipeline-train.sh).
  7. Use the resulting model by editing na-pipeline.sh and setting HTRLOGHIMODEL to the model created in step 6.
  8. Perform a subsequent Loghi run (./na-pipeline.sh) on a new set of images to be transcribed.
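For steps 5 and 7, the edits boil down to changing a few variables near the top of the two scripts. A minimal sketch, assuming the variables look like they did in my copy of the scripts (the paths are placeholders for your own setup):

    # na-pipeline-train.sh (step 5)
    HTRBASEMODEL=public-models/loghi-htr/generic-2023-02-15  # public model to fine-tune from
    USEBASEMODEL=1                                           # start from the base model instead of scratch
    listdir=/path/to/training-data                           # folder with the curated snippets

    # na-pipeline.sh (step 7)
    HTRLOGHIMODEL=/path/to/model-from-step-6                 # use the fine-tuned model for inference

After that, steps 6 and 8 are just running ./na-pipeline-train.sh and ./na-pipeline.sh as usual; check the readme for the exact invocation on your version.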

I found the instructions in the readme a bit puzzling and I see possibilities for improvement. @rvankoert, would you appreciate a Pull Request in which I rewrite and extend some parts of the readme to make things clearer?

@rvankoert
Collaborator

Sorry for the late response: yes, we would definitely appreciate help in making the readme better :)

rvankoert added a commit that referenced this issue Jan 7, 2025