Question on pretrained models and conversion to ALTO XML #19
Yes, it seems very likely that both the writing style and the language are too far off. You could train your own model by fine-tuning the current public model, which should require very little ground truth on your side, or alternatively build a model from scratch. In that case I suggest you take a look at MinionGeneratePageImages, which can be used to create synthetic ground truth: https://github.com/knaw-huc/loghi-tooling. OCR-D created something for converting PAGE to ALTO. We also have some internal tooling for conversion, but that still needs a bit of work before public release. I'll keep this issue open for now, as someone else might have a 20th-century Dutch model.
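Until such tooling is available, the rough shape of a PageXML-to-ALTO conversion can be sketched as below. This is a minimal illustrative sketch only, not the Loghi or OCR-D converter: the PAGE namespace is version-dependent, and coordinates plus the ALTO metadata required for strict schema validity are omitted.

```python
# Minimal PageXML -> ALTO text conversion sketch (illustrative only, not the
# official Loghi or OCR-D tooling). Assumes the 2013-07-15 PAGE namespace;
# adjust PAGE_NS for other PAGE versions. Coordinates and the ALTO
# <Description> metadata needed for strict validity are left out.
from lxml import etree

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"
NS = {"p": PAGE_NS}


def page_to_alto(page_path: str, alto_path: str) -> None:
    page = etree.parse(page_path)
    alto = etree.Element(f"{{{ALTO_NS}}}alto", nsmap={None: ALTO_NS})
    layout = etree.SubElement(alto, f"{{{ALTO_NS}}}Layout")
    alto_page = etree.SubElement(layout, f"{{{ALTO_NS}}}Page")
    print_space = etree.SubElement(alto_page, f"{{{ALTO_NS}}}PrintSpace")

    # Map each PAGE TextRegion/TextLine onto an ALTO TextBlock/TextLine,
    # carrying over the ids and the transcribed Unicode text.
    for region in page.iterfind(".//p:TextRegion", NS):
        block = etree.SubElement(print_space, f"{{{ALTO_NS}}}TextBlock",
                                 ID=region.get("id", ""))
        for line in region.iterfind("p:TextLine", NS):
            unicode_el = line.find("p:TextEquiv/p:Unicode", NS)
            text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
            alto_line = etree.SubElement(block, f"{{{ALTO_NS}}}TextLine",
                                         ID=line.get("id", ""))
            etree.SubElement(alto_line, f"{{{ALTO_NS}}}String", CONTENT=text)

    etree.ElementTree(alto).write(alto_path, xml_declaration=True,
                                  encoding="UTF-8", pretty_print=True)


# Hypothetical paths, for illustration:
# page_to_alto("page/scan_0001.xml", "alto/scan_0001.xml")
```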
I'm curious: are these models trained on CABR material (which also becomes public in 2025)?
@coret your hunch is correct.
Thank you very much for your answers, @rvankoert.
Thanks for pointing me to that one. I've checked it out and it works like a charm!
Great! I will be on the lookout for those. Also, thanks for keeping the issue open in case someone appears with a model trained on 20th-century data. In the meantime I will try to train my own model.
Apologies for my ignorance, but how can I fine-tune the model? I found files like
Hi, answering my own question here: I think I managed to create training data and train my own model by fine-tuning the public model. Briefly, I took the following approach:
I found the instructions in the readme a bit puzzling and I see possibilities for improvement. @rvankoert, would you appreciate a pull request in which I rewrite and extend some parts of the readme to make things clearer?
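For anyone taking a similar fine-tuning route, here is a minimal sketch of one plausible data-preparation step: building a list that pairs each line image with its transcription. The "image path, tab, ground-truth text" format and the assumption that transcriptions live in .txt files next to the images are assumptions here, not a description of the actual workflow; check the loghi-htr documentation for the format it really expects.

```python
# Hypothetical helper for preparing a fine-tuning list: pairs each line image
# with a ground-truth transcription stored in a .txt file of the same name.
# The "image path <TAB> text" format is an assumption; verify it against the
# loghi-htr documentation before training.
from pathlib import Path


def build_lines_list(line_dir: str, out_file: str) -> None:
    rows = []
    for img in sorted(Path(line_dir).glob("*.png")):
        gt = img.with_suffix(".txt")  # transcription assumed to sit next to the image
        if not gt.exists():
            continue  # skip line images without a transcription
        text = gt.read_text(encoding="utf-8").strip()
        rows.append(f"{img.resolve()}\t{text}")
    Path(out_file).write_text("\n".join(rows) + "\n", encoding="utf-8")


# Hypothetical layout: training/lines/*.png with matching *.txt files
# build_lines_list("training/lines", "training/train_list.txt")
```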
Sorry for the late response: yes, we would definitely appreciate help in making the readme better :)
Hi Rutger and other people of the Loghi-community,
Thank you for your great work on Loghi and the underlying set of tooling.
This post is not really an issue, but more of a question. We're making our first steps in the transcription journey with Loghi at Maastricht University Library. I succeeded in transcribing some pages with the pretrained models that are linked from the readme.
However, the results are not satisfactory. The content that I'm transcribing is handwritten and printed Dutch text from the 20th century. The output that Loghi generates feels a bit like old Dutch, with frequent use of 'ae', 'y' instead of 'ij', and 's' instead of 'z'. My first thought is that this might be due to the model used, which is, according to the readme, trained on "17th and 18th century handwritten dutch."
So, I'm looking for a model better suited to our 20th-century content. Are there any pretrained models available for download, or would you recommend that I start training my own model?
My second question is about converting the PageXML output to ALTO XML. I found this very useful set of Jupyter notebooks for working with the PageXML format. At first glance, it does not include code for converting to ALTO XML.
Any recommendations that guide me in the proper direction are welcome.
Thank you in advance.
Best regards,
Maarten Coonen