This code was used to train a new post-OCR correction model for Swedish, which can be downloaded from https://huggingface.co/KBLab/swedish-ocr-correction.
The model and implementation are based on "Post-OCR Correction of Digitized Swedish Newspapers with ByT5", whose original model can be downloaded here.
The training data is described in "A Two-OCR Engine Method for Digitized Swedish Newspapers" and is partially available via Språkbanken Text. The more recent annotated newspapers are not publicly available due to copyright restrictions.

| Model | CER | WER |
|---|---|---|
| Original OCR | 3.01 | 13.23 |
| viklofg | 1.92 | 7.41 |
| KBLab | 1.57 | 6.23 |
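
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how character error rate (CER) and word error rate (WER) are commonly computed: the Levenshtein edit distance between the reference and the OCR output, divided by the reference length and expressed as a percentage. This README does not state the exact evaluation procedure (normalization, tokenization, averaging over documents), so the function names, the whitespace tokenization, and the example strings below are assumptions for illustration only, not the script that produced the numbers above.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent (assumed definition, see lead-in)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, using whitespace tokenization (an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical example strings, not taken from the evaluation data.
    truth = "Stockholm är Sveriges huvudstad"
    ocr = "Stockho1m är Sverigcs huvudstad"
    print(f"CER: {cer(truth, ocr):.2f}")
    print(f"WER: {wer(truth, ocr):.2f}")
```

Because a single wrong character makes the whole word count as an error, WER is typically several times higher than CER on the same text, which matches the pattern in the table.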