Post OCR Correction

This code was used to train a new post-OCR correction model for Swedish, which can be downloaded here: https://huggingface.co/KBLab/swedish-ocr-correction

The model and implementation are based on Post-OCR Correction of Digitized Swedish Newspapers with ByT5, whose original model can be downloaded here.

Data

The data used to train the model is described in A Two-OCR Engine Method for Digitized Swedish Newspapers and is partially available via Språkbanken Text. The more recent annotated newspapers are not publicly available due to copyright restrictions.

Results

| Model | CER | WER |
|---|---|---|
| Original OCR | 3.01 | 13.23 |
| viklofg | 1.92 | 7.41 |
| KBLab | 1.57 | 6.23 |
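
For readers unfamiliar with the metrics, CER (character error rate) and WER (word error rate) are commonly defined as the Levenshtein edit distance between the corrected text and a ground-truth transcription, divided by the length of the ground truth in characters or words. The sketch below illustrates that common definition in plain Python. It is an assumption, not necessarily the evaluation code behind the numbers above: the function names, the whitespace tokenization, and the scaling to percent are all illustrative choices.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent (assumed definition)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, splitting on whitespace (assumed tokenization)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: two OCR-style character confusions in a four-word sentence.
reference = "Stockholm är Sveriges huvudstad"
ocr_output = "Stockho1m är Sverigcs huvudstad"
print(f"CER: {cer(reference, ocr_output):.2f}")
print(f"WER: {wer(reference, ocr_output):.2f}")
```

Because a single wrong character makes the whole word count as an error, WER is typically several times larger than CER on the same text, which matches the pattern in the table.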