κρῐτῐκός - automating the OCR of critical editions of pre-modern texts

The global corpus of pre-modern texts is small compared with modern corpora and, notwithstanding the occasional discovery, does not grow.
State-of-the-art natural-language processing relies on data-hungry machine-learning techniques.
Hence, if research in low-resource languages like Ancient Greek, Latin, Old English, Pali, Sanskrit and Classical Chinese is to leverage these tools in the long term, it must adopt a strategy for getting the most data out of the small corpus.
Like any corpus, the pre-modern corpus can be subjected to data augmentation methods such as "sliding window" and shuffling, but data augmentation only gets you so far.
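
For readers unfamiliar with the two augmentation methods just named, here is a minimal sketch in Python. The function names `sliding_windows` and `shuffled`, the window size and the sample line (the opening of the Iliad) are illustrative assumptions, not part of this project's code.

```python
import random


def sliding_windows(tokens, size, stride=1):
    """Return overlapping windows of `size` tokens, advancing by `stride`."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    # A text shorter than `size` yields no windows at all.
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, stride)]


def shuffled(items, seed=0):
    """Return a shuffled copy of `items`; the fixed seed keeps runs reproducible."""
    copy = list(items)
    random.Random(seed).shuffle(copy)
    return copy


# Illustrative input only: whitespace tokenisation of a single verse.
tokens = "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος".split()
windows = sliding_windows(tokens, size=3)   # overlapping 3-token samples
training_order = shuffled(windows)          # same samples, different order
```

Both methods only rearrange or re-slice text that is already in the corpus, which is why they cannot add genuinely new material.
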
The real superpower of the corpus lies in the wealth of alternative readings present in critical editions. Every single
Alternative readings are either editorial conjectures or variants attested in differing manuscripts.
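
As a rough illustration of how alternative readings might become extra training text, the sketch below expands a base text into one copy per reading. Everything here is an assumption made for the example: the function name `expand_variants`, the apparatus format (a mapping from token index to a list of alternative readings) and the choice to substitute only one reading per copy. The sample uses the well-known πᾶσι / δαῖτα variant at Iliad 1.5 purely for illustration.

```python
def expand_variants(base, apparatus):
    """Yield the base text, then one copy per alternative reading.

    `apparatus` is a hypothetical format: {token index: [alternative readings]}.
    Each copy carries exactly one substitution, so the output grows linearly
    with the number of readings rather than combinatorially.
    """
    yield list(base)
    for index in sorted(apparatus):
        for reading in apparatus[index]:
            variant = list(base)
            variant[index] = reading
            yield variant


# Illustrative input only: three tokens and one alternative for the last one.
base = "οἰωνοῖσί τε πᾶσι".split()
apparatus = {2: ["δαῖτα"]}

texts = [" ".join(tokens) for tokens in expand_variants(base, apparatus)]
```

With the sample input, `texts` holds the base text followed by the same text with the alternative reading substituted at position 2.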
