- The global corpus of pre-modern texts is small compared with modern corpora and, occasional discoveries notwithstanding, does not grow.
- State-of-the-art natural-language processing relies on data-hungry machine-learning techniques.
- Hence, if research in low-resource languages like Ancient Greek, Latin, Old English, Pali, Sanskrit, and Classical Chinese is to be able to leverage these tools long-term, a strategy for maximizing the data inherent in the small corpus must be adopted.
- Like any corpus, the pre-modern corpus can be subjected to data augmentation methods like "sliding window" and shuffling; however, data augmentation only gets you so far.
- The real superpower of the corpus lies in the wealth of alternative readings present in critical editions. Every single
- Alternative readings can be either editorial conjectures or readings from differing manuscripts.
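
The ideas in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular tool: the function names (`sliding_windows`, `shuffled_copies`, `expand_readings`) are hypothetical, the toy English tokens are placeholders rather than real apparatus data, and treating each alternative reading as a slot that can be filled independently is only one possible way to turn a critical apparatus into extra training text.

```python
from itertools import product
import random


def sliding_windows(tokens, size, step=1):
    """Return overlapping windows of `size` tokens, advancing by `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, step)]


def shuffled_copies(sentences, n_copies, seed=0):
    """Return `n_copies` reorderings of the sentence list (sentence-level shuffling)."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    copies = []
    for _ in range(n_copies):
        copy = list(sentences)
        rng.shuffle(copy)
        copies.append(copy)
    return copies


def expand_readings(segments):
    """Yield every text obtainable by choosing one reading per segment.

    Assumption: each segment is a list of alternative readings, and a
    segment without variants is a one-element list.
    """
    for choice in product(*segments):
        yield " ".join(choice)


if __name__ == "__main__":
    # Placeholder data: two of the four segments carry an alternative reading.
    segments = [["the"], ["quick", "swift"], ["fox"], ["jumps", "leaps"]]
    for text in expand_readings(segments):
        print(text)
    print(sliding_windows("the quick fox jumps high".split(), size=3))
```

Note that the number of expanded texts is the product of the number of readings per segment, so a passage with many variants grows very quickly; in practice one would probably sample from the combinations rather than enumerate them all.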