Releases: globalise-huygens/language-identification-data

Language Identifications

12 Dec 14:13
c003bd8

This is the first release of the Globalise VOC Corpus Language Identifications. We automatically identified c. 12,200 non-Dutch-language pages written (in part or in whole) in French, Latin, English, Portuguese, Spanish, German, Italian, Danish, and Malay (in Latin script). In addition, we manually identified a further c. 180 pages written (in part or in whole) in several non-Latin-script languages, including Malay (in Arabic script), Chinese, Persian, Tamil, and Sinhala, as well as pages written in cipher.

We very much welcome further contributions and corrections to this data from the community.