Language Identifications

@kintopp released this 12 Dec 14:13
c003bd8

This is the first release of the Globalise VOC Corpus Language Identifications. We automatically identified c. 12,200 pages written, in part or in whole, in a language other than Dutch: French, Latin, English, Portuguese, Spanish, German, Italian, Danish, or Malay (in Latin script). In addition, we manually identified a further c. 180 pages written, in part or in whole, in several languages that use non-Latin scripts, including Malay (in Arabic script), Chinese, Persian, Tamil, and Sinhala, as well as pages written in cipher.

We very much welcome contributions and corrections to this data from the community.