Releases · globalise-huygens/language-identification-data

This is the first release of the Globalise VOC Corpus Language Identifications. We automatically identified c. 12,200 pages written (in part or in whole) in a language other than Dutch: French, Latin, English, Portuguese, Spanish, German, Italian, Danish, and Malay (in Latin script). In addition, we manually identified a further c. 180 pages written (in part or in whole) in several non-Latin-script languages, including Malay (in Arabic script), Chinese, Persian, Tamil, and Sinhala, as well as pages written in cipher.

We very much welcome further contributions and corrections to this data from the community.

Language Identifications