Releases: globalise-huygens/language-identification-data
Language Identifications
This is the first release of the Globalise VOC Corpus Language Identifications. We automatically identified c. 12,200 non-Dutch pages written (in part or in whole) in French, Latin, English, Portuguese, Spanish, German, Italian, Danish, and Malay (in Latin script). In addition, we manually identified a further c. 180 pages written (in part or in whole) in several languages that use non-Latin scripts, including Malay (in Arabic script), Chinese, Persian, Tamil, and Sinhala, as well as pages written in cipher.
We very much welcome further contributions and corrections to this data from the community.