Story Books for 180 ISO-639-3 codes.
The data is available on HuggingFace (HF) at: https://huggingface.co/datasets/cis-lmu/GlotStoryBook.
from datasets import load_dataset
dataset = load_dataset('cis-lmu/GlotStoryBook')
print(dataset['train'][0]) # First row data
If you are not a fan of the HF dataloader, download it directly:
! wget https://huggingface.co/datasets/cis-lmu/GlotStoryBook/resolve/main/GlotStoryBook.csv
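Once the file is downloaded, it can be inspected without the HF dataloader. The snippet below is a minimal sketch using only the Python standard library. It assumes the file was saved as `GlotStoryBook.csv` in the working directory and that it has a header row. The helper name `read_rows` is hypothetical, and the column names are not assumed: they are read from the file.

```python
import csv
from itertools import islice


def read_rows(csv_path, limit=None):
    """Return up to `limit` rows of a CSV file as dicts keyed by the header row.

    With limit=None, all rows are returned.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return list(islice(reader, limit))


def main():
    # Assumption: the wget command above saved the file under this name.
    rows = read_rows("GlotStoryBook.csv", limit=3)
    if rows:
        # Print the column names found in the file rather than guessing them.
        print("Columns:", list(rows[0].keys()))
    for row in rows:
        print(row)


if __name__ == "__main__":
    main()
```

Reading only a few rows first is a cheap way to check the column layout before loading the whole file.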
We do not own any of the text from which these data have been extracted. All files were collected from the repositories hosted at https://github.com/global-asp/. The source repository for each text and file is recorded in the dataset. Each file in the dataset is associated with one license from the CC family. The license labels include 'CC BY', 'CC BY-NC', 'CC BY-NC-SA', 'CC-BY', 'CC-BY-NC', and 'Public Domain'. We also license the code, the packaging, and the metadata of these data under the cc0-1.0 license.
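Because the license labels listed above mix space and hyphen spellings ('CC BY' and 'CC-BY'), it may help to normalize them before grouping or filtering. The following is a minimal sketch, not part of the dataset tooling. The function names `normalize_license` and `allows_commercial_use` are hypothetical, and the NonCommercial check is a rough string test on the label, not a substitute for reading the license itself.

```python
def normalize_license(label):
    """Map a license label to one canonical spelling.

    'CC BY' and 'CC-BY' both become 'CC-BY'; 'Public Domain' is kept as is.
    """
    cleaned = " ".join(label.strip().split())  # collapse stray whitespace
    if cleaned.lower() == "public domain":
        return "Public Domain"
    return cleaned.upper().replace(" ", "-")


def allows_commercial_use(label):
    """Return True unless the normalized label contains the NC element."""
    return "NC" not in normalize_license(label).split("-")


# The labels listed in this README.
labels = ["CC BY", "CC BY-NC", "CC BY-NC-SA", "CC-BY", "CC-BY-NC", "Public Domain"]
for label in labels:
    print(label, "->", normalize_license(label), "| commercial use:", allows_commercial_use(label))
```

The same functions can be applied to whichever column of the dataset stores the license, once its name has been checked in the data.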
Source repositories: global-asp, asp-source, lcb-source, pb-source, sbc-source, gasp-mexico, global-pb, global-lcb, sbjm-source, sbug-source, sbno-source, sbk-source, sbuk-source, lida-source, asp-raw-db, global-lida, gasp-alternates, asp-new
If you use any part of this code and data in your research, please cite it (along with https://github.com/global-asp/) using the following BibTeX entry. This work is part of the GlotLID project.
@inproceedings{
kargaran2023glotlid,
title={{GlotLID}: Language Identification for Low-Resource Languages},
author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
year={2023},
url={https://openreview.net/forum?id=dl4e3EBz5j}
}