IndoNLG

Read this README in Bahasa Indonesia.

⚠️ Update 16/11/2024: We have updated the links to the datasets and fastText models in IndoNLG!

IndoNLG is a collection of Natural Language Generation (NLG) resources for Bahasa Indonesia covering six kinds of downstream tasks. We provide the code to reproduce the results, along with large pre-trained models (IndoBART and IndoGPT) trained on Indo4B-Plus, a corpus of around 4 billion words (about 25 GB of text). This project started as a joint collaboration between universities and industry partners, including Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, DeepMind, Gojek, and Prosa.AI.

Research Paper

IndoNLG was accepted at EMNLP 2021; you can find the details in our paper: https://aclanthology.org/2021.emnlp-main.699. If you use any component of IndoNLG, including Indo4B-Plus, IndoBART, or IndoGPT, in your work, please cite the following paper:

@inproceedings{cahyawijaya-etal-2021-indonlg,
    title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
    author = "Cahyawijaya, Samuel and Winata, Genta Indra and Wilie, Bryan and Vincentio, Karissa and Li, Xiaohong and Kuncoro, Adhiguna and Ruder, Sebastian and Lim, Zhi Yuan and Bahar, Syafri and Khodra, Masayu and Purwarianti, Ayu and Fung, Pascale",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.699",
    pages = "8875--8898",
}

Example

We provide an example that loads the IndoBART model and fine-tunes it on a machine translation task.
See the example at the following Link.

How to contribute to IndoNLG?

Be sure to check the contributing guidelines, and contact the maintainers or open an issue to gather feedback before starting your PR.

IndoNLG Downstream Task

Download and unzip the dataset from this [Link]
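Once the archive has been downloaded, the unzip step can be scripted. The following is a minimal sketch using only the Python standard library; the function name `unzip_dataset`, the archive file name, and the `dataset` destination folder are assumptions for illustration, not names defined by this repository.

```python
import zipfile
from pathlib import Path
from typing import List


def unzip_dataset(archive_path: str, dest_dir: str = "dataset") -> List[str]:
    """Extract a downloaded dataset archive and return the extracted file names.

    `archive_path` is wherever you saved the downloaded zip file;
    `dest_dir` is a hypothetical destination folder, created if missing.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        # Directory entries end with "/"; list only the files.
        return sorted(n for n in archive.namelist() if not n.endswith("/"))
```

For example, `unzip_dataset("downloaded_dataset.zip")` would extract the archive into `dataset/` and return the list of files it contained; replace the file name with the one you actually downloaded.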

Indo4B-Plus Dataset

We provide access to our large pretraining dataset.

Indo4B-Plus Dataset Upscaled (~25 GB uncompressed, 9.4 GB compressed) [Link]

IndoBART and IndoGPT Models

We provide the IndoBART and IndoGPT pretrained language models [Link]

IndoBART [Link]
IndoBART-v2 [Link]
IndoGPT [Link]

Indobenchmark Toolkit

We provide a toolkit for using the IndoNLGTokenizer at [Link]
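As a rough illustration of how the tokenizer might be used, the sketch below wraps loading and a text round trip in two small functions. It assumes the toolkit is installed as a Python package that exposes `IndoNLGTokenizer` from a module named `indobenchmark`, and that the class follows the usual Hugging Face `from_pretrained` / `encode` / `decode` interface; neither detail is stated in this README, so check the toolkit's own documentation. The `model_name` argument is a placeholder for the model identifier behind the links above.

```python
from typing import Any, List


def load_indonlg_tokenizer(model_name: str) -> Any:
    """Load an IndoNLGTokenizer for the given model identifier.

    Assumption: the toolkit provides `indobenchmark.IndoNLGTokenizer`
    with a Hugging Face style `from_pretrained` class method.
    """
    # Imported lazily so this module still imports without the toolkit installed.
    from indobenchmark import IndoNLGTokenizer

    return IndoNLGTokenizer.from_pretrained(model_name)


def round_trip(tokenizer: Any, text: str) -> str:
    """Encode text to token IDs and decode it back, as a quick sanity check."""
    token_ids: List[int] = tokenizer.encode(text)
    return tokenizer.decode(token_ids)
```

Calling `round_trip(load_indonlg_tokenizer(model_name), "aku adalah anak")` with a valid identifier would download the tokenizer files on first use, so it needs network access.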
