This repo provides a pretrained ALBERT model ("A Lite" version of BERT) and a SentencePiece model (an unsupervised text tokenizer and detokenizer), both trained on a Mongolian text corpus.
Contents:
You can use ALBERT-Mongolian in both PyTorch and TensorFlow 2.0 via the transformers library.
Link to the HuggingFace model card 🤗
import torch
from transformers import AlbertTokenizer, AlbertForMaskedLM
tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/albert-mongolian')
model = AlbertForMaskedLM.from_pretrained('bayartsogt/albert-mongolian')
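As a quick check that the checkpoint loads correctly, you can predict a masked token. Continuing from the snippet above, the sketch below is only illustrative: the example sentence and the top-5 decoding are assumptions, not something shipped with the repo.

```python
# Predict likely fillers for a masked token (illustrative example sentence).
text = f"Би {tokenizer.mask_token} хотод амьдардаг."  # "I live in [MASK] city." (made-up sentence)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the position of the mask token and take the 5 highest-scoring vocabulary ids.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```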
- [Colab] Text classification using TPU on Colab: ALBERT_Mongolian_text_classification.ipynb (a minimal fine-tuning sketch follows this list)
- [Colab] Masked Language Modeling (MLM) on Colab: ALBERT_Mongolian_MLM.ipynb
- [Video] AWS-Mongolians e-meetup #3
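If you want a plain-script starting point instead of the notebook, the sketch below shows the general shape of fine-tuning ALBERT-Mongolian for text classification with AlbertForSequenceClassification. The example sentence, label index, number of labels (nine Eduge categories), and hyperparameters are assumptions for illustration, not values taken from the notebook.

```python
import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification

# Load the pretrained encoder with a fresh classification head.
# num_labels=9 assumes the nine Eduge news categories listed in the results below.
tokenizer = AlbertTokenizer.from_pretrained("bayartsogt/albert-mongolian")
model = AlbertForSequenceClassification.from_pretrained(
    "bayartsogt/albert-mongolian", num_labels=9
)

# One hypothetical (text, label) pair; a real run would iterate over a DataLoader.
texts = ["Монголын шигшээ баг амжилттай тоглолоо."]  # made-up sports headline
labels = torch.tensor([2])                           # e.g. the index of the "спорт" (sports) class

enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**enc, labels=labels)  # the model computes cross-entropy loss internally
outputs.loss.backward()
optimizer.step()
```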
| Model | Task | Dataset | Weighted F1 |
| --- | --- | --- | --- |
| ALBERT-base | Text Classification | Eduge dataset | 0.90 |
| ... | ... | ... | ... |
Note that while ALBERT-base is comparable to BERT-base on the results shown below, it is roughly an order of magnitude smaller (135 MB vs. 1.2 GB).
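To verify the size difference locally, you can count parameters directly. The snippet below is a minimal sketch that inspects only the ALBERT checkpoint (the Mongolian BERT-base checkpoint name is not given here, so it is left out).

```python
from transformers import AlbertModel

# Load the released checkpoint and count its parameters.
albert = AlbertModel.from_pretrained("bayartsogt/albert-mongolian")
num_params = sum(p.numel() for p in albert.parameters())
print(f"ALBERT-Mongolian parameters: {num_params:,}")  # cross-layer weight sharing keeps this small
```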
- ALBERT-Mongolian:

| Label | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| байгал орчин (environment) | 0.85 | 0.83 | 0.84 | 999 |
| боловсрол (education) | 0.80 | 0.80 | 0.80 | 873 |
| спорт (sports) | 0.98 | 0.98 | 0.98 | 2736 |
| технологи (technology) | 0.88 | 0.93 | 0.91 | 1102 |
| улс төр (politics) | 0.92 | 0.85 | 0.89 | 2647 |
| урлаг соёл (arts & culture) | 0.93 | 0.94 | 0.94 | 1457 |
| хууль (law) | 0.89 | 0.87 | 0.88 | 1651 |
| эдийн засаг (economy) | 0.83 | 0.88 | 0.86 | 2509 |
| эрүүл мэнд (health) | 0.89 | 0.92 | 0.90 | 1159 |
| accuracy | | | 0.90 | 15133 |
| macro avg | 0.89 | 0.89 | 0.89 | 15133 |
| weighted avg | 0.90 | 0.90 | 0.90 | 15133 |
- BERT-Mongolian: results from Mongolian Text Classification
| Label | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| байгал орчин (environment) | 0.82 | 0.84 | 0.83 | 999 |
| боловсрол (education) | 0.91 | 0.70 | 0.79 | 873 |
| спорт (sports) | 0.97 | 0.98 | 0.97 | 2736 |
| технологи (technology) | 0.91 | 0.85 | 0.88 | 1102 |
| улс төр (politics) | 0.87 | 0.86 | 0.86 | 2647 |
| урлаг соёл (arts & culture) | 0.88 | 0.96 | 0.92 | 1457 |
| хууль (law) | 0.86 | 0.85 | 0.86 | 1651 |
| эдийн засаг (economy) | 0.84 | 0.87 | 0.85 | 2509 |
| эрүүл мэнд (health) | 0.90 | 0.90 | 0.90 | 1159 |
| accuracy | | | 0.88 | 15133 |
| macro avg | 0.88 | 0.87 | 0.87 | 15133 |
| weighted avg | 0.88 | 0.88 | 0.88 | 15133 |
Pretrain from Scratch: You can follow PRETRAIN_SCRATCH.md to reproduce the results.
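Since pretraining starts from the SentencePiece tokenizer mentioned above, here is a minimal sketch of that first step. The corpus path, model prefix, vocabulary size, and model type are placeholder assumptions; PRETRAIN_SCRATCH.md has the exact settings used for the released model.

```python
import sentencepiece as spm

# Train a SentencePiece model on a Mongolian corpus (one sentence per line).
# "mn_corpus.txt", the prefix, and vocab_size=30000 are placeholder values.
spm.SentencePieceTrainer.train(
    input="mn_corpus.txt",
    model_prefix="mn_albert_sp",
    vocab_size=30000,
    model_type="unigram",
)

# Load the trained model and tokenize a sample sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="mn_albert_sp.model")
print(sp.encode("Монгол хэл дээрх жишээ өгүүлбэр.", out_type=str))
```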
- ALBERT - official repo
- WikiExtractor
- Mongolian BERT
- ALBERT - Japanese
- Mongolian Text Classification
- You's paper
- AWS-Mongolia e-meetup #3
@misc{albert-mongolian,
author = {Bayartsogt Yadamsuren},
title = {ALBERT Pretrained Model on Mongolian Datasets},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/bayartsogt-ya/albert-mongolian/}}
}