
Seq2Seq Spanish Pre-trained Language Models

This repository contains the models and scripts from the paper Sequence-to-Sequence Spanish Pre-trained Language Models.

Models

All our pre-trained models can be found on the HuggingFace Hub.

BARTO and T5S are variants of BART and T5, respectively, pre-trained exclusively on Spanish corpora in a self-supervised manner. Both are base-sized models, with approximately 140 million parameters (BARTO) and 220 million parameters (T5S).

You can load T5S like this:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("vgaraujov/t5-base-spanish")
model = AutoModel.from_pretrained("vgaraujov/t5-base-spanish")

You can load BARTO like this:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("vgaraujov/bart-base-spanish")
model = AutoModel.from_pretrained("vgaraujov/bart-base-spanish")

Additional Models

LEDO was built to process longer sequences by leveraging the weights of BARTO. To handle sequences of up to 16K tokens, BARTO's position embedding matrix was copied 16 times.

You can load LEDO like this:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("vgaraujov/led-base-16384-spanish")
model = AutoModel.from_pretrained("vgaraujov/led-base-16384-spanish")

BERT2BERT-style models were introduced as baselines. By leveraging the Encoder Decoder model class from Hugging Face Transformers and using BETO and RoBERTa-BNE checkpoints, we initialized BETO2BETO and RoBERTa2RoBERTa.

You can load BETO2BETO like this:

from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "dccuchile/bert-base-spanish-wwm-cased",
    "dccuchile/bert-base-spanish-wwm-cased",
    tie_encoder_decoder=False
)

Note: setting tie_encoder_decoder=True instead initializes the shared-parameter variants, BETOShare or RoBERTaShare (depending on the checkpoints used).
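
Before fine-tuning or generating with a warm-started encoder-decoder, the special-token settings must be configured on the model config, since BERT checkpoints do not define decoder start or end tokens. The sketch below follows the Hugging Face Encoder Decoder documentation; the specific token choices are the usual BERT conventions, not necessarily the exact settings used in our experiments:

from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "dccuchile/bert-base-spanish-wwm-cased",
    "dccuchile/bert-base-spanish-wwm-cased",
    tie_encoder_decoder=False,
)

# BERT has no decoder start/end tokens, so reuse [CLS] and [SEP]
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id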

Fine-tuning

To fine-tune BARTO, T5S, and LEDO, we rely on the Hugging Face example scripts for summarization and translation.

For tasks like generative question answering, split-and-rephrase, and dialogue, we implemented additional scripts, which are found in this repository. We also implemented script versions for experimenting with BERT2BERT-style models, likewise found in this repository.

We include experiment files that you can run to replicate our results. For example, running:

bash run_summarization.sh
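
After fine-tuning, the resulting checkpoint can be used for inference in the usual way, for example with the summarization pipeline. The checkpoint path below is a hypothetical placeholder:

from transformers import pipeline

# "path/to/finetuned-barto-summarization" is a placeholder for your own fine-tuned checkpoint
summarizer = pipeline("summarization", model="path/to/finetuned-barto-summarization")
print(summarizer("Texto largo en español para resumir...", max_length=64)[0]["summary_text"])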

Citation

If you find this repository useful for your research, please consider citing our paper:

@inproceedings{araujo-etal-2024-sequence-sequence,
    title = "Sequence-to-Sequence {S}panish Pre-trained Language Models",
    author = "Araujo, Vladimir  and
      Trusca, Maria Mihaela  and
      Tufi{\~n}o, Rodrigo  and
      Moens, Marie-Francine",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1283",
    pages = "14729--14743",
}

Acknowledgements

This work was funded by the European Research Council Advanced Grant 788506 and supported by the Google Cloud Research Credits program (GCP) and TPU Research Cloud program (TRC).