This repository contains the models and scripts from the paper Sequence-to-Sequence Spanish Pre-trained Language Models.
All our pre-trained models can be found on the HuggingFace Hub.
BARTO and T5S are variants of BART and T5, respectively, pre-trained exclusively on Spanish corpora in a self-supervised manner. Both are base-sized models, comprising approximately 140 million (BARTO) and 220 million (T5S) parameters.
You can load T5S like this:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("vgaraujov/t5-base-spanish")
model = AutoModel.from_pretrained("vgaraujov/t5-base-spanish")
```
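For text-to-text generation, you can instead load the model with its language-modeling head. The snippet below is a minimal sketch (not from the paper), assuming the tokenizer defines the standard T5 sentinel tokens such as `<extra_id_0>`; since T5S is only pre-trained, the output reflects the span-denoising objective until the model is fine-tuned.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Minimal sketch: load T5S with its LM head so that generate() is available.
# The sentinel token <extra_id_0> is assumed to exist, as in the original T5 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("vgaraujov/t5-base-spanish")
model = AutoModelForSeq2SeqLM.from_pretrained("vgaraujov/t5-base-spanish")

inputs = tokenizer("El clima en Madrid es <extra_id_0> durante el verano.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```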
You can load BARTO like this:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("vgaraujov/bart-base-spanish")
model = AutoModel.from_pretrained("vgaraujov/bart-base-spanish")
```
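Continuing from the snippet above, here is a quick sanity-check forward pass with the bare model (an illustrative sketch, not from the paper); the decoder input defaults to the right-shifted input ids, so a single call returns both encoder and decoder hidden states.

```python
import torch

# Sketch: run the bare BARTO model on a Spanish sentence and inspect the hidden states.
inputs = tokenizer("Los modelos preentrenados aceleran la investigación.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)          # decoder hidden states
print(outputs.encoder_last_hidden_state.shape)  # encoder hidden states
```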
LEDO was built to process longer sequences by leveraging the weights of BARTO. To handle sequences of up to 16K tokens, BARTO's position embedding matrix was copied 16 times.
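The following is a rough sketch of that initialization trick, not the authors' conversion script; it assumes BARTO uses BART's standard 1,024 learned positions with the usual 2-position offset.

```python
import torch
from transformers import AutoModel

# Rough sketch of the position-embedding extension (not the exact conversion code).
# BARTO's learned position embeddings cover 1,024 positions (plus BART's 2-position
# offset), so tiling the matrix 16 times yields a 16,384-position matrix that can
# initialize LEDO's encoder.
barto = AutoModel.from_pretrained("vgaraujov/bart-base-spanish")
bart_pos = barto.encoder.embed_positions.weight.data  # shape (1024 + 2, hidden)
long_pos = bart_pos[2:].repeat(16, 1)                 # shape (16384, hidden)
print(long_pos.shape)
# long_pos would then be copied into the LED encoder's position embedding matrix.
```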
You can load LEDO like this:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("vgaraujov/led-base-16384-spanish")
model = AutoModel.from_pretrained("vgaraujov/led-base-16384-spanish")
```
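Continuing from the snippet above, here is a sketch of a forward pass over a long document (an illustration, not from the paper). LED-style models expect a `global_attention_mask`, and the decoder still targets standard-length outputs, so a short decoder input is passed explicitly.

```python
import torch

# Sketch: encode a long Spanish document with LEDO. Only the first token is
# given global attention; all other tokens use local windowed attention.
long_text = "Este es un documento muy largo. " * 600
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=16384)

global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # global attention on the first token

with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        global_attention_mask=global_attention_mask,
        decoder_input_ids=inputs["input_ids"][:, :1],  # minimal decoder input for the bare model
    )

print(outputs.encoder_last_hidden_state.shape)
```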
BERT2BERT-style models were introduced as baselines. By leveraging HuggingFace's Encoder Decoder Models and the BETO and RoBERTa-BNE checkpoints, we initialized BETO2BETO and RoBERTa2RoBERTa.
You can load BETO2BETO like this:
```python
from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "dccuchile/bert-base-spanish-wwm-cased",
    "dccuchile/bert-base-spanish-wwm-cased",
    tie_encoder_decoder=False
)
```
Note: `tie_encoder_decoder=True` initializes BETOShare or RoBERTaShare.
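Before fine-tuning or generating with such a warm-started model, the encoder-decoder config typically needs a few special-token ids set. The snippet below is a sketch, continuing from the model created above and using BETO's tokenizer for the values.

```python
from transformers import AutoTokenizer

# Sketch: special-token configuration that an EncoderDecoderModel typically needs
# before fine-tuning or calling generate(); values follow BETO's BERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```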
To fine-tune BARTO, T5S, and LEDO, we rely on HuggingFace examples for summarization and translation.
For tasks like generative question answering, split-and-rephrase, and dialogue, we implemented additional scripts, which can be found in this repository, along with script versions for experimenting with BERT2BERT-style models.
We include experiment files that you can run to replicate our results. For example:

```bash
bash run_summarization.sh
```
If you find this repository useful for your research, please consider citing our paper:
```bibtex
@inproceedings{araujo-etal-2024-sequence-sequence,
    title = "Sequence-to-Sequence {S}panish Pre-trained Language Models",
    author = "Araujo, Vladimir and
      Trusca, Maria Mihaela and
      Tufi{\~n}o, Rodrigo and
      Moens, Marie-Francine",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1283",
    pages = "14729--14743",
}
```
This work was funded by the European Research Council Advanced Grant 788506 and supported by the Google Cloud Research Credits program (GCP) and TPU Research Cloud program (TRC).