We propose VocADT, a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings while keeping the model’s weights fixed. VocADT offers a flexible and scalable solution without requiring external resources or language constraints.
Only the input/output embeddings are replaced, while all other original weights remain fixed. The released models below are the merged versions: after training the adapters, we merge the original embeddings with the adapter to produce the new embeddings.
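As a rough illustration of the idea (a minimal sketch, not the repo's implementation; the tensor names and sizes below are assumptions), each new-vocabulary embedding is a learned linear combination of the original embeddings, and merging folds that combination into a standalone embedding matrix:

```python
import torch

# Illustrative sizes only (assumptions, not the repo's actual configuration)
d_model, old_vocab, new_vocab = 4096, 32_000, 50_000

E_old = torch.randn(old_vocab, d_model)   # frozen original input/output embeddings
A = torch.randn(new_vocab, old_vocab)     # trainable adapter: mixing weights over the old vocabulary

# Adaptation trains only A; the transformer body and E_old stay frozen.
# Merging afterwards folds the adapter into a standalone embedding matrix
# that replaces the input/output embeddings:
E_new = A @ E_old                          # shape: (new_vocab, d_model)
```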
Name | Adapted Model | Base Model | New Vocab Size | Focused Languages |
---|---|---|---|---|
VocADT-Latin-Mistral | h-j-han/Mistral-7B-VocADT-50k-Latin | Mistral | 50k | Swahili (sw), Indonesian (id), Estonian (et), Haitian Creole (ht), English (en) |
VocADT-Mixed-Mistral | h-j-han/Mistral-7B-VocADT-50k-Mixed | Mistral | 50k | Korean (ko), Greek (el), Russian (ru), Bulgarian (bg), English (en) |
VocADT-Cyrillic-Mistral | h-j-han/Mistral-7B-VocADT-50k-Cyrillic | Mistral | 50k | Russian (ru), Bulgarian (bg), Ukrainian (uk), Kazakh (kk), English (en) |
VocADT-Latin-Llama | h-j-han/Llama2-7B-VocADT-50k-Latin | Llama | 50k | Swahili (sw), Indonesian (id), Estonian (et), Haitian Creole (ht), English (en) |
VocADT-Mixed-Llama | h-j-han/Llama2-7B-VocADT-50k-Mixed | Llama | 50k | Korean (ko), Greek (el), Russian (ru), Bulgarian (bg), English (en) |
VocADT-Cyrillic-Llama | h-j-han/Llama2-7B-VocADT-50k-Cyrillic | Llama | 50k | Russian (ru), Bulgarian (bg), Ukrainian (uk), Kazakh (kk), English (en) |
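The merged checkpoints load like any other Hugging Face causal LM; below is a minimal usage sketch (assuming the standard `transformers` auto classes and a GPU with enough memory; the example sentence is Swahili):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The merged models ship their own 50k-token tokenizer and embeddings,
# so no extra adaptation step is needed at load time.
model_id = "h-j-han/Mistral-7B-VocADT-50k-Latin"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Habari ya asubuhi", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```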
$ conda create -n vocadt python=3.11 pytorch=2.3.1 pytorch-cuda=12.1 torchvision torchaudio -c pytorch -c nvidia
$ conda activate vocadt
$ pip install -r requirements.txt
We evaluate adaptation methods with multilingual benchmarks covering various tasks, including MT, natural language inference (NLI), common sense reasoning, and multiple-choice question answering (QA).
For MT from English to non-English (en-xx) and from non-English to English (xx-en), we use FLORES, as it supports all the languages we experiment with. We use five-shot prompting for MT with the models from the adaptation phase.
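The decoding script below implements the actual prompting; as a rough sketch, a five-shot MT prompt simply stacks five (source, reference) demonstration pairs before the test sentence (the template here is a hypothetical illustration, not necessarily the exact one used in the paper):

```python
def build_five_shot_prompt(demos, src_sentence, src_lang="Swahili", tgt_lang="English"):
    """demos: five (source, reference) pairs, e.g. drawn from the FLORES dev split."""
    blocks = [f"{src_lang}: {s}\n{tgt_lang}: {t}" for s, t in demos]
    blocks.append(f"{src_lang}: {src_sentence}\n{tgt_lang}:")  # model continues with the translation
    return "\n\n".join(blocks)
```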
Please refer to ./scripts/eval_mt.sh for full commands.
$ python vocadt/decode_llm_mt.py --model_name_or_path=h-j-han/Mistral-7B-VocADT-50k-Latin --src=sw --tgt=en --nsample=100 # for simple test run
or
$ ./scripts/eval_mt.sh
We assess translation quality with xCOMET, which produces a score between 0 and 1, where higher indicates better quality. Make sure the evaluation model is authorized and ready to use before running. You can run the evaluation separately:
$ python vocadt/eval_comet.py --input_file=outputs/Mistral-7B-VocADT-50k-Latin/flores100.sw-en.5shot.tsv
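eval_comet.py wraps xCOMET scoring; a minimal standalone sketch with the `unbabel-comet` library looks roughly like this (checkpoint name and data fields follow the library's documented `predict` API; the gated checkpoint must be authorized on Hugging Face first):

```python
from comet import download_model, load_from_checkpoint

# Download and load the gated xCOMET checkpoint (requires prior Hugging Face authorization).
model_path = download_model("Unbabel/XCOMET-XL")
model = load_from_checkpoint(model_path)

# Each sample needs the source, the machine translation, and a reference.
data = [{"src": "Habari ya asubuhi", "mt": "Good morning", "ref": "Good morning"}]
output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)        # per-sentence scores in [0, 1], higher is better
print(output.system_score)  # corpus-level average
```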
We experiment with non-MT tasks: XNLI (NLI), XCOPA (common sense reasoning), Belebele (QA), and Multilingual MMLU (QA). Please refer to ./scripts/eval_non-mt.sh for full commands.
$ accelerate launch -m lm_eval --model hf --model_args pretrained=h-j-han/Mistral-7B-VocADT-50k-Latin --tasks xnli_sw --num_fewshot 0 # for simple test run
or
$ ./scripts/eval_non-mt.sh
(code for training to be added soon)
Please find details in this paper:
@inproceedings{
han2025adapters,
title={Adapters for Altering {LLM} Vocabularies: What Languages Benefit the Most?},
author={HyoJung Han and Akiko Eriguchi and Haoran Xu and Hieu Hoang and Marine Carpuat and Huda Khayrallah},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=KxQRHOre9D}
}