In this repo, we supply the code and scripts for the implementation of LCS.
Preprocessing details can be found in `scripts/data`.
Training scripts can be found in `scripts/train`.
Inference scripts can be found in `scripts/test`.
In this section, we introduce two branches of the LCS implementation based on the open-source toolkit fairseq (version 1.0.0). (Both branches are placed in `fairseq`, and we include only the modified Python files to save space.)
## The first branch
We report all scores in our paper with the first branch (`fairseq-converter`).
In this implementation, we place the source language tag on the encoder side and the target language tag on the decoder side.
Like this:

```
source: <en> Hello, how are you?
target: <de> Hallo, wie geht’s?
```
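For concreteness, here is a minimal preprocessing sketch that produces this format (illustrative only; the repo's actual preprocessing lives in `scripts/data`, and the helper name below is hypothetical):

```python
# Hypothetical helper (not from this repo): prepend the source-language tag
# to each source line and the target-language tag to each target line,
# producing the first-branch data format shown above.
def add_language_tags(src_in, tgt_in, src_lang, tgt_lang, src_out, tgt_out):
    with open(src_in, encoding="utf-8") as fs, \
         open(tgt_in, encoding="utf-8") as ft, \
         open(src_out, "w", encoding="utf-8") as gs, \
         open(tgt_out, "w", encoding="utf-8") as gt:
        for src_line, tgt_line in zip(fs, ft):
            gs.write(f"<{src_lang}> {src_line.strip()}\n")
            gt.write(f"<{tgt_lang}> {tgt_line.strip()}\n")

# e.g. add_language_tags("train.en", "train.de", "en", "de",
#                        "train.tag.en", "train.tag.de")
```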
In order to acquire the desired target language on the encoder side, we adjust the following Python files (a conceptual sketch follows the file list):
```
fairseq-converter
└── fairseq
    ├── criterions
    │   └── label_smoothed_cross_entropy_le.py
    ├── models
    │   ├── fairseq_encoder.py
    │   └── transformer
    │       ├── transformer_config.py
    │       └── transformer_encoder.py
    ├── sequence_generator.py
    └── tasks
        └── translation_label.py
```
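Conceptually, the change amounts to reading the target-language tag from the decoder input and handing it to the encoder. A rough sketch of the idea, with illustrative names that do not match the repo's actual code:

```python
# Illustrative sketch only: in the first-branch format, the decoder input
# begins with the target language tag (e.g. <de>), so its id can be read
# from position 0 and forwarded to a modified encoder that accepts it.
def forward(model, src_tokens, src_lengths, prev_output_tokens):
    tgt_lang_ids = prev_output_tokens[:, 0]  # (batch,) target-language tag ids
    encoder_out = model.encoder(
        src_tokens,
        src_lengths=src_lengths,
        tgt_lang_ids=tgt_lang_ids,  # hypothetical extra argument
    )
    return model.decoder(prev_output_tokens, encoder_out=encoder_out)
```

At inference time there is no gold target prefix, which is presumably why `sequence_generator.py` is also modified, so that the tag can be supplied to the encoder during beam search.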
The corresponding scripts for preparing, training, and testing on the three datasets are placed in `scripts`.
## The second branch
To simplify the implementation, we provide the second branch (`fairseq-LCS`).
In this implementation, we place an extra target language tag on the encoder side, which is treated as a padding token during the calculation (see the sketch after the example below).
Like this:

```
source: <de> <en> Hello, how are you?
target: <de> Hallo, wie geht’s?
```
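The core of this trick can be sketched as a small change to the encoder's padding mask (a rough, illustrative sketch; the repo's actual change lives in the modified `transformer_encoder.py` listed below):

```python
import torch

# Illustrative sketch only: mark the prepended target-language tag
# (position 0 of src_tokens in the second-branch format) as padding,
# so self-attention treats it like a pad token during the calculation.
def build_encoder_padding_mask(src_tokens, pad_idx):
    padding_mask = src_tokens.eq(pad_idx)  # True at real pad positions
    padding_mask[:, 0] = True              # also mask the <tgt_lang> tag
    return padding_mask

# ids for [<de>, <en>, Hello, ",", how, are, you, "?", <pad>]
src_tokens = torch.tensor([[10, 11, 20, 21, 22, 23, 24, 25, 1]])
print(build_encoder_padding_mask(src_tokens, pad_idx=1))
# tensor([[ True, False, False, False, False, False, False, False,  True]])
```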
Compared to the first branch, this one has only two modified Python files, as follows:
```
fairseq-LCS
└── fairseq
    └── models
        └── transformer
            ├── transformer_config.py
            └── transformer_encoder.py
```
We also provide the corresponding training and test scripts in `scripts`, with the `_2` suffix.
## Difference
We examine the difference between the two implementations on the OPUS-100 dataset.
Under a fair comparison setting, the scores of both implementations are listed below:
| Implementation | Supervised | Zero-Shot | Accuracy |
| --- | --- | --- | --- |
| fairseq-converter (first) | 24.80 | 15.22 | 85.35 |
| fairseq-LCS (second) | 24.63 | 15.19 | 85.39 |
Supervised and Zero-Shot denote the average sacreBLEU (%), and Accuracy denotes the language accuracy on zero-shot translation. Both models above are trained with k set to 2.
The table above shows that the first branch yields a slight advantage, but the difference is negligible.
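For reference, the two metrics could be computed along these lines (an illustrative sketch, not the repo's actual test script; it assumes the `sacrebleu` and `langid` packages):

```python
import sacrebleu  # pip install sacrebleu
import langid     # pip install langid

# Illustrative sketch only: corpus-level sacreBLEU over detokenized text,
# plus language accuracy, i.e. the fraction of hypotheses whose detected
# language matches the desired target language.
def evaluate(hypotheses, references, tgt_lang):
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    correct = sum(langid.classify(hyp)[0] == tgt_lang for hyp in hypotheses)
    return bleu, 100.0 * correct / len(hypotheses)

bleu, acc = evaluate(["Hallo, wie geht's?"], ["Hallo, wie geht es dir?"], "de")
print(f"sacreBLEU: {bleu:.2f}, language accuracy: {acc:.2f}%")
```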
If you find this repo useful for your research, please consider citing our paper:
```
@article{sun2024lcs,
  title={LCS: A Language Converter Strategy for Zero-Shot Neural Machine Translation},
  author={Sun, Zengkui and Liu, Yijin and Meng, Fandong and Xu, Jinan and Chen, Yufeng and Zhou, Jie},
  journal={arXiv preprint arXiv:2406.02876},
  year={2024}
}
```