We formulate the task of quality estimation for speech translation (SpeechQE), construct a benchmark, and evaluate a family of systems based on cascaded and end-to-end architectures.
We provide our E2E models on the Hugging Face Hub. The provided models correspond to "TowerInstruct-LoRA+Adapter-pt-Fixed" in the paper.
SpeechQE for | E2E Model | Trained Domain |
---|---|---
English-to-German Speech Translation | h-j-han/SpeechQE-TowerInstruct-7B-en2de | CoVoST2 |
Spanish-to-English Speech Translation | h-j-han/SpeechQE-TowerInstruct-7B-es2en | CoVoST2 |
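For reference, the weights can also be pre-fetched directly with huggingface_hub; a minimal sketch (the scoring script below is expected to download and load the model for you, so this step is optional):

from huggingface_hub import snapshot_download

# Optional: pre-fetch the E2E model weights from the Hub.
local_dir = snapshot_download("h-j-han/SpeechQE-TowerInstruct-7B-en2de")
print(local_dir)  # local cache path of the downloaded snapshot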
SpeechQE-CoVoST2: Benchmarks and Training Corpus for SpeechQE
We subsample about 80k segments from the training set and 500 each from the dev and test sets of facebook/covost2, then run seven different direct ST models to generate ST hypotheses. The test split therefore consists of 3,500 instances (500 × 7). We also provide separate splits for each translation model.
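As a quick sanity check, the benchmark can be inspected with Hugging Face datasets; a minimal sketch (the en2de config and per-model test splits such as test_seamlar follow the evaluation commands below; the audio itself is referenced by path and must be downloaded separately from Common Voice):

import datasets

# Inspect the en2de config of the benchmark; audio files are not bundled and
# come from the Common Voice download described below.
bench = datasets.load_dataset("h-j-han/SpeechQE-CoVoST2", "en2de")
print({name: len(split) for name, split in bench.items()})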
$ conda create -n speechqe python=3.11 pytorch=2.0.1 pytorch-cuda=11.7 torchvision torchaudio -c pytorch -c nvidia
$ conda activate speechqe
$ pip install -r requirements.txt
We use Unbabel/XCOMET-XL and google/metricx-23-xl-v2p0 as the text QE metrics for cascaded SpeechQE systems. The ASR system we mainly report in the cascaded setting is openai/whisper-large-v3.
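For illustration, a cascaded SpeechQE system chains ASR with a text QE metric; a hedged sketch using whisper-large-v3 and XCOMET-XL (the audio path and hypothesis string are placeholders, and this is not the exact pipeline code of this repo):

from transformers import pipeline
from comet import download_model, load_from_checkpoint

# Step 1: transcribe the source speech; the ASR transcript stands in for the source text.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
src = asr("path/to/audio.wav")["text"]

# Step 2: score the ST hypothesis against the ASR transcript with a reference-free QE metric.
qe = load_from_checkpoint(download_model("Unbabel/XCOMET-XL"))
out = qe.predict([{"src": src, "mt": "ST hypothesis to score"}], batch_size=1)
print(out.scores)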
We provide all result data in the data folder: data/cas_speechqe contains the results of cascaded SpeechQE with [audio and ST hypothesis] as input, while data/metric contains the automatic quality labels computed from [gold transcription, gold reference, ST hypothesis].
We also provide code to compute the correlations of cascaded SpeechQE systems:
$ python speechqe/calculate_corr_cas.py
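Conceptually, this boils down to correlating the cascaded scores with the metric-based quality labels; an illustrative sketch (file and column names are assumptions, check the files under data for the actual schema):

import pandas as pd
from scipy.stats import spearmanr

# File and column names below are hypothetical; see data/ for the real layout.
cas = pd.read_csv("data/cas_speechqe/scores.tsv", sep="\t")
gold = pd.read_csv("data/metric/labels.tsv", sep="\t")
rho, p = spearmanr(cas["score"], gold["score"])
print(f"Spearman rho={rho:.3f} (p={p:.3g})")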
The model we provide is trained in two phases. In the first phase, we train the model on the ST and ASR tasks, updating only the adapter. In the second phase, we train on the SpeechQE task, keeping the adapter pre-trained in the first phase fixed and fine-tuning TowerInstruct with LoRA.
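As a rough illustration of the second phase (not the actual training code; the model id and LoRA hyperparameters here are assumptions), attaching LoRA to the backbone with peft while everything else stays frozen looks like:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative phase-2 setup: the phase-1 speech adapter stays frozen; only the
# LoRA parameters added to the LLM backbone are trainable.
llm = AutoModelForCausalLM.from_pretrained("Unbabel/TowerInstruct-7B-v0.2")
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
llm = get_peft_model(llm, lora)
llm.print_trainable_parameters()  # only the LoRA weights are listed as trainable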
Download the audio data from Common Voice. Here, we use mozilla-foundation/common_voice_4_0.
import datasets

# Download English Common Voice 4.0; cache_dir here is the BASE_AUDIO_PATH used below.
cv4en = datasets.load_dataset(
    "mozilla-foundation/common_voice_4_0", "en", cache_dir='path/to/cv4/download',
)
The training code and training corpus will be provided later. However, if you need them sooner, please do not hesitate to ping me ([email protected])!
We provide the SpeechQE benchmark: h-j-han/SpeechQE-CoVoST2. BASE_AUDIO_PATH is the path to the Common Voice dataset downloaded above.
Please refer to ./scripts/eval_mt.sh for the full commands.
$ python speechqe/score_speechqe.py \
--speechqe_model=h-j-han/SpeechQE-TowerInstruct-7B-en2de \
--dataset_name=h-j-han/SpeechQE-CoVoST2 \
--base_audio_path=$BASE_AUDIO_PATH \
--dataset_config_name=en2de \
--test_split_name=test_seamlar
or, for a simple test run:
$ ./scripts/score_spechqe.sh
We compare the output quality scores from SpeechQE systems with human direct assessment (DA) scores on the IWSLT-ACL test set from IWSLT/da2023.
$ python speechqe/score_speechqe.py \
--dataroot=data/acl \
--manifest_files=test_ACL-iwslt23da-humandasc-en2de_fixedinst.tsv \
--speechqe_model=h-j-han/SpeechQE-TowerInstruct-7B-en2de
The results of the cascaded systems on the IWSLT-ACL test set and related data can be found in the data/acl folder.
Please find more details in the ACL paper or the arXiv version.
@inproceedings{han-etal-2024-speechqe,
title = "{S}peech{QE}: Estimating the Quality of Direct Speech Translation",
author = "Han, HyoJung and
Duh, Kevin and
Carpuat, Marine",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.1218",
pages = "21852--21867",
}