
SpeechQE: Estimating the Quality of Direct Speech Translation at EMNLP2024

We formulate the task of quality estimation for speech translation (SpeechQE), construct a benchmark, and evaluate a family of systems based on cascaded and end-to-end architectures.

We provide our E2E models on the Hugging Face Hub. The provided models correspond to "TowerInstruct-LoRA+Adapter-pt-Fixed" in the paper.

| SpeechQE for                          | E2E Model                               | Trained Domain |
| ------------------------------------- | --------------------------------------- | -------------- |
| English-to-German Speech Translation  | h-j-han/SpeechQE-TowerInstruct-7B-en2de | CoVoST2        |
| Spanish-to-English Speech Translation | h-j-han/SpeechQE-TowerInstruct-7B-es2en | CoVoST2        |

SpeechQE-CoVoST2: Benchmarks and Training Corpus for SpeechQE

We subsample about 80k segments from the training set and 500 each from the dev and test sets of facebook/covost2, then run seven different direct ST models to generate ST hypotheses. The full test split therefore consists of 3,500 instances (500 × 7). We also provide separate splits for each translation model.
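For a quick look at the benchmark, the splits can be loaded directly from the Hugging Face Hub with the datasets library. This is a minimal sketch: the configuration name en2de and the split test_seamlar are taken from the Eval section below, and the exact column layout is not spelled out here, so inspect the loaded dataset for details.

import datasets

# Load one language-pair configuration of the benchmark from the Hugging Face Hub
# (en2de and es2en are the configurations used in this README).
speechqe_bench = datasets.load_dataset("h-j-han/SpeechQE-CoVoST2", "en2de")

# Print the available splits, including the per-ST-model test splits
# (e.g. test_seamlar, used in the Eval example below).
print(speechqe_bench)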

Environment Setup

$ conda create -n speechqe python=3.11 pytorch=2.0.1 pytorch-cuda=11.7 torchvision torchaudio -c pytorch -c nvidia
$ conda activate speechqe
$ pip install -r requirements.txt

Cascaded SpeechQE

We use Unbabel/XCOMET-XL and google/metricx-23-xl-v2p0 as the text QE components of the cascaded SpeechQE systems. The ASR system we mainly report for the cascaded setting is openai/whisper-large-v3.
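As a rough sketch of what such a cascade computes (not the repository's exact pipeline), one can transcribe the source audio with Whisper and then score the ST hypothesis against that transcript with XCOMET in reference-free mode; the audio path and hypothesis string below are placeholders.

from transformers import pipeline
from comet import download_model, load_from_checkpoint

# Step 1: ASR on the source speech (placeholder audio path).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
transcript = asr("path/to/source_audio.wav")["text"]

# Step 2: reference-free QE of the ST hypothesis against the ASR transcript.
xcomet = load_from_checkpoint(download_model("Unbabel/XCOMET-XL"))
data = [{"src": transcript, "mt": "placeholder ST hypothesis"}]
print(xcomet.predict(data, batch_size=1, gpus=1).scores)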

We provide all the result data in the data folder: data/cas_speechqe contains the results of cascaded SpeechQE, whose input is [audio and ST hypothesis], while data/metric contains the automatic quality labels computed from [gold transcription, gold reference, ST hypothesis]. We also provide code to calculate the correlations of the cascaded SpeechQE systems:

$ python speechqe/calculate_corr_cas.py

End-to-End SpeechQE

The model we provide is trained in two phases. In the first phase, the model is trained on the ST and ASR tasks, updating only the adapter. In the second phase, it is trained on the SpeechQE task while the adapter pre-trained in the first phase is kept fixed and TowerInstruct is fine-tuned with LoRA.
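As an illustration of the second phase only, the sketch below freezes a hypothetical speech adapter and attaches LoRA to the language-model backbone with the peft library. The attribute name speech_adapter, the LoRA hyperparameters, and the target modules are assumptions made for this sketch, not the repository's actual training code.

from peft import LoraConfig, get_peft_model

def prepare_phase2(model):
    # Freeze the adapter trained in phase 1 (ASR/ST) so it stays fixed;
    # `speech_adapter` is an assumed attribute name for this sketch.
    for param in model.speech_adapter.parameters():
        param.requires_grad = False

    # Attach LoRA to the TowerInstruct backbone for SpeechQE training
    # (rank and target modules are illustrative choices).
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_config)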

Download Common Voice

Download the audio data from Common Voice. Here, we use mozilla-foundation/common_voice_4_0.

import datasets

# Download the English audio data of Common Voice 4.0; set cache_dir to where
# you want the audio stored (this path is later used as BASE_AUDIO_PATH).
cv4en = datasets.load_dataset(
    "mozilla-foundation/common_voice_4_0", "en", cache_dir='path/to/cv4/download',
)

Train

The training code and training corpus will be provided later. However, if you need them sooner, please do not hesitate to ping me ([email protected])!

Eval

We provide the SpeechQE benchmark h-j-han/SpeechQE-CoVoST2. BASE_AUDIO_PATH is the path to the downloaded Common Voice dataset. Please refer to ./scripts/eval_mt.sh for the full commands.

$ python speechqe/score_speechqe.py \
    --speechqe_model=h-j-han/SpeechQE-TowerInstruct-7B-en2de \
    --dataset_name=h-j-han/SpeechQE-CoVoST2 \
    --base_audio_path=$BASE_AUDIO_PATH \
    --dataset_config_name=en2de \
    --test_split_name=test_seamlar  # for a simple test run

or

$ ./scripts/score_spechqe.sh

SpeechQE Correlation with Human Direct Assessment Score

We compare the output quality scores from SpeechQE systems with human direct assessment (DA) scores from the IWSLT-ACL test set (IWSLT/da2023).

$ python speechqe/score_speechqe.py \
    --dataroot=data/acl \
    --manifest_files=test_ACL-iwslt23da-humandasc-en2de_fixedinst.tsv \
    --speechqe_model=h-j-han/SpeechQE-TowerInstruct-7B-en2de

The results of the cascaded systems on the IWSLT-ACL test set and related data can be found in data/acl.
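Once system scores and human DA scores are aligned per segment, the correlation itself is a simple computation; below is a minimal sketch with scipy in which the file name and column names are illustrative, not the actual schema of the TSVs in data/acl.

import pandas as pd
from scipy.stats import spearmanr

# Illustrative only: the file name and column names are placeholders.
df = pd.read_csv("data/acl/example_scores.tsv", sep="\t")
rho, pvalue = spearmanr(df["speechqe_score"], df["human_da_score"])
print(f"Spearman rho = {rho:.3f} (p = {pvalue:.3g})")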

Reference

Please find details in the ACL paper or arXiv paper.

@inproceedings{han-etal-2024-speechqe,
    title = "{S}peech{QE}: Estimating the Quality of Direct Speech Translation",
    author = "Han, HyoJung  and
      Duh, Kevin  and
      Carpuat, Marine",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.1218",
    pages = "21852--21867",
}
