Korean Speech to English Translation Corpus
- Speech files
- Train/Dev/Test filename lists and their English translations
Provided upon request (via this link)
- Korean scripts
- Other metadata (for StyleKQC and Covid-ED)
```bash
git clone https://github.com/warnikchow/kosp2e
cd kosp2e
cd data
wget https://www.dropbox.com/s/y74ew1c1evdoxs1/data.zip
unzip data.zip
```
You then get a folder with the speech files (data and its subfolders) and the split lists (the split folder with .xlsx files).
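A quick way to confirm the extraction worked is to list the two folders. This is a minimal sketch; the exact subfolder names inside data are assumptions:

```bash
# Minimal sanity check after unzipping (subfolder layout assumed, not verified)
ls data/                         # per-subcorpus folders containing the .wav files
ls split/*.xlsx                  # Train/Dev/Test split lists with English translations
find data -name "*.wav" | wc -l  # total number of speech files
```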
Dataset | License | Domain | Characteristics | Volume (Train / Dev / Test) | Tokens (ko / en) | Speakers (Total)
---|---|---|---|---|---|---
Zeroth | CC-BY 4.0 | News / newspaper | DB originally for speech recognition | 22,263 utterances (3,004 unique scripts) (21,589 / 197 / 461) | 72K / 120K | 115
KSS | CC-BY-NC-SA 4.0 | Textbook (colloquial descriptions) | Originally recorded by a single speaker (multi-speaker recording augmented) | 25,708 utterances = 12,854 * 2 (recording augmented) (24,940 / 256 / 512) | 64K / 95K | 17
StyleKQC | CC-BY-SA 4.0 | AI agent (commands) | Speech act (4) and topic (6) labels included | 30,000 utterances (28,800 / 400 / 800) | 237K / 391K | 60
Covid-ED | CC-BY-NC-SA 4.0 | Diary (monologue) | Document-level sentences; emotion tags included | 32,284 utterances (31,324 / 333 / 627) | 358K / 571K | 71
- The total number of .wav files in the Zeroth dataset does not match the total number of translation pairs provided, since some examples were excluded during corpus construction to guarantee data quality. However, to preserve the original Zeroth dataset, we did not delete those files from the .wav folder. Preprocessing and data loading are not affected by this difference, since both are driven by the file lists.
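If you want to verify this, a rough check is to compare the number of Zeroth .wav files on disk against the 22,263 translation pairs listed in the table above; the surplus files are the ones excluded during corpus construction. The path pattern below is a guess about the folder layout:

```bash
# Hypothetical path pattern: count Zeroth speech files and compare with the 22,263 pairs above
find data -ipath "*zeroth*" -name "*.wav" | wc -l
```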
Model | BLEU | WER (ASR) | BLEU (MT/ST)
---|---|---|---
ASR-MT (Pororo) | 16.6 | 34.0 | 18.5 (MT)
ASR-MT (PAPAGO) | 21.3 | 34.0 | 25.0 (MT)
Transformer (Vanilla) | 2.6 | - | -
ASR pretraining | 5.9 | 24.0* | -
Transformer + Warm-up | 8.7 | - | 35.7 (ST)*
+ Fine-tuning | 18.3 | - | -
- Some of the numbers differ from those in the paper (after fixing some errors), but the differences should not affect the results much.
- Fairseq is required for the basic recipe. You may install the specific fairseq version below for replication.
```bash
wget https://github.com/pytorch/fairseq/archive/148327d8c1e3a5f9d17a11bbb1973a7cf3f955d3.zip
unzip 148327d8c1e3a5f9d17a11bbb1973a7cf3f955d3.zip
pip install -e ./fairseq-148327d8c1e3a5f9d17a11bbb1973a7cf3f955d3/
pip install -r requirements.txt
```
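Before running the recipe, it may help to confirm that the pinned fairseq build installed correctly; this optional check is not part of the original instructions:

```bash
# Optional sanity check: the editable install should be importable and expose the CLI tools
python -c "import fairseq; print(fairseq.__version__)"
fairseq-train --help > /dev/null && echo "fairseq-train is available"
```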
- First, preprocess the data and prepare it in a format that fits the Transformer. The rest follows the fairseq S2T translation recipe for MuST-C.
- This recipe yields the Vanilla model (the most basic end-to-end version). For the advanced training, refer to the paper below.
```bash
python preprocessing.py
python prep_data.py --data-root dataset/ --task st --vocab-type unigram --vocab-size 8000
```
```bash
fairseq-train dataset/kr-en --config-yaml config_st.yaml \
  --train-subset train_st --valid-subset dev_st --save-dir result --num-workers 4 \
  --max-tokens 40000 --max-update 50000 --task speech_to_text \
  --criterion label_smoothed_cross_entropy --report-accuracy \
  --arch s2t_transformer_s --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 --fp16
```
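The fairseq MuST-C S2T recipe that this setup follows finishes with checkpoint averaging and fairseq-generate scoring via SacreBLEU. A hedged sketch is below; the test subset name test_st, the averaging step, and the decoding hyperparameters are assumptions carried over from that recipe rather than settings confirmed by this repository:

```bash
# Average the last 10 epoch checkpoints, as in the fairseq MuST-C S2T recipe (optional)
python fairseq-148327d8c1e3a5f9d17a11bbb1973a7cf3f955d3/scripts/average_checkpoints.py \
  --inputs result --num-epoch-checkpoints 10 \
  --output result/avg_last_10_checkpoint.pt

# Decode the test split and report BLEU (subset name and beam settings assumed)
fairseq-generate dataset/kr-en --config-yaml config_st.yaml \
  --gen-subset test_st --task speech_to_text \
  --path result/avg_last_10_checkpoint.pt \
  --max-tokens 50000 --beam 5 --scoring sacrebleu
```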
This work was supported by PAPAGO, NAVER Corp. The authors appreciate Hyoung-Gyu Lee, Eunjeong Lucy Park, Jihyung Moon, and Doosun Yoo for the discussions and support. The authors also thank Taeyoung Jo, Kyubyong Park, and Yoon Kyung Lee for sharing the resources.
Copyright 2021-present NAVER Corp.
The license of each subcorpus (including metadata and the Korean scripts) follows the original license of the raw corpus. For KSS and Covid-ED, only academic usage is permitted.
```bibtex
@inproceedings{cho21b_interspeech,
  author={Won Ik Cho and Seok Min Kim and Hyunchang Cho and Nam Soo Kim},
  title={{kosp2e: Korean Speech to English Translation Corpus}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3705--3709},
  doi={10.21437/Interspeech.2021-1040}
}
```
The arXiv version is available here.
Contact Won Ik Cho ([email protected]) for further questions.