
kosp2e

Korean Speech to English Translation Corpus

Dataset

Freely available

  • Speech files
  • Train/Dev/Test filename lists and their English translations

Provided upon request (via this link)

  • Korean scripts
  • Other metadata (for StyleKQC and Covid-ED)

Howto

```sh
git clone https://github.com/warnikchow/kosp2e
cd kosp2e
cd data
wget https://www.dropbox.com/s/y74ew1c1evdoxs1/data.zip
unzip data.zip
```

You will then have a folder with the speech files (`data` and its subfolders) and the split lists (`split`, as .xlsx files).
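After unzipping, a quick sanity check is to count the speech files per subcorpus. The sketch below is a hedged illustration: it assumes one top-level subfolder per subcorpus inside `data/` and takes path strings as input; adjust it to the actual layout after unzipping.

```python
from collections import Counter
from pathlib import PurePosixPath

def count_wavs_by_subcorpus(paths):
    """Count .wav files per top-level subfolder.

    `paths` are path strings relative to the unzipped data/ folder.
    The subfolder-per-subcorpus layout is an assumption, not something
    the README specifies.
    """
    counts = Counter(
        PurePosixPath(p).parts[0]
        for p in paths
        if PurePosixPath(p).suffix == ".wav" and len(PurePosixPath(p).parts) > 1
    )
    return dict(counts)

# Example with hypothetical paths:
print(count_wavs_by_subcorpus([
    "zeroth/spk01/utt001.wav",
    "zeroth/spk01/utt002.wav",
    "kss/1_0000.wav",
]))  # {'zeroth': 2, 'kss': 1}
```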

Specification

| Dataset | License | Domain | Characteristics | Volume (Train / Dev / Test) | Tokens (ko / en) | Speakers (Total) |
|---|---|---|---|---|---|---|
| Zeroth | CC-BY 4.0 | News / newspaper | DB originally for speech recognition | 22,263 utterances (3,004 unique scripts) (21,589 / 197 / 461) | 72K / 120K | 115 |
| KSS | CC-BY-NC-SA 4.0 | Textbook (colloquial descriptions) | Originally recorded by a single speaker (multi-speaker recording augmented) | 25,708 utterances = 12,854 × 2 (recording augmented) (24,940 / 256 / 512) | 64K / 95K | 17 |
| StyleKQC | CC-BY-SA 4.0 | AI agent (commands) | Speech act (4) and topic (6) labels included | 30,000 utterances (28,800 / 400 / 800) | 237K / 391K | 60 |
| Covid-ED | CC-BY-NC-SA 4.0 | Diary (monologue) | Document-level sentences; emotion tags included | 32,284 utterances (31,324 / 333 / 627) | 358K / 571K | 71 |
  • The total number of .wav files in the Zeroth dataset does not match the total number of translation pairs provided, since some examples were excluded during corpus construction to guarantee data quality. To preserve the original Zeroth dataset, however, we did not delete those files from the .wav folder. Preprocessing and data loading are not affected by this difference.
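Because of this mismatch, a data loader may want to keep only the .wav files that actually have a translation pair. A minimal sketch, assuming pair IDs are wav basenames without the extension (the actual ID scheme in the split files may differ):

```python
from pathlib import PurePosixPath

def keep_paired(wav_paths, paired_ids):
    """Return only the .wav paths whose basename (sans extension)
    appears in the list of translation-pair IDs.

    The basename-as-ID convention is an assumption; match it to the
    actual split files before use.
    """
    wanted = set(paired_ids)
    return [p for p in wav_paths if PurePosixPath(p).stem in wanted]

# Example with hypothetical paths and IDs:
print(keep_paired(["zeroth/a001.wav", "zeroth/a002.wav"], ["a001"]))
# ['zeroth/a001.wav']
```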

Baseline

| Model | BLEU | WER (ASR) | BLEU (MT/ST) |
|---|---|---|---|
| ASR-MT (Pororo) | 16.6 | 34.0 | 18.5 (MT) |
| ASR-MT (PAPAGO) | 21.3 | 34.0 | 25.0 (MT) |
| Transformer (Vanilla) | 2.6 | - | - |
| ASR pretraining | 5.9 | 24.0* | - |
| Transformer + Warm-up | 8.7 | - | 35.7 (ST)* |
| + Fine-tuning | 18.3 | - | - |
  • Some numbers differ from those in the paper (after fixing some errors), but the differences should not affect the results much.
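The WER column above is word error rate: word-level edit distance divided by the number of reference words. A minimal reference implementation for inspection (this is not the scoring script used in the paper):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / #ref words."""
    r, h = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words
    dp = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, dp[0] = dp[0], i
        for j, hw in enumerate(h, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (rw != hw))  # substitution / match
            prev = cur
    return dp[len(h)] / max(len(r), 1)

print(wer("the cat sat", "the cat sit"))  # 1 substitution / 3 words
```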

Recipe

```sh
wget https://github.com/pytorch/fairseq/archive/148327d8c1e3a5f9d17a11bbb1973a7cf3f955d3.zip
unzip 148327d8c1e3a5f9d17a11bbb1973a7cf3f955d3.zip
pip install -e ./fairseq-148327d8c1e3a5f9d17a11bbb1973a7cf3f955d3/

pip install -r requirements.txt
```
  • First, preprocess the data, then prepare it in the format the Transformer expects. The remaining steps follow the fairseq S2T translation recipe for MuST-C.
  • This recipe yields the Vanilla model (the most basic end-to-end version). For advanced training, refer to the paper below.
```sh
python preprocessing.py

python prep_data.py --data-root dataset/ --task st --vocab-type unigram --vocab-size 8000

fairseq-train dataset/kr-en --config-yaml config_st.yaml \
  --train-subset train_st --valid-subset dev_st --save-dir result --num-workers 4 \
  --max-tokens 40000 --max-update 50000 --task speech_to_text \
  --criterion label_smoothed_cross_entropy --report-accuracy \
  --arch s2t_transformer_s --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 --fp16
```

Acknowledgement

This work was supported by PAPAGO, NAVER Corp. The authors thank Hyoung-Gyu Lee, Eunjeong Lucy Park, Jihyung Moon, and Doosun Yoo for discussions and support, and Taeyoung Jo, Kyubyong Park, and Yoon Kyung Lee for sharing the resources.

Copyright

Copyright 2021-present NAVER Corp.

License

The license of each subcorpus (including metadata and Korean scripts) follows the original license of the raw corpus. For KSS and Covid-ED, only academic use is permitted.

Citation

```bibtex
@inproceedings{cho21b_interspeech,
  author={Won Ik Cho and Seok Min Kim and Hyunchang Cho and Nam Soo Kim},
  title={{kosp2e: Korean Speech to English Translation Corpus}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={3705--3709},
  doi={10.21437/Interspeech.2021-1040}
}
```

The arXiv version is available here.

Contact

Contact Won Ik Cho ([email protected]) for further questions.
