# Data Preparation

In this repo, we rely on Kaldi-style data formatting. We take LibriTTS (including both the clean and other partitions) as an example.

## `data/` directory: data manifests

We have organized a `data/` directory containing the manifests for all the LibriTTS data. Here are the steps to set it up.

1. Please download the manifest archive from here (about 5 MB; or here for Mainland Chinese users), and unzip it to `data/` in the project root. Every sub-directory contains `utt2spk`, `spk2utt` and `wav.scp` files. They are all plain text files, with one `<key> <value>` pair per line.
2. As we use the 16 kHz version of LibriTTS, please down-sample the speech data if you don't already have it at 16 kHz.
3. Then, change the paths in each `wav.scp` to the correct locations on your machine (a sketch follows below).
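
For reference, each line of `wav.scp` maps an utterance ID to a wav path (or a command that produces audio). Below is a minimal sketch for steps 2 and 3; the two prefixes are placeholders you must adapt to your setup, and `sox` is just one possible down-sampling route:

```bash
# Hypothetical prefixes: replace both with the real locations on your machine.
OLD_PREFIX=/path/used/in/the/released/manifests
NEW_PREFIX=/your/local/path/to/LibriTTS-16k

# (If needed) down-sample LibriTTS audio from its native 24 kHz to 16 kHz,
# e.g. per file: sox input.wav -r 16000 output.wav

# Rewrite the path prefix in every subset's wav.scp in one pass.
sed -i "s|${OLD_PREFIX}|${NEW_PREFIX}|g" data/*/wav.scp

# Spot-check: each line should now read "<utt-id> <existing wav path>".
head -n 2 data/*/wav.scp | head
```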

## `feats/` directory: speech features

We use three types of speech features in CTX-vec2wav. They should all be stored in the `feats/` directory in the project root.

- **VQ index** (together with the codebook) from vq-wav2vec. We extracted them with fairseq, and we provide the extracted VQ index sequences and codebook online.
  1. Please download from here (460 MB; here for Mainland Chinese users).
  2. Unzip it to `feats/vqidx`, and change the corresponding paths in the `feats.scp` files.
  3. You can check the feature shapes with `feat-to-shape.py scp:feats/vqidx/eval_all/feats.scp | head`. The shapes should be `(frames, 2)`.
- **PPE auxiliary features.** PPE stands for probability of voicing, pitch and energy (all in log scale). We extracted them using Kaldi and, to save you from installing Kaldi, we provide the extracted and normalized features online.
  1. Please download from here (570 MB; here for Mainland Chinese users).
  2. Similarly, unzip it to `feats/normed_ppe`, and change the corresponding paths in the `feats.scp` files.
  3. Check: the shapes of these features should be `(frames, 3)`. A batch check covering both feature types is sketched below.
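
  A minimal batch check over every subset of both feature types (a sketch, assuming `feat-to-shape.py` is runnable from the project root and the subset names match your unzipped layout):

  ```bash
  # Print the first few shapes in every subset of both feature directories.
  # Expect (frames, 2) for vqidx and (frames, 3) for normed_ppe.
  for scp in feats/vqidx/*/feats.scp feats/normed_ppe/*/feats.scp; do
    echo "== $scp =="
    feat-to-shape.py scp:$scp | head -n 3
  done
  ```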
- **Mel spectrograms (FBanks).** As they are too large, we provide a script to extract them locally:

  ```bash
  nj=64  # number of parallel jobs; set this according to your CPU cores
  bash extract_fbank.sh --nj $nj --stage 0 --stop_stage 1  # default: 80-dim with 10 ms frame shift
  # Stage 0 extracts fbanks in parallel. Stage 1 performs normalization.
  ```

This will create `feats/fbank` and `feats/normed_fbank`, each about 16 GB. You can delete `feats/fbank` after normalization, though it is better to keep the `train_all/cmvn.ark` there.
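
If you do reclaim that space, here is a minimal cleanup sketch that preserves the CMVN statistics (paths assume the default layout produced by `extract_fbank.sh`):

```bash
# Keep the global CMVN stats, drop everything else under feats/fbank.
find feats/fbank -type f ! -name 'cmvn.ark' -delete
find feats/fbank -type d -empty -delete  # remove the now-empty subset dirs
```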

With all of the above in place, you have correctly formatted the data for training!