In this repo, we rely on Kaldi-style data formatting. We take LibriTTS (including the clean and other partitions) as an example, and we have organized a `data` directory containing all the LibriTTS data. Here are the steps to set up that data directory.
- Please download from here (about 5MB; or here for Mainland Chinese users), and unzip it to `data` in the project root. Every sub-directory contains `utt2spk`, `spk2utt` and `wav.scp` files. They are all plain text, with `<key> <value>` on each line (see the sketch after this list).
- As we are using the 16kHz version of LibriTTS, please down-sample the speech data if you don't have it (one possible command is also sketched below).
- Then, change the paths in `wav.scp` to the correct ones on your machine.
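For reference, the sketch below shows what the `<key> <value>` lines look like, one way to rewrite the `wav.scp` paths, and one way to down-sample the audio. The utterance IDs, path prefixes, and the use of `sed`/`sox` are illustrative assumptions, not the repo's own tooling:

```bash
# Illustrative Kaldi-style entries (one "<key> <value>" pair per line; IDs are made up):
#   wav.scp : 1034_121119_000001_000001 /your/path/LibriTTS_16k/train-clean-100/1034/121119/1034_121119_000001_000001.wav
#   utt2spk : 1034_121119_000001_000001 1034

# Rewrite the audio prefix in every wav.scp to match your machine
# (the old prefix below is a placeholder, not the one shipped in the archive):
sed -i 's|/old/prefix/LibriTTS|/your/path/LibriTTS_16k|g' data/*/wav.scp

# Down-sample 24 kHz LibriTTS to 16 kHz with sox (one possible tool, not the repo's script):
find /your/path/LibriTTS -name '*.wav' | while read -r wav; do
  out="/your/path/LibriTTS_16k${wav#/your/path/LibriTTS}"
  mkdir -p "$(dirname "$out")"
  sox "$wav" -r 16000 "$out"
done
```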
We include three types of speech features in CTX-vec2wav. They should all be stored in the `feats/` directory in the project root.
- VQ index (together with the codebook) from vq-wav2vec. We extracted it with fairseq, and we provide the extracted VQ index sequences and codebook online.
- PPE auxiliary features. PPE stands for probability of voicing, pitch and energy (all in log scale). We extracted them using Kaldi and, to save you from installing it, we provide the extracted and normalized features online.
- Mel spectrograms (FBanks). As they are too large to provide online, we include a script to extract them locally:
```bash
nj=64  # parallel jobs; set this according to your CPU cores
bash extract_fbank.sh --nj $nj --stage 0 --stop_stage 1
# Default: 80-dim with 10ms frame shift.
# Stage 0 extracts fbank in parallel. Stage 1 performs normalization.
```
This will create `feats/fbank` and `feats/normed_fbank`, each about 16GB. You can delete `feats/fbank` after normalization (though it is better to keep the `train_all/cmvn.ark` there).
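If disk space is tight, a minimal cleanup sketch is shown below. It assumes the layout described above (raw fbank under `feats/fbank`, CMVN statistics stored as `cmvn.ark`) and keeps only those statistics while removing the raw feature files:

```bash
# Remove raw fbank files after normalization, but keep the CMVN statistics
# (paths follow the layout described above; adjust if yours differs):
find feats/fbank -type f ! -name 'cmvn.ark' -delete
find feats/fbank -type d -empty -delete   # drop now-empty sub-directories
```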
Finally, you have correctly formatted the data for training!