This code is an implementation of the paper 'Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis', except for the WaveNet vocoder. The algorithm is based on the following papers:
Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., ... & Le, Q. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.
Wan, L., Wang, Q., Papir, A., & Moreno, I. L. (2017). Generalized end-to-end loss for speaker verification. arXiv preprint arXiv:1710.10467.
Jia, Y., Zhang, Y., Weiss, R. J., Wang, Q., Shen, J., Ren, F., ... & Wu, Y. (2018). Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. arXiv preprint arXiv:1806.04558.
Prenger, R., Valle, R., & Catanzaro, B. (2019, May). Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3617-3621). IEEE.
The model is divided into three parts that are trained independently of each other: the speaker embedding network, Tacotron 2, and the vocoder. Two types of vocoder can be attached: a Tacotron 1 style mel-to-spectrogram converter and WaveGlow.
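As a rough sketch of how the three parts fit together at inference time (the names below are hypothetical and are not this repository's API; see the inference example at the end of this document for the actual usage):

# Conceptual sketch only; 'speaker_encoder', 'tacotron2', and 'vocoder' are hypothetical callables.
def synthesize(reference_wav, text, speaker_encoder, tacotron2, vocoder):
    embedding = speaker_encoder(reference_wav)  # fixed-size speaker embedding from a reference utterance
    mel = tacotron2(text, embedding)            # mel spectrogram conditioned on the embedding
    return vocoder(mel)                         # waveform via the Tacotron 1 style vocoder or WaveGlow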
The currently uploaded code is compatible with the following datasets. An 'O' mark to the left of a dataset name means that dataset was actually used for the uploaded results; an 'X' mark means it is supported but was not used.
Speaker embedding:
[X] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651
[X] LibriSpeech: http://www.openslr.org/12/
[O] VoxCeleb: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/

Mel to spectrogram (Tacotron 1 style vocoder):
[O] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651
[O] LibriSpeech: http://www.openslr.org/12/

WaveGlow (any voice wav files can be used):
[O] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651

Tacotron 2:
[X] LJSpeech: https://keithito.com/LJ-Speech-Dataset/
[O] VCTK: https://datashare.is.ed.ac.uk/handle/10283/2651
[X] LibriSpeech: http://www.openslr.org/12/
[X] Tedlium: http://www.openslr.org/51/
[O] TIMIT: http://academictorrents.com/details/34e2b78745138186976cbc27939b1b34d18bd5b3
Before proceeding, please set the pattern, inference, and checkpoint paths in 'Hyper_Parameter.py' according to your environment.
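For reference, the relevant entries might look like the following; the variable names here are illustrative assumptions, so match them to what 'Hyper_Parameter.py' actually defines:

# Hypothetical sketch; check 'Hyper_Parameter.py' for the real variable names.
Pattern_Path = 'D:/Multi_Speaker_TTS.Pattern'        # root directory for generated pattern files
Inference_Path = 'D:/Multi_Speaker_TTS.Inference'    # directory for in-training inference output
Checkpoint_Path = 'D:/Multi_Speaker_TTS.Checkpoint'  # directory for saving and restoring checkpoints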
Speaker embedding
Generate the training patterns (an example invocation follows the option list):
python -m Speaker_Embedding.Pattern_Generate [options]
option list:
-vctk <path> Set the path of VCTK. VCTK's patterns are generated.
-ls <path> Set the path of LibriSpeech. LibriSpeech's patterns are generated.
-vox1 <path> Set the path of VoxCeleb1. VoxCeleb1's patterns are generated.
-vox2 <path> Set the path of VoxCeleb2. VoxCeleb2's patterns are generated.
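For example, to generate the VCTK and VoxCeleb1 patterns in one run (the dataset paths are placeholders for your own environment, and it is assumed that multiple dataset options can be combined; if not, run the command once per dataset):
python -m Speaker_Embedding.Pattern_Generate -vctk D:/Datasets/VCTK -vox1 D:/Datasets/VoxCeleb1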
To verify the model during training, set the inference file paths by editing 'Speaker_Embedding_Inference_in_Train.txt'.
Train the speaker embedding model:
python -m Speaker_Embedding.Speaker_Embedding
Mel to spectrogram (Tacotron 1 style vocoder)
Generate the training patterns (an example invocation follows the option list):
python -m Taco1_Mel_to_Spect.Pattern_Generate [options]
option list:
-vctk <path> Set the path of VCTK. VCTK's patterns are generated.
-ls <path> Set the path of LibriSpeech. LibriSpeech's patterns are generated.
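For example, with placeholder dataset paths (run the command once per dataset if the options cannot be combined):
python -m Taco1_Mel_to_Spect.Pattern_Generate -vctk D:/Datasets/VCTK -ls D:/Datasets/LibriSpeech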
To verify the model during training, set the inference file paths by editing 'Mel_to_Spect_Inference_in_Train.txt'.
Train the mel-to-spectrogram model:
python -m Taco1_Mel_to_Spect.Taco1_Mel_to_Spect
WaveGlow
There is no pattern generation step; WaveGlow uses wav files directly as patterns.
To verify the model during training, set the inference file paths by editing 'WaveGlow_Inference_File_Path_in_Train.txt'.
Train the WaveGlow model:
python -m WaveGlow.WaveGlow
Tacotron 2
Generate the training patterns (an example invocation follows the option list):
python Pattern_Generate.py [options]
option list:
-lj <path> Set the path of LJSpeech. LJSpeech's patterns are generated.
-vctk <path> Set the path of VCTK. VCTK's patterns are generated.
-ls <path> Set the path of LibriSpeech. LibriSpeech's patterns are generated.
-tl <path> Set the path of Tedlium. Tedlium's patterns are generated.
-timit <path> Set the path of TIMIT. TIMIT's patterns are generated.
-all Save-all option. The generator ignores the 'Use_Wav_Length_Range' hyperparameter. If this option is not set, only patterns matching 'Use_Wav_Length_Range' will be generated.
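For example, with placeholder dataset paths (append -all to disable the 'Use_Wav_Length_Range' filter):
python Pattern_Generate.py -vctk D:/Datasets/VCTK -timit D:/Datasets/TIMIT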
To verify the model during training, set the inference file paths and sentences by editing 'Inference_Sentence_in_Train.txt'.
Train the Tacotron 2 model:
python MSTTS_SV.py
Inference example:

from MSTTS_SV import Tacotron2

# Build the model in inference mode and restore the latest checkpoint.
new_Tacotron2 = Tacotron2(is_Training=False)
new_Tacotron2.Restore()

# Reference audio files; each provides the speaker embedding for one sentence.
path_List = [
    'E:/Multi_Speaker_TTS.Raw_Data/LJSpeech/wavs/LJ040-0143.wav',
    'E:/Multi_Speaker_TTS.Raw_Data/LibriSpeech/train/17/363/17-363-0039.flac',
    'E:/Multi_Speaker_TTS.Raw_Data/VCTK/wav48/p314/p314_020.wav',
    'E:/Multi_Speaker_TTS.Raw_Data/VCTK/wav48/p256/p256_001.wav'
    ]

# Sentences to synthesize, one per reference file.
text_List = [
    'He that has no shame has no conscience.',
    'Who knows much believes the less.',
    'Things are always at their best in the beginning.',
    'Please call Stella.'
    ]

※ The two lists must have the same length.

new_Tacotron2.Inference(
    path_List=path_List,
    text_List=text_List,
    file_Prefix='Result'
    )
Currently, the performance of WaveGlow is not good.
Exported wav files: WAV.zip
https://drive.google.com/drive/folders/1wXrJY-gQTOs9yZ7nxvxPaAa6Wf8uF7zP?usp=sharing
Future work: WaveGlow performance improvement.