Important
The pretrained model (with vq-wav2vec tokens as input) and the training procedure are now released!
Please refer to the `environment` directory for a `requirements.txt` and a `Dockerfile`.
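If you prefer a plain Python environment instead of Docker, installing from the provided requirements file is usually sufficient. A minimal sketch, assuming the file sits at `environment/requirements.txt` as described above:

```bash
# Create an isolated environment and install the pinned dependencies
python -m venv venv && source venv/bin/activate
pip install -r environment/requirements.txt
```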
In addition, for convenience, we also provide a Docker image for Linux, so you can easily run the Docker container:

```bash
docker pull cantabilekwok511/vec2wav2.0:v0.2
docker run -it -v /path/to/vec2wav2.0:/workspace cantabilekwok511/vec2wav2.0:v0.2
```
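If you want GPU access inside the container (e.g. for training or faster inference), you will likely need to expose the host GPUs to Docker as well. A hedged sketch, assuming the NVIDIA Container Toolkit is installed on the host; the `--gpus all` flag is not part of the commands above:

```bash
# Hypothetical: expose all host GPUs to the container (requires the NVIDIA Container Toolkit)
docker run -it --gpus all \
  -v /path/to/vec2wav2.0:/workspace \
  cantabilekwok511/vec2wav2.0:v0.2
```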
We provide a simple VC interface. First, please make sure the required models are downloaded into the `pretrained/` directory:
- vq-wav2vec model from this url
- WavLM-Large from this url
- Pre-trained vec2wav 2.0 (on vq-wav2vec tokens) from 🤗Huggingface
The resulting directory should look like this:

```
pretrained/
├── vq-wav2vec_kmeans.pt
├── WavLM-Large.pt
├── generator.ckpt
└── config.yml
```
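Before running conversion, you can quickly sanity-check that all four files are in place. A minimal sketch based only on the layout above (adjust the paths if you store the models elsewhere):

```bash
# Report any missing pretrained file
for f in vq-wav2vec_kmeans.pt WavLM-Large.pt generator.ckpt config.yml; do
  [ -f "pretrained/$f" ] && echo "found   pretrained/$f" || echo "MISSING pretrained/$f"
done
```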
Then VC can be performed with:

```bash
source path.sh
vc.py -s $source_wav -t $speaker_prompt -o $output_wav
```

where `$source_wav` and `$speaker_prompt` should both be mono-channel audio, preferably `.wav` files.
This script by default tries to load `pretrained/generator.ckpt` and the corresponding `config.yml`. You can provide `--expdir` to change this path.
If you have trained your own model under `$expdir`, please specify the checkpoint filename:

```bash
vc.py -s $source_wav -t $speaker_prompt -o $output_wav \
      --expdir $expdir --checkpoint /path/to/checkpoint.pkl
```
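For example, to convert a whole directory of source utterances against a single speaker prompt, the same options can be wrapped in a loop. A rough sketch that only uses the `-s`/`-t`/`-o` flags shown above; the directory names are illustrative:

```bash
# Hypothetical batch conversion: every .wav in src_wavs/ is converted to the prompt speaker
speaker_prompt=prompts/target_speaker.wav
mkdir -p converted
for src in src_wavs/*.wav; do
  vc.py -s "$src" -t "$speaker_prompt" -o "converted/$(basename "$src")"
done
```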
We also provide a VC web interface using Gradio. To try our online interactive demo, visit 🤗HuggingFace.
To launch it locally:

```bash
# Make sure gradio is installed first
pip install gradio
python vec2wav2/bin/gradio_app.py
```
This will start a local web server and open the interface in your browser. You can:
- Upload source audio (the voice you want to convert)
- Upload target speaker audio (the voice you want to convert to)
- Click "Convert Voice" to perform the conversion
- Listen to or download the converted audio
The web interface uses the same models and settings as the command-line tool.
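If you launch the app on a remote GPU server, you can forward the web interface to your local machine over SSH. A sketch assuming Gradio's default port 7860; the port actually used by `gradio_app.py` may differ:

```bash
# Forward the (assumed) Gradio port 7860 from the remote host to localhost
ssh -L 7860:localhost:7860 user@remote-server
# then open http://localhost:7860 in your local browser
```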
First, we need to set up data manifests and features. Please refer to `./data_prep.md` for a guide on the LibriTTS dataset.
Then, please refer to `./train.sh` for training. It will automatically launch PyTorch DDP training on all the devices in `CUDA_VISIBLE_DEVICES`. Please change `os.environ["MASTER_PORT"]` in `vec2wav2/bin/train.py` if needed.
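For example, to restrict DDP training to a subset of GPUs, set `CUDA_VISIBLE_DEVICES` before invoking the script. A sketch based on the behavior described above; the GPU indices are just an example, and `train.sh` may expect additional arguments:

```bash
# Launch DDP training on GPUs 0 and 1 only
CUDA_VISIBLE_DEVICES=0,1 bash train.sh
```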
If you want to decode VQ features in an existing `feats.scp` into waveforms, you can use:

```bash
decode.py --feats-scp /path/to/feats.scp --prompt-scp /path/to/prompt.scp \
          --checkpoint /path/to/checkpoint.pkl --config /path/to/config.yml \
          --outdir /path/to/output_dir
```
Here, `prompt.scp` maps every utterance (content VQ tokens) to its prompt (WavLM features). It is organized in the same style as `feats.scp`.
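Both files follow the standard Kaldi script-file convention: each line maps an utterance ID to the location of its features in an archive. A purely illustrative example of what `prompt.scp` lines may look like (the IDs and paths are hypothetical); `feats.scp` has the same two-column layout, pointing to the content VQ tokens instead:

```
utt_0001 /data/wavlm_feats/prompt.1.ark:25
utt_0002 /data/wavlm_feats/prompt.1.ark:8967
```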
```bibtex
@article{guo2024vec2wav,
  title={vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders},
  author={Guo, Yiwei and Li, Zhihan and Li, Junjie and Du, Chenpeng and Wang, Hankun and Wang, Shuai and Chen, Xie and Yu, Kai},
  journal={arXiv preprint arXiv:2409.01995},
  year={2024}
}
```
- [paper] vec2wav in VQTTS. Single-speaker.
- [paper][code] CTX-vec2wav in UniCATS. Multi-speaker with acoustic prompts. Lots of code borrowed from there.
- 🌟(This) vec2wav 2.0. Enhanced in timbre controllability, best for VC!
- kan-bayashi/ParallelWaveGAN for the whole project structure.
- NVIDIA/BigVGAN for the vocoder backbone.
- Kaldi and ESPnet for providing useful tools and the Conformer implementation.
- Fairseq for some network architectures.