Important
The pretrained model (with vq-wav2vec tokens as input) and the training procedure are now released!
Please refer to the `environment` directory for a `requirements.txt` and a `Dockerfile`.
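If you prefer a plain Python environment instead of Docker, installing from the provided requirements file is usually sufficient. A minimal sketch, assuming the file sits at `environment/requirements.txt` as described above:

```bash
# Create an isolated environment and install the pinned dependencies
python -m venv venv && source venv/bin/activate
pip install -r environment/requirements.txt
```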
In addition, for convenience, we also provide a Docker image for Linux, so you can easily run the Docker container:

```bash
docker pull cantabilekwok511/vec2wav2.0:v0.2
docker run -it -v /path/to/vec2wav2.0:/workspace cantabilekwok511/vec2wav2.0:v0.2
```
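If you want GPU access inside the container (e.g. for training or faster inference), you will likely need to expose the host GPUs to Docker as well. A hedged sketch, assuming the NVIDIA Container Toolkit is installed on the host; the `--gpus all` flag is not part of the commands above:

```bash
# Hypothetical: expose all host GPUs to the container (requires the NVIDIA Container Toolkit)
docker run -it --gpus all \
  -v /path/to/vec2wav2.0:/workspace \
  cantabilekwok511/vec2wav2.0:v0.2
```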
We provide a simple VC interface. First, please make sure the required models are downloaded into the `pretrained/` directory:
- vq-wav2vec model from this url
- WavLM-Large from this url
- Pre-trained vec2wav 2.0 (on vq-wav2vec tokens) from 🤗Huggingface
The resulting directory should look like this:

```
pretrained/
├── vq-wav2vec_kmeans.pt
├── WavLM-Large.pt
├── generator.ckpt
└── config.yml
```
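Before running conversion, you can quickly sanity-check that all four files are in place. A minimal sketch based only on the layout above (adjust the paths if you store the models elsewhere):

```bash
# Report any missing pretrained file
for f in vq-wav2vec_kmeans.pt WavLM-Large.pt generator.ckpt config.yml; do
  [ -f "pretrained/$f" ] && echo "found   pretrained/$f" || echo "MISSING pretrained/$f"
done
```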
Then VC can be performed with:

```bash
source path.sh
vc.py -s $source_wav -t $speaker_prompt -o $output_wav
```

where `$source_wav` and `$speaker_prompt` should both be mono-channel audio, preferably `.wav` files.
This script by default tries to load `pretrained/generator.ckpt` and the corresponding `config.yml`. You can provide `--expdir` to change this path.
If you have trained your own model under `$expdir`, please specify the checkpoint filename:

```bash
vc.py -s $source_wav -t $speaker_prompt -o $output_wav \
      --expdir $expdir --checkpoint /path/to/checkpoint.pkl
```
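For example, to convert a whole directory of source utterances against a single speaker prompt, the same options can be wrapped in a loop. A rough sketch that only uses the `-s`/`-t`/`-o` flags shown above; the directory names are illustrative:

```bash
# Hypothetical batch conversion: every .wav in src_wavs/ is converted to the prompt speaker
speaker_prompt=prompts/target_speaker.wav
mkdir -p converted
for src in src_wavs/*.wav; do
  vc.py -s "$src" -t "$speaker_prompt" -o "converted/$(basename "$src")"
done
```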
We also provide a VC web interface using Gradio. To try our online interactive demo, visit 🤗HuggingFace.
To launch it locally:

```bash
# Make sure gradio is installed first
pip install gradio
python vec2wav2/bin/gradio_app.py
```
This will start a local web server and open the interface in your browser. You can:
- Upload source audio (the voice you want to convert)
- Upload target speaker audio (the voice you want to convert to)
- Click "Convert Voice" to perform the conversion
- Listen to or download the converted audio
The web interface uses the same models and settings as the command-line tool.
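If you launch the app on a remote GPU server, you can forward the web interface to your local machine over SSH. A sketch assuming Gradio's default port 7860; the port actually used by `gradio_app.py` may differ:

```bash
# Forward the (assumed) Gradio port 7860 from the remote host to localhost
ssh -L 7860:localhost:7860 user@remote-server
# then open http://localhost:7860 in your local browser
```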
First, we need to set up data manifests and features. Please refer to `./data_prep.md` for a guide on the LibriTTS dataset.
Then, please refer to `./train.sh` for training. It will automatically launch PyTorch DDP training on all the devices in `CUDA_VISIBLE_DEVICES`. Please change `os.environ["MASTER_PORT"]` in `vec2wav2/bin/train.py` if needed.
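For example, to restrict DDP training to a subset of GPUs, set `CUDA_VISIBLE_DEVICES` before invoking the script. A sketch based on the behavior described above; the GPU indices are just an example, and `train.sh` may expect additional arguments:

```bash
# Launch DDP training on GPUs 0 and 1 only
CUDA_VISIBLE_DEVICES=0,1 bash train.sh
```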
If you want to decode VQ features in an existing `feats.scp` into waveforms, you can use:

```bash
decode.py --feats-scp /path/to/feats.scp --prompt-scp /path/to/prompt.scp \
          --checkpoint /path/to/checkpoint.pkl --config /path/to/config.yml \
          --outdir /path/to/output_dir
```
Here, `prompt.scp` maps every utterance (content VQ tokens) to its prompt (WavLM features). It is organized in the same style as `feats.scp`.
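Both files follow the standard Kaldi script-file convention: each line maps an utterance ID to the location of its features in an archive. A purely illustrative example of what `prompt.scp` lines may look like (the IDs and paths are hypothetical); `feats.scp` has the same two-column layout, pointing to the content VQ tokens instead:

```
utt_0001 /data/wavlm_feats/prompt.1.ark:25
utt_0002 /data/wavlm_feats/prompt.1.ark:8967
```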
```bibtex
@article{guo2024vec2wav,
  title={vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders},
  author={Guo, Yiwei and Li, Zhihan and Li, Junjie and Du, Chenpeng and Wang, Hankun and Wang, Shuai and Chen, Xie and Yu, Kai},
  journal={arXiv preprint arXiv:2409.01995},
  year={2024}
}
```
- [paper] vec2wav in VQTTS. Single-speaker.
- [paper][code] CTX-vec2wav in UniCATS. Multi-speaker with acoustic prompts. Lots of code borrowed from there.
- 🌟(This) vec2wav 2.0. Enhanced in timbre controllability, best for VC!
- kan-bayashi/ParallelWaveGAN for the whole project structure.
- NVIDIA/BigVGAN for the vocoder backbone.
- Kaldi and ESPnet for providing useful tools and the Conformer implementation.
- Fairseq for some network architectures.