VITS recipe for LibriTTS corpus #1776

Merged 25 commits on Nov 1, 2024
Commits
cbef43f
init commit
JinZr Oct 21, 2024
e0136d9
minor updates
JinZr Oct 21, 2024
2a5aa7c
added VITS recipe
JinZr Oct 21, 2024
f003c1c
Update prepare.sh
JinZr Oct 21, 2024
aba7579
Create shared
JinZr Oct 21, 2024
20e2d5e
minor fixes
JinZr Oct 21, 2024
dc0106a
minor fixes
JinZr Oct 21, 2024
8da9acd
minor updates
JinZr Oct 21, 2024
d99248a
Update prepare_tokens_libritts.py
JinZr Oct 21, 2024
5545bb3
Merge branch 'dev/libritts-tts' of https://github.com/jinzr/icefall i…
JinZr Oct 21, 2024
caa1d41
minor updates to the TTS & CODEC recipes
JinZr Oct 21, 2024
d56f8a7
Merge branch 'dev/libritts-tts' of https://github.com/jinzr/icefall i…
JinZr Oct 21, 2024
3ac1331
minor updates
JinZr Oct 21, 2024
32cdbdf
Update vits.py
JinZr Oct 22, 2024
ca3b495
removed unused imports
JinZr Oct 22, 2024
3c3db1a
minor updates
JinZr Oct 22, 2024
f34b376
Merge branch 'k2-fsa:master' into dev/libritts-tts
JinZr Oct 22, 2024
f9c4ca9
typo fixed
JinZr Oct 27, 2024
6680a7e
Merge branch 'k2-fsa:master' into dev/libritts-tts
JinZr Oct 29, 2024
63822ce
Merge branch 'dev/libritts-tts' of https://github.com/jinzr/icefall i…
JinZr Oct 29, 2024
31ebbf5
Merge branch 'k2-fsa:master' into dev/libritts-tts
JinZr Oct 29, 2024
fe1498a
Merge branch 'dev/libritts-tts' of https://github.com/jinzr/icefall i…
JinZr Oct 30, 2024
87c6f01
Merge branch 'k2-fsa:master' into dev/libritts-tts
JinZr Oct 30, 2024
9f79e21
added pre-trained model
JinZr Oct 30, 2024
589c245
Merge branch 'dev/libritts-tts' of https://github.com/jinzr/icefall i…
JinZr Oct 30, 2024
3 changes: 3 additions & 0 deletions README.md
@@ -333,6 +333,7 @@ We provide a Colab notebook to test the pre-trained model: [![Open In Colab](htt

- [LJSpeech][ljspeech]
- [VCTK][vctk]
- [LibriTTS][libritts_tts]

### Supported Models

@@ -372,6 +373,7 @@ Please see: [![Open In Colab](https://colab.research.google.com/assets/colab-bad
[commonvoice]: egs/commonvoice/ASR
[csj]: egs/csj/ASR
[libricss]: egs/libricss/SURT
[libritts_asr]: egs/libritts/ASR
[libriheavy]: egs/libriheavy/ASR
[mgb2]: egs/mgb2/ASR
[spgispeech]: egs/spgispeech/ASR
@@ -380,3 +382,4 @@ Please see: [![Open In Colab](https://colab.research.google.com/assets/colab-bad

[vctk]: egs/vctk/TTS
[ljspeech]: egs/ljspeech/TTS
[libritts_tts]: egs/libritts/TTS
18 changes: 9 additions & 9 deletions egs/libritts/CODEC/encodec/train.py
@@ -138,7 +138,7 @@ def get_parser():
     parser.add_argument(
         "--save-every-n",
         type=int,
-        default=1,
+        default=5,
         help="""Save checkpoint after processing this number of epochs
         periodically. We save checkpoint to exp-dir/ whenever
         params.cur_epoch % save_every_n == 0. The checkpoint filename
@@ -1093,14 +1093,14 @@ def run(rank, world_size, args):
         rank=rank,
     )

-    # if not params.print_diagnostics:
-    #     scan_pessimistic_batches_for_oom(
-    #         model=model,
-    #         train_dl=train_dl,
-    #         optimizer_g=optimizer_g,
-    #         optimizer_d=optimizer_d,
-    #         params=params,
-    #     )
+    if not params.print_diagnostics:
+        scan_pessimistic_batches_for_oom(
+            model=model,
+            train_dl=train_dl,
+            optimizer_g=optimizer_g,
+            optimizer_d=optimizer_d,
+            params=params,
+        )

     scaler = GradScaler(enabled=params.use_fp16, init_scale=1.0)
     if checkpoints and "grad_scaler" in checkpoints:
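
The `--save-every-n` help text in this file says a checkpoint is written whenever `params.cur_epoch % save_every_n == 0`. The sketch below only illustrates that rule for the two default values that appear in the hunk (1 and 5). The helper `epochs_with_checkpoint` is hypothetical and is not part of the recipe, which checks the condition inside its own training loop.

```python
from typing import List


def epochs_with_checkpoint(num_epochs: int, save_every_n: int) -> List[int]:
    """Return the 1-based epochs for which `epoch % save_every_n == 0`.

    Hypothetical helper for illustration only.
    """
    if save_every_n <= 0:
        raise ValueError("save_every_n must be positive")
    return [e for e in range(1, num_epochs + 1) if e % save_every_n == 0]


if __name__ == "__main__":
    # With save_every_n=1 every epoch is saved; with 5 only every fifth one.
    print(epochs_with_checkpoint(10, 1))
    print(epochs_with_checkpoint(10, 5))
```

For a 400-epoch run, as in the training command of the TTS README in this PR, a value of 5 would correspond to 80 such epochs instead of 400.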
5 changes: 2 additions & 3 deletions egs/libritts/CODEC/prepare.sh
@@ -45,12 +45,11 @@ if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
   # to $dl_dir/LibriTTS
   mkdir -p data/manifests
   if [ ! -e data/manifests/.libritts.done ]; then
-    lhotse prepare libritts --num-jobs 32 $dl_dir/LibriTTS data/manifests
+    lhotse prepare libritts --num-jobs ${nj} $dl_dir/LibriTTS data/manifests
     touch data/manifests/.libritts.done
   fi
 fi

-
 if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
   log "Stage 2: Compute Spectrogram for LibriTTS"
   mkdir -p data/spectrogram
@@ -64,7 +63,7 @@ if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
   if [ ! -f data/spectrogram/libritts_cuts_train-all-shuf.jsonl.gz ]; then
     cat <(gunzip -c data/spectrogram/libritts_cuts_train-clean-100.jsonl.gz) \
       <(gunzip -c data/spectrogram/libritts_cuts_train-clean-360.jsonl.gz) \
-      <(gunzip -c /data/spectrogramlibritts_cuts_train-other-500.jsonl.gz) | \
+      <(gunzip -c data/spectrogram/libritts_cuts_train-other-500.jsonl.gz) | \
       shuf | gzip -c > data/spectrogram/libritts_cuts_train-all-shuf.jsonl.gz
   fi

51 changes: 51 additions & 0 deletions egs/libritts/TTS/README.md
@@ -0,0 +1,51 @@
# Introduction

LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members.
The LibriTTS corpus is designed for TTS research. It is derived from the original materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus.
The main differences from the LibriSpeech corpus are listed below:
1. The audio files are at 24kHz sampling rate.
2. The speech is split at sentence breaks.
3. Both original and normalized texts are included.
4. Contextual information (e.g., neighbouring sentences) can be extracted.
5. Utterances with significant background noise are excluded.
For more information, refer to the paper "LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech", Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu, arXiv, 2019. If you use the LibriTTS corpus in your work, please cite this paper where it was introduced.

> [!CAUTION]
> The next-gen Kaldi framework provides tools and models for generating high-quality, synthetic speech (Text-to-Speech, TTS).
> While these recipes have the potential to advance various fields such as accessibility, language education, and AI-driven solutions, they also carry certain ethical and legal responsibilities.
>
> By using this framework, you agree to the following:
> 1. Legal and Ethical Use: You shall not use this framework, or any models derived from it, for any unlawful or unethical purposes. This includes, but is not limited to: Creating voice clones without the explicit, informed consent of the individual whose voice is being cloned. Engaging in any form of identity theft, impersonation, or fraud using cloned voices. Violating any local, national, or international laws regarding privacy, intellectual property, or personal data.
>
> 2. Responsibility of Use: The users of this framework are solely responsible for ensuring that their use of voice cloning technologies complies with all applicable laws and ethical guidelines. We explicitly disclaim any liability for misuse of the technology.
>
> 3. Attribution and Use of Open-Source Components: This project is provided under the Apache 2.0 license. Users must adhere to the terms of this license and provide appropriate attribution when required.
>
> 4. No Warranty: This framework is provided “as-is,” without warranty of any kind, either express or implied. We do not guarantee that the use of this software will comply with legal requirements or that it will not infringe the rights of third parties.


# VITS

This recipe provides a VITS model trained on the LibriTTS dataset.

A pretrained model can be found [here](https://huggingface.co/zrjin/icefall-tts-libritts-vits-2024-10-30).

The training command is given below:
```
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
./vits/train.py \
--world-size 4 \
--num-epochs 400 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir vits/exp \
--max-duration 500
```

To run inference, use:
```
./vits/infer.py \
--exp-dir vits/exp \
--epoch 400 \
--tokens data/tokens.txt
```
1 change: 1 addition & 0 deletions egs/libritts/TTS/local/compute_spectrogram_libritts.py
1 change: 1 addition & 0 deletions egs/libritts/TTS/local/prepare_token_file.py
89 changes: 89 additions & 0 deletions egs/libritts/TTS/local/prepare_tokens_libritts.py
@@ -0,0 +1,89 @@
#!/usr/bin/env python3
# Copyright 2023 Xiaomi Corp. (authors: Zengwei Yao,
# Zengrui Jin,)
# 2024 Tsinghua University (authors: Zengrui Jin,)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


"""
This file reads the texts in the given manifests and saves new cuts with phoneme tokens.
"""

import logging
from pathlib import Path

import tacotron_cleaner.cleaners
from lhotse import CutSet, load_manifest
from piper_phonemize import phonemize_espeak
from tqdm.auto import tqdm


def remove_punc_to_upper(text: str) -> str:
    text = text.replace("‘", "'")
    text = text.replace("’", "'")
    tokens = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'")
    s_list = [x.upper() if x in tokens else " " for x in text]
    s = " ".join("".join(s_list).split()).strip()
    return s


def prepare_tokens_libritts():
    output_dir = Path("data/spectrogram")
    prefix = "libritts"
    suffix = "jsonl.gz"
    partitions = (
        "dev-clean",
        "dev-other",
        "test-clean",
        "test-other",
        "train-all-shuf",
        "train-clean-460",
        # "train-clean-100",
        # "train-clean-360",
        # "train-other-500",
    )

    for partition in partitions:
        cut_set = load_manifest(output_dir / f"{prefix}_cuts_{partition}.{suffix}")

        new_cuts = []
        for cut in tqdm(cut_set):
            # Each cut only contains one supervision
            assert len(cut.supervisions) == 1, (len(cut.supervisions), cut)
            text = cut.supervisions[0].text
            # Text normalization
            text = tacotron_cleaner.cleaners.custom_english_cleaners(text)
            # Convert to phonemes
            tokens_list = phonemize_espeak(text, "en-us")
            tokens = []
            for t in tokens_list:
                tokens.extend(t)
            cut.tokens = tokens
            cut.supervisions[0].normalized_text = remove_punc_to_upper(text)

            new_cuts.append(cut)

        new_cut_set = CutSet.from_cuts(new_cuts)
        new_cut_set.to_file(
            output_dir / f"{prefix}_cuts_with_tokens_{partition}.{suffix}"
        )


if __name__ == "__main__":
    formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
    logging.basicConfig(format=formatter, level=logging.INFO)

    prepare_tokens_libritts()
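
The two text-handling steps in this script can be tried without lhotse or `piper_phonemize` installed. The sketch below copies the logic of `remove_punc_to_upper` and adds a hypothetical `flatten` helper that mirrors the `tokens.extend(t)` loop. The nested phoneme groups in the test are made up for illustration and are not real `phonemize_espeak` output.

```python
from itertools import chain
from typing import Iterable, List


def remove_punc_to_upper(text: str) -> str:
    # Same logic as the helper in the script: map curly quotes to "'",
    # replace everything except letters, digits and "'" with a space,
    # upper-case the rest, and collapse runs of whitespace.
    text = text.replace("‘", "'").replace("’", "'")
    keep = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'")
    chars = [c.upper() if c in keep else " " for c in text]
    return " ".join("".join(chars).split())


def flatten(groups: Iterable[Iterable[str]]) -> List[str]:
    # Hypothetical helper: concatenates per-group token lists into one list,
    # equivalent to calling `tokens.extend(t)` for each group `t`.
    return list(chain.from_iterable(groups))


if __name__ == "__main__":
    print(remove_punc_to_upper("Hello, world! It’s 9 o'clock."))
    print(flatten([["a", "b"], ["c"]]))
```

Note that the apostrophe is kept on purpose, so contractions such as "IT'S" survive normalization.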
1 change: 1 addition & 0 deletions egs/libritts/TTS/local/validate_manifest.py
134 changes: 134 additions & 0 deletions egs/libritts/TTS/prepare.sh
@@ -0,0 +1,134 @@
#!/usr/bin/env bash

# fix segmentation fault reported in https://github.com/k2-fsa/icefall/issues/674
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

set -eou pipefail

stage=0
stop_stage=100
sampling_rate=24000
nj=32

dl_dir=$PWD/download

. shared/parse_options.sh || exit 1

# All files generated by this script are saved in "data".
# You can safely remove "data" and rerun this script to regenerate it.
mkdir -p data

log() {
  # This function is from espnet
  local fname=${BASH_SOURCE[1]##*/}
  echo -e "$(date '+%Y-%m-%d %H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}

log "dl_dir: $dl_dir"

if [ $stage -le -1 ] && [ $stop_stage -ge -1 ]; then
  log "Stage -1: build monotonic_align lib"
  if [ ! -d vits/monotonic_align/build ]; then
    cd vits/monotonic_align
    python setup.py build_ext --inplace
    cd ../../
  else
    log "monotonic_align lib already built"
  fi
fi

if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
  log "Stage 0: Download data"

  # If you have pre-downloaded it to /path/to/LibriTTS,
  # you can create a symlink
  #
  #   ln -sfv /path/to/LibriTTS $dl_dir/LibriTTS
  #
  if [ ! -d $dl_dir/LibriTTS ]; then
    lhotse download libritts $dl_dir
  fi

  if [ ! -d $dl_dir/xvector_nnet_1a_libritts_clean_460 ]; then
    log "Downloading x-vector"

    git clone https://huggingface.co/datasets/zrjin/xvector_nnet_1a_libritts_clean_460 $dl_dir/xvector_nnet_1a_libritts_clean_460

    mkdir -p exp/xvector_nnet_1a/
    cp -r $dl_dir/xvector_nnet_1a_libritts_clean_460/* exp/xvector_nnet_1a/
  fi

fi

if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
  log "Stage 1: Prepare LibriTTS manifest"
  # We assume that you have downloaded the LibriTTS corpus
  # to $dl_dir/LibriTTS
  mkdir -p data/manifests
  if [ ! -e data/manifests/.libritts.done ]; then
    lhotse prepare libritts --num-jobs ${nj} $dl_dir/LibriTTS data/manifests
    touch data/manifests/.libritts.done
  fi
fi

if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
  log "Stage 2: Compute Spectrogram for LibriTTS"
  mkdir -p data/spectrogram
  if [ ! -e data/spectrogram/.libritts.done ]; then
    ./local/compute_spectrogram_libritts.py --sampling-rate $sampling_rate
    touch data/spectrogram/.libritts.done
  fi

  # Here we shuffle and combine the train-clean-100, train-clean-360 and
  # train-other-500 together to form the training set.
  if [ ! -f data/spectrogram/libritts_cuts_train-all-shuf.jsonl.gz ]; then
    cat <(gunzip -c data/spectrogram/libritts_cuts_train-clean-100.jsonl.gz) \
      <(gunzip -c data/spectrogram/libritts_cuts_train-clean-360.jsonl.gz) \
      <(gunzip -c data/spectrogram/libritts_cuts_train-other-500.jsonl.gz) | \
      shuf | gzip -c > data/spectrogram/libritts_cuts_train-all-shuf.jsonl.gz
  fi

  # Here we shuffle and combine train-clean-100 and train-clean-360
  # to form the train-clean-460 subset.
  if [ ! -f data/spectrogram/libritts_cuts_train-clean-460.jsonl.gz ]; then
    cat <(gunzip -c data/spectrogram/libritts_cuts_train-clean-100.jsonl.gz) \
      <(gunzip -c data/spectrogram/libritts_cuts_train-clean-360.jsonl.gz) | \
      shuf | gzip -c > data/spectrogram/libritts_cuts_train-clean-460.jsonl.gz
  fi

  if [ ! -e data/spectrogram/.libritts-validated.done ]; then
    log "Validating data/spectrogram for LibriTTS"
    ./local/validate_manifest.py \
      data/spectrogram/libritts_cuts_train-all-shuf.jsonl.gz
    touch data/spectrogram/.libritts-validated.done
  fi
fi

if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
  log "Stage 3: Prepare phoneme tokens for LibriTTS"
  # We assume you have installed piper_phonemize and espnet_tts_frontend.
  # If not, please install them with:
  #   - piper_phonemize:
  #       refer to https://github.com/rhasspy/piper-phonemize,
  #       could install the pre-built wheels from https://github.com/csukuangfj/piper-phonemize/releases/tag/2023.12.5
  #   - espnet_tts_frontend:
  #       `pip install espnet_tts_frontend`, refer to https://github.com/espnet/espnet_tts_frontend/
  if [ ! -e data/spectrogram/.libritts_with_token.done ]; then
    ./local/prepare_tokens_libritts.py
    touch data/spectrogram/.libritts_with_token.done
  fi
fi

if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
  log "Stage 4: Generate token file"
  # We assume you have installed piper_phonemize and espnet_tts_frontend.
  # If not, please install them with:
  #   - piper_phonemize:
  #       refer to https://github.com/rhasspy/piper-phonemize,
  #       could install the pre-built wheels from https://github.com/csukuangfj/piper-phonemize/releases/tag/2023.12.5
  #   - espnet_tts_frontend:
  #       `pip install espnet_tts_frontend`, refer to https://github.com/espnet/espnet_tts_frontend/
  if [ ! -e data/tokens.txt ]; then
    ./local/prepare_token_file.py --tokens data/tokens.txt
  fi
fi
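
Stage 2 of this script builds the combined training manifests with a `cat <(gunzip -c ...) | shuf | gzip -c` pipeline. As a rough Python equivalent, the sketch below concatenates gzipped JSONL files, shuffles their lines, and writes one gzipped output. The function name `combine_and_shuffle` and its `seed` parameter are assumptions made for illustration; the recipe itself uses only the shell pipeline.

```python
import gzip
import random
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def combine_and_shuffle(
    inputs: Iterable[Path], output: Path, seed: Optional[int] = None
) -> int:
    """Concatenate gzipped JSONL files, shuffle the lines, write a gzipped file.

    Hypothetical helper. Like `shuf`, it holds all lines in memory.
    Returns the number of lines written.
    """
    lines = []
    for path in inputs:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            # Skip blank lines and make sure every kept line ends with "\n".
            lines.extend(line.rstrip("\n") + "\n" for line in f if line.strip())
    random.Random(seed).shuffle(lines)
    with gzip.open(output, "wt", encoding="utf-8") as f:
        f.writelines(lines)
    return len(lines)


if __name__ == "__main__":
    # Small self-contained demo using temporary files instead of real manifests.
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        part_a = tmp_dir / "a.jsonl.gz"
        part_b = tmp_dir / "b.jsonl.gz"
        with gzip.open(part_a, "wt", encoding="utf-8") as f:
            f.write('{"id": 1}\n{"id": 2}\n')
        with gzip.open(part_b, "wt", encoding="utf-8") as f:
            f.write('{"id": 3}\n')
        n = combine_and_shuffle([part_a, part_b], tmp_dir / "all.jsonl.gz", seed=0)
        print(f"wrote {n} lines")
```

Passing a fixed `seed` makes the shuffle reproducible, which the plain `shuf` call in the script does not guarantee.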
1 change: 1 addition & 0 deletions egs/libritts/TTS/shared
1 change: 1 addition & 0 deletions egs/libritts/TTS/vits/duration_predictor.py
1 change: 1 addition & 0 deletions egs/libritts/TTS/vits/flow.py
1 change: 1 addition & 0 deletions egs/libritts/TTS/vits/generator.py
1 change: 1 addition & 0 deletions egs/libritts/TTS/vits/hifigan.py