This guide is designed to help you understand and maintain this project. If you wish to make changes to a feature, please read this first.
Everyone is welcome to make modifications to the project and propose their changes in pull requests. Please make sure that you understand and have tested your changes before sharing.
To make a change, create a Pull Request with a title explaining the change and a description listing the changes. Also ensure the following:
- You have tested the changes and are confident existing functionality has not been impacted
- You have formatted the code using black with the command `black . -l 120`
- The feature works in the executable (which can be built with the command below)
Before building the executable you will need to create a Python environment with the requirements installed (`requirements.txt`) as well as `pyinstaller`.
You can then run the build script with `python build_exe.py`.
This will add the executable to the `dist` folder.
To run the unit tests you will need to download the Test samples zip and extract it to a directory called `test_samples` within the project.
You will also need to install `pytest`.
You can then run the tests with the command `pytest`.
The frontend application is built using `flask` and `flask-socketio`.
`main.py` starts the app and opens the browser, but the majority of the app is handled in the `application` folder.
`check_ffmpeg.py` checks that an FFmpeg install is found and, if not, will install it. On Linux it runs the install command; on Windows it downloads and extracts an FFmpeg zip folder. This could fail if the URL is no longer supported.
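For illustration, a minimal sketch of this kind of check; the actual logic in `check_ffmpeg.py` may differ, and both the apt command and the `download_and_extract_ffmpeg_zip` helper are assumptions:

```python
import shutil
import subprocess
import sys

def check_ffmpeg():
    if shutil.which("ffmpeg"):
        return  # an FFmpeg install was found on PATH
    if sys.platform.startswith("linux"):
        # Run the install command (apt is an assumption here)
        subprocess.run(["sudo", "apt-get", "install", "-y", "ffmpeg"], check=True)
    elif sys.platform == "win32":
        # Download and extract an FFmpeg zip folder; this hypothetical helper
        # is where a stale download URL would cause a failure
        download_and_extract_ffmpeg_zip()
```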
`views.py` contains all of the endpoints. Depending on the complexity of the task it may call `start_progress_thread`. This takes a function and runs it in a background thread using `flask-socketio`. These tasks must do the following to be supported by `start_progress_thread` (a minimal example follows this list):
- The function must write info to the passed logger (called `logging`)
- To update the progress bar it must write a log in the format "Progress - x/y" where x is the current iteration and y is the total iterations (i.e. "Progress - 25/100")
- To pin a message it must write a log in the format "Status - text" where text is the message (i.e. "Status - Score is 0.5")
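As an example, a task compatible with this contract might look like the following; the task name and body are illustrative:

```python
import time

def example_task(logging):
    total = 100
    for i in range(1, total + 1):
        time.sleep(0.01)  # stand-in for a unit of real work
        logging.info(f"Progress - {i}/{total}")  # updates the progress bar
    logging.info("Status - Score is 0.5")  # pins a message
    # Do not create other threads here (see the note below)
```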
The frontend handles messages from the thread in `application.js`.
If an error occurs, this will be sent in an "error" message to the frontend. When complete, the handler will send a "done" message to the frontend which shows the "next" button.
Please note: The function called inside this thread cannot create other threads or the application may crash.
The dataset builder uses a range of libraries including `pydub`, `librosa`, `torch`, `wave` and `webrtcvad`.
The main entry point for the builder is `dataset/create_dataset.py`. It does 3 things (sketched below):
- Converts the audio to a consistent format using FFmpeg in `audio_processing.py`
- Generates clips using `clip_generator.py`
- Generates an info file using `analysis.py`
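A rough sketch of that flow; the function names and signatures here are assumptions, not the scripts' actual API:

```python
def create_dataset(audio_path, text_path, output_folder):
    converted = convert_audio(audio_path)         # audio_processing.py: FFmpeg conversion
    clips = generate_clips(converted, text_path,  # clip_generator.py: forced alignment
                           output_folder)
    save_info_file(clips, output_folder)          # analysis.py: info file
```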
The forced alignment process used in `clip_generator.py` can be found in the `forced_alignment` folder and is based on https://github.com/mozilla/DSAlign. This library is able to take the source audio and text and split them into clips.
It uses a `TranscriptionModel` object from `transcribe.py` (currently supporting silero or deepspeech) to convert speech to text and will delete clips which do not meet a minimum similarity score.
`clip_generator.py` will also remove clips with an invalid duration and save the resulting filenames and text to a metadata file (typically called "metadata.csv") in the format filename|text (i.e. "clip_1.wav|Hello, how are you?").
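For reference, a metadata file in that pipe-delimited format can be read like this:

```python
with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        # Each line is "filename|text", e.g. "clip_1.wav|Hello, how are you?"
        filename, text = line.rstrip("\n").split("|", 1)
        print(filename, text)
```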
`extend_existing_dataset.py` uses `clip_generator.py` to extend an existing dataset, and adds a suffix to filenames to differentiate sources (illustrated below).
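A hypothetical illustration of the suffixing idea; the actual naming scheme in `extend_existing_dataset.py` may differ:

```python
def add_suffix(filename, suffix):
    # e.g. "clip_1.wav" + "source2" -> "clip_1-source2.wav"
    name, ext = filename.rsplit(".", 1)
    return f"{name}-{suffix}.{ext}"
```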
The training script implements a modified version of https://github.com/NVIDIA/tacotron2.
Found in `train.py`, it adds a few features the existing project did not have:
- It ensures CUDA is enabled before starting. This is required for `torch` to use the GPU and is essential for this model.
- It automatically calculates what batch size & learning rate to use depending on the available GPU memory (a rough sketch of this follows the list). This is only a conservative estimate so it can be tweaked, but it is a useful starting point for inexperienced users
- It can automatically search the output model folder to find a checkpoint to start training from. This is what enables the easy start-stop functionality, so that users can continue training from where they left off
- It can enable early stopping, which will stop training if the loss over the last 10 checkpoints has not sufficiently decreased (minimum loss reached)
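A minimal sketch of the CUDA check and a memory-based heuristic; the numbers below are purely illustrative and are not the formula `train.py` actually uses:

```python
import torch

assert torch.cuda.is_available(), "CUDA is required to train this model"

# Scale batch size conservatively with available GPU memory (illustrative values)
memory_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
batch_size = max(1, int(memory_gb // 2) * 4)
learning_rate = 3e-4 * (batch_size / 32)  # scale learning rate with batch size
```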
The synthesis script implements https://github.com/jik876/hifi-gan.
The synthesis process is implemented in `synthesize.py`. It first loads the feature predictor model (from training) and a pretrained vocoder model (hifi-gan). It then cleans the text and infers the results. Audio & an alignment graph can be produced from this.
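At a high level, the flow resembles the following sketch; all names here are assumptions rather than `synthesize.py`'s actual API:

```python
def synthesize(text, model_path, vocoder_path):
    model = load_model(model_path)         # feature predictor from training
    vocoder = load_vocoder(vocoder_path)   # pretrained hifi-gan vocoder
    cleaned = clean_text(text)             # text cleaning
    mel, alignment = model.infer(cleaned)  # inference
    audio = vocoder(mel)                   # waveform from the mel spectrogram
    return audio, alignment                # audio & alignment graph
```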