Flowchart of the AI network.
Google API + WER/MER/WIL Metric
pip install SpeechRecognition google-cloud-speech google-api-python-client oauth2client jiwer
Test: cd utils && python speech2text.py
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference_sources, estimated_sources, compute_permutation=True)
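A minimal sketch of how these metrics might be computed with jiwer and mir_eval; the transcripts and waveforms below are placeholder data, not project files.

```python
# Hedged sketch: WER/MER/WIL via jiwer, SDR/SIR/SAR via mir_eval. Placeholder inputs only.
import numpy as np
import jiwer
import mir_eval

reference_text = "please turn on the kitchen lights"
hypothesis_text = "please turn on kitchen light"

print("WER:", jiwer.wer(reference_text, hypothesis_text))
print("MER:", jiwer.mer(reference_text, hypothesis_text))
print("WIL:", jiwer.wil(reference_text, hypothesis_text))

# BSS metrics expect arrays shaped (n_sources, n_samples).
reference_sources = np.random.randn(1, 16000)
estimated_sources = reference_sources + 0.05 * np.random.randn(1, 16000)
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    reference_sources, estimated_sources, compute_permutation=True)
print("SDR:", sdr, "SIR:", sir, "SAR:", sar)
```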
python metric_eva_focus.py -c config/focusTest.yaml -e model/embedder.pt --checkpoint_path ../trained_model/enhance_my_voice/chkpt_201000.pt -o eva-focus -m focus -g 0 -x [noise]-[XdB].xlsx
python metric_eva_hide.py -c config/hideTest.yaml -e model/embedder.pt --checkpoint_path ../trained_model/hide_my_voice/chkpt_304000.pt -o eva-hide -m hide -g 0 -x [noise]-[XdB].xlsx
python inference.py -c [config yaml] -e [path of embedder pt file] --checkpoint_path [path of chkpt pt file] -m [path of mixed wav file] -r [path of reference wav file] -g 1 -o [output directory]
Note that the checkpoint passed to --checkpoint_path determines whether the model hides or focuses the voice.
python trainer.py -c [config yaml] -e [path of embedder pt file] -g 1 -l power/mse -m [name] -h 1/0
-h (1/0) selects whether to train a hide-voice model or a focus-voice model.
| Version | Description |
|---|---|
| V0 | Original version of VoiceFilter |
| V1.0 | + mixed_wav -> denoised_wav -> stft -> denoised_mag -> loss |
| V2.0 | + mixed_wav -> denoised_wav -> stft -> denoised_mag -> loss |
| V2.1 | |
| V3.0 | |
| V3.1 | Apply normalization after mixed_mag - noise_mag |
| V3.1.1 | |
| V3.1.2 | Change dataloader, get new_target_wav = mixed_wav - target_wav |
| V3.2 | |
| V3.2.1 | Add 3 different evaluations for wavs based on v3.2 |
| V3.2.2 | |
| V3.2.3 | Use plus to train hide my voice, add dataloader option for old and new dataset |
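For illustration, the V3.1.2 dataloader change amounts to training against the residual of the mixture; a minimal sketch (the array contents are placeholders):

```python
# Illustrative only: V3.1.2 uses the residual of the mix as the new training target.
import numpy as np

mixed_wav = np.random.randn(16000)    # placeholder mixture waveform
target_wav = np.random.randn(16000)   # placeholder clean target waveform
new_target_wav = mixed_wav - target_wav
```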
| Dataset | PATH |
|---|---|
| train-100 | |
| audios after normalize.sh | /srv/node/sdc1/LibriSpeech |
| spectrograms after generator.py | |
| train-360 | |
| audios after normalize.sh | /srv/node/sdc1/medium-LibriSpeech |
| spectrograms and phases after v2 generator.py | /srv/node/sdc1/medium-processed-audio |
| New dataset | |
| New dataset based on train-360 | /srv/node/sdd1/new-processed-audio |
| Period | chenning | hanqing |
|---|---|---|
| 0701-0703 | Power loss [x] | Reproduction [x] |
| 0704-0704 | Code review [x], dataset production [x] | |
| 0705-0708 | Paper introduction draft & pipeline optimization [x] | code v3 [x] |
| 0709-0711 | System design | Preliminary on public dataset |
| 0713-0718 | Finish experimental evaluation | Finish experimental evaluation |
| 0720-0725 | Finish user case 1 | Finish user case 1 |
| 0727-0801 | Finish user case 2 | Finish user case 2 |
| 0803-0808 | Paper v1 | Paper v1 |
| 0809-0814 | Paper submission | Paper submission |
Unofficial PyTorch implementation of Google AI's VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.
- Python and packages

  ```bash
  pip install -r requirements.txt
  ```

- Download LibriSpeech dataset

  To replicate the VoiceFilter paper, get the LibriSpeech dataset at http://www.openslr.org/12/. `train-clean-100.tar.gz` (6.3G) contains speech of 252 speakers, and `train-clean-360.tar.gz` (23G) contains 922 speakers. You may use either, but the more speakers the dataset contains, the better VoiceFilter will perform.
- Resample & Normalize wav files

  First, unzip the `tar.gz` file to the desired folder:

  ```bash
  tar -xvzf train-clean-360.tar.gz
  ```

  Next, copy `utils/normalize-resample.sh` to the root directory of the unzipped data folder. Then:

  ```bash
  vim normalize-resample.sh   # set "N" as your CPU core number
  chmod a+x normalize-resample.sh
  ./normalize-resample.sh     # this may take long
  ```
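  Conceptually, the script resamples each file and normalizes its amplitude. A rough per-file Python equivalent is sketched below; the 16 kHz target rate and peak normalization are assumptions, not values read from the script.

  ```python
  # Hedged sketch of what normalize-resample.sh does per file (assumed 16 kHz target).
  import librosa
  import soundfile as sf

  def normalize_resample(in_path, out_path, sr=16000):
      wav, _ = librosa.load(in_path, sr=sr)    # load and resample
      wav = wav / max(abs(wav).max(), 1e-8)    # peak-normalize amplitude
      sf.write(out_path, wav, sr)

  normalize_resample("speaker/chapter/utt.flac", "speaker/chapter/utt.wav")
  ```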
- Edit `config.yaml`

  ```bash
  cd config
  cp default.yaml config.yaml
  vim config.yaml
  ```

  Change `train_dir` and `test_dir`. Maintain a separate `config.yaml` on the desktop and on the server.
- Preprocess wav files

  To boost training speed, perform the STFT for each file before training:

  ```bash
  python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]
  ```

  This will create 100,000 (train) + 1,000 (test) examples (about 160G).
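  As a rough illustration, the per-file preprocessing boils down to computing an STFT magnitude (and, in v1.0/v2.0, the phase). The STFT parameters below are illustrative; the real values come from `config.yaml`.

  ```python
  # Conceptual sketch of the per-file STFT preprocessing (parameters are illustrative).
  import librosa
  import numpy as np
  import torch

  wav, sr = librosa.load("mixed.wav", sr=16000)
  spec = librosa.stft(wav, n_fft=1200, hop_length=160, win_length=400)
  mag = np.abs(spec).astype(np.float32)    # magnitude used for training
  phase = np.angle(spec)                   # phase, kept by the v1.0/v2.0 generator
  torch.save(torch.from_numpy(mag), "mixed_mag.pt")   # assumed output format
  ```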
- Running the v0 `generator.py` produces `mixed_mag`, `mixed_wav`, `target_mag`, `target_wav`, and `d_vector.txt`. Note that `d_vector.txt` holds the path of the reference audio.

- Running the v1.0 or v2.0 `generator.py` additionally produces `mixed_phase` and `target_phase`.

- On the server side, DO NOT use `-p` for multiprocessing.
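As a rough illustration of how these generated features might be inspected, the sketch below assumes the magnitudes are saved as torch tensors and that `d_vector.txt` is a plain text file holding the reference-audio path; the file names and formats are assumptions, not the repo's documented layout.

```python
# Hypothetical inspection of generator.py outputs; file names and formats are assumed.
import torch
import librosa

mixed_mag = torch.load("000000-mixed_mag.pt")    # assumed magnitude tensor
target_mag = torch.load("000000-target_mag.pt")  # assumed magnitude tensor

with open("000000-d_vector.txt") as f:
    ref_wav_path = f.read().strip()              # path of the reference audio

ref_wav, sr = librosa.load(ref_wav_path, sr=16000)
print(mixed_mag.shape, target_mag.shape, ref_wav.shape)
```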
- Run

  After specifying `train_dir` and `test_dir` in `config.yaml`, run:

  ```bash
  python trainer.py -c [config yaml] -e [path of embedder pt file] -g 1 -l power/mse -m [name]
  ```

  This will create `chkpt/name` and `logs/name` at the base directory (`-b` option, `.` by default).
- Add `-g` to choose the CUDA device; the default is device 1. This arg is required.

- Add `-l` to select the loss type; the default is power loss. Switch to MSE loss by setting this arg to `mse`.
- View tensorboardX

  ```bash
  tensorboard --logdir ./logs
  ```
- Resuming from checkpoint

  ```bash
  python trainer.py -c [config yaml] --checkpoint_path [chkpt/name/chkpt_{step}.pt] -e [path of embedder pt file] -g 1 -l power/mse -m name
  ```

- Inference

  ```bash
  python inference.py -c [config yaml] -e [path of embedder pt file] --checkpoint_path [path of chkpt pt file] -m [path of mixed wav file] -r [path of reference wav file] -g 1 -o [output directory]
  ```
- Try power-law compressed reconstruction error as the loss function, instead of MSE. (See #14)
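A minimal sketch of what a power-law compressed reconstruction loss could look like next to plain MSE is shown below; the 0.3 compression exponent is an illustrative choice, not a value taken from this repo's config.

```python
# Hedged sketch: power-law compressed spectrogram loss vs. plain MSE (exponent is illustrative).
import torch

def mse_loss(est_mag, target_mag):
    return torch.mean((est_mag - target_mag) ** 2)

def power_law_loss(est_mag, target_mag, power=0.3):
    # Compressing the magnitudes keeps loud time-frequency bins from dominating the error.
    return torch.mean((est_mag ** power - target_mag ** power) ** 2)

est = torch.rand(601, 301)   # placeholder estimated magnitude spectrogram
ref = torch.rand(601, 301)   # placeholder target magnitude spectrogram
print(mse_loss(est, ref).item(), power_law_loss(est, ref).item())
```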