>>> happycry
[May 11, 2019, 9:08am]
Hi,
I'm attempting to train Tacotron2 (from the dev-tacotron2 branch) using
multiple GPUs. Using 4 V100s, the steps per second seem to be slower than
when training on a single GPU. This is my config:
{
'run_name': 'moz',
'run_description': 'Train from scratch',
'audio':{
// Audio processing parameters
'num_mels': 80, // size of the mel spec frame.
'num_freq': 1025, // number of stft frequency levels. Size of the linear spectrogram frame.
'sample_rate': 22050, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
'frame_length_ms': 50, // stft window length in ms.
'frame_shift_ms': 12.5, // stft window hop-length in ms.
'preemphasis': 0.98, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
'min_level_db': -100, // normalization range
'ref_level_db': 20, // reference level db, theoretically 20db is the sound of air.
'power': 1.5, // value to sharpen wav signals after GL algorithm.
'griffin_lim_iters': 60,// #griffin-lim iterations. 30-60 is a good range. The larger the value, the slower the generation.
// Normalization parameters
'signal_norm': true, // normalize the spec values in range [0, 1]
'symmetric_norm': false, // move normalization to range [-1, 1]
'max_norm': 1, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
'clip_norm': true, // clip normalized values into the range.
'mel_fmin': 0.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
'mel_fmax': 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
'do_trim_silence': true // enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
},
'distributed':{
'backend': 'nccl',
'url': 'tcp://localhost:54321'
},
'reinit_layers': [],
'model': 'Tacotron2', // one of the models in models/
'grad_clip': 1, // upper limit for gradients for clipping.
'epochs': 1000, // total number of epochs to train.
'lr': 0.0001, // Initial learning rate. If Noam decay is active, maximum learning rate.
'lr_decay': false, // if true, Noam learning rate decaying is applied through training.
'warmup_steps': 4000, // Noam decay steps to increase the learning rate from 0 to 'lr'
'windowing': false, // Enables attention windowing. Used only in eval mode.
'memory_size': 5, // ONLY TACOTRON - memory queue size used to queue network predictions to feed autoregressive connection. Useful if r < 5.
'attention_norm': 'softmax', // softmax or sigmoid. Suggested to use softmax for Tacotron2 and sigmoid for Tacotron.
'prenet_type': 'bn', // ONLY TACOTRON2 - 'original' or 'bn'.
'use_forward_attn': true, // ONLY TACOTRON2 - if it uses forward attention. In general, it aligns faster.
'transition_agent': false, // ONLY TACOTRON2 - enable/disable transition agent of forward attention.
'loss_masking': false, // enable / disable loss masking against the sequence padding.
'enable_eos_bos_chars': true, // enable/disable beginning of sentence and end of sentence chars.
'batch_size': 32, // Batch size for training. Lower values than 32 might cause hard to learn attention.
'eval_batch_size':16,
'r': 1, // Number of frames to predict per step.
'wd': 0.000001, // Weight decay weight.
'checkpoint': true, // If true, it saves checkpoints per 'save_step'
'save_step': 1000, // Number of training steps expected to save training stats and checkpoints.
'print_step': 10, // Number of steps to log training on console.
'tb_model_param_stats': true, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
'batch_group_size': 8, //Number of batches to shuffle after bucketing.
'run_eval': true,
'test_delay_epochs': 100, //Until attention is aligned, testing only wastes computation time.
'data_path': '/home/TTS/LJSpeech-1.1', // DATASET-RELATED: can be overwritten from command argument
'meta_file_train': 'metadata_train.csv', // DATASET-RELATED: metafile for training dataloader.
'meta_file_val': 'metadata_val.csv', // DATASET-RELATED: metafile for evaluation dataloader.
'dataset': 'ljspeech', // DATASET-RELATED: one of TTS.dataset.preprocessors depending on your target dataset. Use 'tts_cache' for pre-computed dataset by extract_features.py
'min_seq_len': 0, // DATASET-RELATED: minimum text length to use in training
'max_seq_len': 150, // DATASET-RELATED: maximum text length
'output_path': '/home/TTS/ljspeech_models', // DATASET-RELATED: output path for all training outputs.
'num_loader_workers': 8, // number of training data loader processes. Don't set it too big. 4-8 are good values.
'num_val_loader_workers': 4, // number of evaluation data loader processes.
'phoneme_cache_path': 'ljspeech_phonemes', // phoneme computation is slow, therefore, it caches results in the given folder.
'use_phonemes': true, // use phonemes instead of raw characters. It is suggested for better pronunciation.
'phoneme_language': 'en-us', // depending on your target language, pick one from https://github.com/bootphon/phonemizer#languages
'text_cleaner': 'phoneme_cleaners'
}
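For reference, the 'distributed' section above is, as I understand it, what ends up in PyTorch's process-group setup. A minimal sketch of the equivalent call, assuming one process per GPU with RANK and WORLD_SIZE provided by the launcher (not taken from this repo's code):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Values taken from the 'distributed' section of the config above.
backend = 'nccl'
init_url = 'tcp://localhost:54321'

# Assumption: RANK and WORLD_SIZE are set by whatever launches one
# process per GPU (WORLD_SIZE=4 for the four V100s).
rank = int(os.environ['RANK'])
world_size = int(os.environ['WORLD_SIZE'])

dist.init_process_group(backend=backend, init_method=init_url,
                        rank=rank, world_size=world_size)
torch.cuda.set_device(rank)

# Stand-in module just to show the wrapping; the real model is Tacotron2.
model = torch.nn.Linear(80, 80).cuda(rank)
model = DDP(model, device_ids=[rank])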
| > Step:6/71 GlobalStep:50 TotalLoss:0.54487 PostnetLoss:0.43788 DecoderLoss:0.10699 StopLoss:0.66825 GradNorm:0.21360 GradNormST:0.77358 AvgTextLen:46.2 AvgSpecLen:226.4 StepTime:6.46 LR:0.000100
| > Step:16/71 GlobalStep:60 TotalLoss:0.53199 PostnetLoss:0.44408 DecoderLoss:0.08792 StopLoss:0.65621 GradNorm:0.20161 GradNormST:0.78307 AvgTextLen:63.0 AvgSpecLen:317.3 StepTime:8.86 LR:0.000100
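For what it's worth, StepTime above is seconds per optimizer step, so a like-for-like comparison with the single-GPU run would be samples per second rather than steps per second. Assuming batch_size (32) is the per-GPU batch, the rough throughput of the two logged steps works out as:

# Assumption: 'batch_size' in the config is the per-GPU batch, so each
# distributed step processes batch_size * num_gpus samples in total.
batch_size = 32
num_gpus = 4

for step_time in (6.46, 8.86):   # StepTime values from the log above
    samples_per_sec = batch_size * num_gpus / step_time
    print(f'StepTime {step_time:.2f}s -> {samples_per_sec:.1f} samples/s')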
Any reasons why this might be happening?
[This is an archived TTS discussion thread from discourse.mozilla.org/t/slow-distributed-training]