Error occurred during extracting mel-spectrogram #11

Open
wlsdbtjr opened this issue Apr 9, 2024 · 11 comments

@wlsdbtjr

wlsdbtjr commented Apr 9, 2024

Thank you for your interesting and valuable research.
I'm having trouble running the following command in the terminal:

bash extract_fbank.sh --stage 0 --stop_stage 2 --nj 16

The sampling rate of the original LJSpeech dataset is 22050 Hz, and an error seems to occur while downsampling it to 16 kHz.

This is the error message written in 'exp/make_fbank/ljspeech/train/make_fbank_train.*.log'.

`Traceback (most recent call last):
  File "path/to/VoiceFlow-TTS/utils/compute-fbank-feats.py", line 105, in <module>
    main()
  File "/path/to/VoiceFlow-TTS/utils/compute-fbank-feats.py", line 86, in main
    assert rate == args.fs
AssertionError
# Accounting: time=2 threads=1
# Ended (code 1) at Tue 09 Apr 2024 02:01:15 AM UTC, elapsed time 2 seconds`

Thank you.

@cantabile-kwok
Member

cantabile-kwok commented Apr 9, 2024

In this case, please change the following line in extract_fbank.sh to match your sampling rate (22050 Hz).

https://github.com/X-LANCE/VoiceFlow-TTS/blob/248c822fd34270b44d4664a68ce2f6a177980f27/extract_fbank.sh#L5C1-L5C48

@wlsdbtjr
Author

wlsdbtjr commented Apr 9, 2024

Thank you for your response. However, even after modifying the code as suggested, I encountered an issue where the duration and mel shape did not match during training.

In the end, the solution was simply to convert all the data to 16 kHz before training, as described in your paper. Thank you.
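
For reference, a minimal downsampling sketch (librosa/soundfile and the directory paths here are assumptions for illustration, not part of the repo) that converts the LJSpeech wavs to 16 kHz before running extract_fbank.sh:

```python
# Hypothetical pre-processing script: resample LJSpeech wavs from 22050 Hz to 16 kHz.
import glob
import os

import librosa
import soundfile as sf

src_dir = "LJSpeech-1.1/wavs"      # assumed input directory
dst_dir = "LJSpeech-1.1/wavs_16k"  # assumed output directory
os.makedirs(dst_dir, exist_ok=True)

for path in glob.glob(os.path.join(src_dir, "*.wav")):
    audio, sr = librosa.load(path, sr=None)  # keep the native 22050 Hz rate
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    sf.write(os.path.join(dst_dir, os.path.basename(path)), audio_16k, 16000)
```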

@wlsdbtjr wlsdbtjr closed this as completed Apr 9, 2024
@wlsdbtjr wlsdbtjr reopened this Apr 9, 2024
@cantabile-kwok
Member

Oh, this is because changing the sampling rate also changes the frame shift and frame length (in samples) proportionally. Sorry, I forgot about that earlier. The provided durations correspond to the current setting of sampling rate, frame shift and frame length, so they cannot be directly used with a different configuration.

Glad to hear that you solved the problem by downsampling. If you have any other problems, feel free to open another issue.
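
To illustrate with rough numbers why the durations do not transfer across sampling rates (the 256-sample hop here is assumed purely for illustration and may not match the repo's actual setting):

```python
# For the same utterance, keeping the hop length in samples fixed while
# changing the sampling rate changes the number of frames, so precomputed
# per-phoneme durations no longer add up to the mel length.
def n_frames(duration_sec: float, sample_rate: int, hop_samples: int) -> int:
    return int(duration_sec * sample_rate / hop_samples)

print(n_frames(2.0, 16000, 256))   # 125 frames: 2 s at 16 kHz, 256-sample hop (16 ms)
print(n_frames(2.0, 22050, 256))   # 172 frames: 2 s at 22.05 kHz, same 256-sample hop
```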

@kelvinqin

kelvinqin commented Apr 29, 2024

Thank you for your interesting and valuable research.

In my experiment, I also did the downsampling (22050 -> 16000).

Then I ran:
bash extract_fbank.sh --stage 0 --stop_stage 2 --nj 4

and then:
python train.py -c configs/lj_16k_gt_dur.yaml -m lj_16k_gt_dur

But then I got the following complaint:
AssertionError: Frame length mismatch: utt LJ012-0035, dur: 443, mel: 447

The only solution I found was to skip line 187 of data_loader.py; I am not sure whether this is fine? Thanks!

@cantabile-kwok
Member

@kelvinqin I believe your process is correct. This mismatch is also a common thing to see in my other experiments. Since the difference is only 4 frames (i.e. 64 ms in this setting), we can tolerate it: the durations and mel-spectrograms come from different programs, and their framing algorithms might differ slightly. In this case, a common approach is to truncate the mel-spectrogram to the total duration. You can add a tolerance threshold: check whether the mel length <= duration sum + tolerance, and if so, just discard the last few frames of the mel.

But just skipping that line may still be unsafe, because in training the upsampled text conditions still need to match the length of the mel sequence. So adding the truncation described above would be better.
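
As a rough sketch of that truncation with a tolerance check (the function name, variable names, and tolerance value are assumptions, not the repo's actual code in data_loader.py):

```python
import numpy as np

def align_mel_to_duration(mel: np.ndarray, durations: np.ndarray, tolerance: int = 4):
    """mel: (n_mels, T) mel-spectrogram; durations: per-phoneme frame counts."""
    dur_sum = int(durations.sum())
    mel_len = mel.shape[1]
    # Refuse to patch up mismatches larger than the tolerance.
    assert abs(mel_len - dur_sum) <= tolerance, (
        f"Frame length mismatch too large: dur {dur_sum}, mel {mel_len}")
    if mel_len > dur_sum:
        mel = mel[:, :dur_sum]              # discard the trailing mel frames
    elif mel_len < dur_sum:
        durations = durations.copy()
        durations[-1] -= dur_sum - mel_len  # shrink the last phoneme instead
    return mel, durations
```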

@kelvinqin

@cantabile-kwok Thanks so much for your suggestion; I will follow it in my experiments. :-)
Kelvin

@NathanWalt

Thank you for your explanation of the causes of these problems, which I encountered myself.
I'm trying to train and test your model on 22.05 kHz data for comparison with other models, so I'm afraid the mismatch could affect the model's performance. Is there a neat solution to the mismatch problem, such as adjusting the parameters of MFA?

@cantabile-kwok
Member

cantabile-kwok commented Jul 18, 2024

@NathanWalt Yes, a neat solution is to adjust the parameters in MFA alignment extraction. The whole workflow should be:

1. Determine the audio processing parameters (in your case, a 22050 Hz sampling rate, a certain number of frame-shift points, the window length, and fmax/fmin for mel extraction).
2. Use these parameters to extract the audio features, train MFA, and obtain the corresponding alignments with respect to that frame shift.
3. Have a vocoder that matches this set of features.
4. Train the TTS acoustic model.

Usually, for 22050 Hz speech data, I remember there are some publicly available HiFi-GAN checkpoints that use a 256-sample frame shift and a 1024-sample window length. If you have a vocoder ready, you can use the corresponding parameters for MFA.
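
For instance, one consistent parameter set for 22.05 kHz data might look like the sketch below (the n_mels/fmin/fmax values are assumptions; the key point is that MFA, feature extraction, the vocoder, and the acoustic model all share a single set):

```python
# Hypothetical feature configuration for 22.05 kHz data, matching the
# commonly used 256-hop / 1024-window HiFi-GAN setup mentioned above.
FEATURE_CONFIG = {
    "sample_rate": 22050,
    "hop_length": 256,    # frame shift in samples (~11.6 ms)
    "win_length": 1024,   # window length in samples (~46.4 ms)
    "n_fft": 1024,
    "n_mels": 80,         # assumed
    "fmin": 0,            # assumed
    "fmax": 8000,         # assumed
}
```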

@NathanWalt

Thank you for your advice. I've set the parameters for extract_fbank.sh as you mentioned and used the english_us_arpa pretrained model for MFA (the process is similar to the one in https://gist.github.com/NTT123/12264d15afad861cb897f7a20a01762e, except that I use the transcript in the metadata.csv file and the original 22.05 kHz audio). However, there is still some weird mismatch: the total duration of the phonemes obtained from MFA is about 3 to 8 frames longer than the mel-spectrogram generated by extract_fbank.sh. I've adopted truncation for the moment. I wonder whether you've encountered such a problem before.

@cantabile-kwok
Member

@NathanWalt Hmm, I've experienced the length mismatch too, but never as large as 8 frames (in my case it is usually 2-3 frames). If your parameters are set correctly, then I guess truncation should still work in your case.

@NathanWalt

@cantabile-kwok Thanks for your patience and advice! I'll adopt truncation and see what happens after training the model.
