Error occurred during extracting mel-spectrogram #11

Open
wlsdbtjr opened this issue Apr 9, 2024 · 11 comments

@wlsdbtjr

wlsdbtjr commented Apr 9, 2024

Thank you for your interesting and valuable research.
I'm having trouble running the following command in the terminal:

bash extract_fbank.sh --stage 0 --stop_stage 2 --nj 16

The sampling rate of the original LJSpeech dataset is 22050 Hz, and an error seems to occur while downsampling it to 16 kHz.

This is the error message written in 'exp/make_fbank/ljspeech/train/make_fbank_train.*.log'.

`Traceback (most recent call last):
  File "path/to/VoiceFlow-TTS/utils/compute-fbank-feats.py", line 105, in <module>
    main()
  File "/path/to/VoiceFlow-TTS/utils/compute-fbank-feats.py", line 86, in main
    assert rate == args.fs
AssertionError
# Accounting: time=2 threads=1
# Ended (code 1) at Tue 09 Apr 2024 02:01:15 AM UTC, elapsed time 2 seconds`

Thank you.

@cantabile-kwok
Member

cantabile-kwok commented Apr 9, 2024

In this case, please change the following line in extract_fbank.sh to match your sampling rate (22050 Hz).

https://github.com/X-LANCE/VoiceFlow-TTS/blob/248c822fd34270b44d4664a68ce2f6a177980f27/extract_fbank.sh#L5C1-L5C48

@wlsdbtjr
Author

wlsdbtjr commented Apr 9, 2024

Thank you for your response. However, even after modifying the code as suggested, I encountered an issue where the duration and mel shape did not match during training.

In the end, the solution was simply to convert all the data to 16 kHz before training, as described in your paper. Thank you.
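
For reference, a minimal downsampling sketch (librosa/soundfile and the directory paths here are assumptions for illustration, not part of the repo) that converts the LJSpeech wavs to 16 kHz before running extract_fbank.sh:

```python
# Hypothetical pre-processing script: resample LJSpeech wavs from 22050 Hz to 16 kHz.
import glob
import os

import librosa
import soundfile as sf

src_dir = "LJSpeech-1.1/wavs"      # assumed input directory
dst_dir = "LJSpeech-1.1/wavs_16k"  # assumed output directory
os.makedirs(dst_dir, exist_ok=True)

for path in glob.glob(os.path.join(src_dir, "*.wav")):
    audio, sr = librosa.load(path, sr=None)  # keep the native 22050 Hz rate
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    sf.write(os.path.join(dst_dir, os.path.basename(path)), audio_16k, 16000)
```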

@wlsdbtjr wlsdbtjr closed this as completed Apr 9, 2024
@wlsdbtjr wlsdbtjr reopened this Apr 9, 2024
@cantabile-kwok
Member

Oh, this is because changing the sampling rate also changes the frame shift and frame length (in samples) proportionally. Sorry, I forgot about that earlier. The provided durations correspond to the current setting of sampling rate, frame shift and frame length, so they cannot be directly used with a different configuration.

Glad to hear that you solved the problem by downsampling. If you have any other problems, feel free to open another issue.
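
To illustrate with rough numbers why the durations do not transfer across sampling rates (the 256-sample hop here is assumed purely for illustration and may not match the repo's actual setting):

```python
# For the same utterance, keeping the hop length in samples fixed while
# changing the sampling rate changes the number of frames, so precomputed
# per-phoneme durations no longer add up to the mel length.
def n_frames(duration_sec: float, sample_rate: int, hop_samples: int) -> int:
    return int(duration_sec * sample_rate / hop_samples)

print(n_frames(2.0, 16000, 256))   # 125 frames: 2 s at 16 kHz, 256-sample hop (16 ms)
print(n_frames(2.0, 22050, 256))   # 172 frames: 2 s at 22.05 kHz, same 256-sample hop
```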

@kelvinqin

kelvinqin commented Apr 29, 2024

Thank you for your interesting and valuable research.

In my experiment, I also did the downsampling (22050 -> 16000).

Then I ran:
bash extract_fbank.sh --stage 0 --stop_stage 2 --nj 4

and then:
python train.py -c configs/lj_16k_gt_dur.yaml -m lj_16k_gt_dur

But then I got the following complaint:
AssertionError: Frame length mismatch: utt LJ012-0035, dur: 443, mel: 447

The only solution I found was to skip line 187 of data_loader.py; I am not sure whether this is fine? Thanks!

@cantabile-kwok
Member

@kelvinqin I believe your process is correct. This mismatch is also a common thing to see in my other experiments. Since the difference is only 4 frames (i.e. 64 ms in this setting), we can tolerate it: the durations and mel-spectrograms come from different programs, and their framing algorithms might differ slightly. In this case, a common approach is to truncate the mel-spectrogram to the total duration. You can add a tolerance threshold: check whether the mel length <= duration sum + tolerance, and if so, just discard the last few frames of the mel.

But just skipping that line may still be unsafe, because in training the upsampled text conditions still need to match the length of the mel sequence. So adding the truncation described above would be better.
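
As a rough sketch of that truncation with a tolerance check (the function name, variable names, and tolerance value are assumptions, not the repo's actual code in data_loader.py):

```python
import numpy as np

def align_mel_to_duration(mel: np.ndarray, durations: np.ndarray, tolerance: int = 4):
    """mel: (n_mels, T) mel-spectrogram; durations: per-phoneme frame counts."""
    dur_sum = int(durations.sum())
    mel_len = mel.shape[1]
    # Refuse to patch up mismatches larger than the tolerance.
    assert abs(mel_len - dur_sum) <= tolerance, (
        f"Frame length mismatch too large: dur {dur_sum}, mel {mel_len}")
    if mel_len > dur_sum:
        mel = mel[:, :dur_sum]              # discard the trailing mel frames
    elif mel_len < dur_sum:
        durations = durations.copy()
        durations[-1] -= dur_sum - mel_len  # shrink the last phoneme instead
    return mel, durations
```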

@kelvinqin

@cantabile-kwok Thanks so much for your suggestion; I will follow it in my experiments. :-)
Kelvin

@NathanWalt

Thank you for your explanation of the causes of these problems, which I encountered myself.
I'm trying to train and test your model on 22.05 kHz data for comparison with other models, so I'm afraid the mismatch could affect the model's performance. Is there a neat solution to the mismatch problem, such as adjusting the parameters of MFA?

@cantabile-kwok
Member

cantabile-kwok commented Jul 18, 2024

@NathanWalt Yes, a neat solution is to adjust the parameters in MFA alignment extraction. The whole workflow should be:

1. Determine the audio processing parameters (in your case, a 22050 Hz sampling rate, a certain number of frame-shift points, the window length, and fmax/fmin for mel extraction).
2. Use these parameters to extract the audio features, train MFA, and obtain the corresponding alignments with respect to that frame shift.
3. Have a vocoder that matches this set of features.
4. Train the TTS acoustic model.

Usually, for 22050 Hz speech data, I remember there are some publicly available HiFi-GAN checkpoints that use a 256-sample frame shift and a 1024-sample window length. If you have a vocoder ready, you can use the corresponding parameters for MFA.
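
For instance, one consistent parameter set for 22.05 kHz data might look like the sketch below (the n_mels/fmin/fmax values are assumptions; the key point is that MFA, feature extraction, the vocoder, and the acoustic model all share a single set):

```python
# Hypothetical feature configuration for 22.05 kHz data, matching the
# commonly used 256-hop / 1024-window HiFi-GAN setup mentioned above.
FEATURE_CONFIG = {
    "sample_rate": 22050,
    "hop_length": 256,    # frame shift in samples (~11.6 ms)
    "win_length": 1024,   # window length in samples (~46.4 ms)
    "n_fft": 1024,
    "n_mels": 80,         # assumed
    "fmin": 0,            # assumed
    "fmax": 8000,         # assumed
}
```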

@NathanWalt

Thank you for your advice. I've set the parameters for extract_fbank.sh as you mentioned and used the english_us_arpa pretrained model for MFA (the process is similar to the one in https://gist.github.com/NTT123/12264d15afad861cb897f7a20a01762e, except that I use the transcript in the metadata.csv file and the original 22.05 kHz audio). However, there is still some weird mismatch: the total duration of the phonemes obtained from MFA is about 3 to 8 frames longer than the mel-spectrogram generated by extract_fbank.sh. I've adopted truncation for the moment. I wonder whether you've encountered such a problem before.

@cantabile-kwok
Member

@NathanWalt Hmm, I've experienced the length mismatch too, but never as large as 8 frames (in my case it is usually 2-3 frames). If your parameters are set correctly, then I guess truncation should still work in your case.

@NathanWalt

@cantabile-kwok Thanks for your patience and advice! I'll adopt truncation and see what happens after training the model.
