Error occurred while extracting mel-spectrogram #11

Thank you for your interesting and valuable research.
I'm having trouble running the following command in the terminal:
bash extract_fbank.sh --stage 0 --stop_stage 2 --nj 16
The sampling rate of the original LJSpeech dataset is 22050 Hz, but an error seems to have occurred while downsampling it to 16 kHz. This is the error message written in 'exp/make_fbank/ljspeech/train/make_fbank_train.*.log'.
Thank you.

Comments
In this case, please change the following line in
Thank you for your response. However, even after modifying the code as suggested, I ran into an issue where the duration and mel shape did not match during training. The solution was simply to convert all the data to 16 kHz before training, as described in your paper. Thank you.
Oh, this is because a change in sampling rate leads to a proportional change in frame shift and frame length. Sorry I forgot about that earlier. The provided durations correspond to the current settings of sampling rate, frame shift, and frame length, so they cannot be directly used with different configurations. Glad to hear that you solved the problem by downsampling. If you have any other problems, feel free to open another issue.
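For reference, a minimal downsampling sketch along these lines, assuming librosa and soundfile are installed (the function name and paths are placeholders, not part of this repo):

```python
import librosa
import soundfile as sf

TARGET_SR = 16000  # the sampling rate the provided durations assume

def resample_to_16k(in_path: str, out_path: str) -> None:
    # librosa.load resamples to the requested rate while decoding
    audio, _ = librosa.load(in_path, sr=TARGET_SR)
    sf.write(out_path, audio, TARGET_SR)
```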
Thank you for your interesting and valuable research. In my experiment I also did the down-sampling (22050 -> 16000). Then I ran: And then ran: But then I got the following complaint: The only workaround I found is to skip line 187 of data_loader.py; I am not sure if this is fine? Thanks!
@kelvinqin I believe your process is correct. This mismatch is also a common thing to notice in my other experiments. Since the difference is only 4 frames (i.e. 64 ms in this setting), we can still tolerate it; the durations and mel-spectrograms come from different programs, and their framing algorithms might be slightly different. In this case, a common approach is to truncate the mel-spectrogram to the length given by the durations. You can add a tolerance threshold and check whether the mel length <= duration sum + tolerance; if so, just discard the last several frames of the mel. But simply skipping this line might still be unsafe, because in training the upsampled text conditions still need to match the length of the mel sequence, so adding the truncation step described above would be better.
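A minimal sketch of that truncation check (hypothetical names, assuming numpy arrays; not the actual code in this repo):

```python
import numpy as np

def truncate_mel_to_duration(mel: np.ndarray, durations: np.ndarray,
                             tolerance: int = 5) -> np.ndarray:
    """Trim trailing mel frames so the mel length matches the duration sum,
    tolerating at most `tolerance` extra frames."""
    target = int(durations.sum())  # total length implied by the alignment
    n_frames = mel.shape[0]        # mel is (n_frames, n_mels)
    if n_frames == target:
        return mel
    if target < n_frames <= target + tolerance:
        # Discard the last few frames so the upsampled text conditions
        # line up with the mel sequence during training.
        return mel[:target]
    raise ValueError(
        f"mel/duration mismatch too large: {n_frames} frames vs "
        f"duration sum {target} (tolerance={tolerance})"
    )
```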
@cantabile-kwok Thanks so much for your suggestion; I will follow that in my experiments :-)
Thank you for your explanation of the causes of these problems, which I encountered myself.
@NathanWalt Yes, a neat solution is to adjust the parameters in the MFA alignment extraction. The workflow of the whole thing should be this: Usually, for 22050 Hz speech data, I remember there are some publicly available HiFi-GAN checkpoints with a 256-sample frame shift and a 1024-sample window length. If you have a vocoder ready, you can use the corresponding parameters for MFA.
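For example, a rough conversion of those sample-based vocoder settings into the millisecond units that MFA-style configs usually take (the variable names and exact config keys are an assumption):

```python
SAMPLE_RATE = 22050  # Hz, original LJSpeech rate
FRAME_SHIFT = 256    # samples, HiFi-GAN hop size
WIN_LENGTH = 1024    # samples, HiFi-GAN window length

frame_shift_ms = 1000 * FRAME_SHIFT / SAMPLE_RATE  # ~11.61 ms
win_length_ms = 1000 * WIN_LENGTH / SAMPLE_RATE    # ~46.44 ms
print(f"frame_shift = {frame_shift_ms:.2f} ms, window = {win_length_ms:.2f} ms")
```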
Thank you for your advice. I've set the parameters for the |
@NathanWalt Hmm, I've experienced the length mismatch too, but never as large as 8 frames (in my case usually 2-3 frames). If your parameters are correctly set, then I guess truncation might still work in your case.
@cantabile-kwok Thanks for your patience and advice! I'll adopt truncation and see what happens after training the model. |