Replies: 3 comments (by nmstoker, dkreutz, and erogol)
>>> erogol
[March 9, 2020, 3:07pm]
So there are different TTS libraries out there, and I see they all use
different methods for spectrogram normalization in model training.
Right now in Mozilla TTS, what we do is the following:
`20 * np.log10(np.maximum(min_level, x))` (amp_to_db), then normalize
into [-4, 4] assuming a minimum of -100 dB.
github.com/mozilla/TTS/blob/master/utils/audio.py#L151
```python
def apply_preemphasis(self, x):
    if self.preemphasis == 0:
        raise RuntimeError(' [!] Preemphasis is set 0.0.')
    return scipy.signal.lfilter([1, -self.preemphasis], [1], x)

def apply_inv_preemphasis(self, x):
    if self.preemphasis == 0:
        raise RuntimeError(' [!] Preemphasis is set 0.0.')
    return scipy.signal.lfilter([1], [1, -self.preemphasis], x)

def spectrogram(self, y):
    if self.preemphasis != 0:
        D = self._stft(self.apply_preemphasis(y))
    else:
        D = self._stft(y)
    S = self._amp_to_db(np.abs(D)) - self.ref_level_db
    return self._normalize(S)

def melspectrogram(self, y):
    if self.preemphasis != 0:
        D = self._stft(self.apply_preemphasis(y))
```
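For reference, here is a minimal NumPy sketch of the amp_to_db plus [-4, 4] clipping step described above. The helper names and the exact scaling are my own assumptions for illustration, not the library's verbatim code:

```python
import numpy as np

# Hypothetical stand-ins for self._amp_to_db / self._normalize above,
# assuming a -100 dB floor and a [-4, 4] output range.
def amp_to_db(x, min_level_db=-100.0):
    min_level = 10.0 ** (min_level_db / 20.0)  # amplitude floor for -100 dB
    return 20.0 * np.log10(np.maximum(min_level, x))

def normalize(S, min_level_db=-100.0, max_norm=4.0):
    # map [min_level_db, 0] dB onto [-max_norm, max_norm], then clip
    S_norm = (S - min_level_db) / -min_level_db
    return np.clip(2.0 * max_norm * S_norm - max_norm, -max_norm, max_norm)
```

With these assumptions, -100 dB maps to -4, -50 dB to 0, and 0 dB to 4.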
First of all, does anyone see any problem here?
The only obvious issue above is that the preemphasis operation is hard to
invert when you do batch inference. There is no straightforward
implementation of it in CUDA, since the de-preemphasis operation has a
temporal dependency in itself. You can approximate it with RNN layers,
but that is slow. Hence, I guess it makes sense to drop preemphasis.
Keeping it also makes our model incompatible with the latest vocoder
models.
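To make that temporal dependency concrete, here is a small sketch (the coefficient value is an assumption): preemphasis is an FIR filter and trivially parallel, while its inverse is an IIR recursion where each output sample depends on the previous one.

```python
import numpy as np
from scipy import signal

coef = 0.97  # typical preemphasis coefficient (assumption)
x = np.random.randn(1000)

# Preemphasis: y[n] = x[n] - coef * x[n-1]  (FIR, parallelizable)
y = signal.lfilter([1.0, -coef], [1.0], x)

# De-preemphasis: x[n] = y[n] + coef * x[n-1]  (IIR recursion --
# the per-sample dependency is what makes a fast CUDA version awkward)
x_rec = signal.lfilter([1.0], [1.0, -coef], y)
```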
I see the NVIDIA Tacotron2 implementation does not use any normalization
except the amp_to_db operation:
github.com/NVIDIA/tacotron2/blob/master/audio_processing.py#L78
```python
angles = angles.astype(np.float32)
angles = torch.autograd.Variable(torch.from_numpy(angles))
signal = stft_fn.inverse(magnitudes, angles).squeeze(1)

for i in range(n_iters):
    _, angles = stft_fn.transform(signal)
    signal = stft_fn.inverse(magnitudes, angles).squeeze(1)
return signal

def dynamic_range_compression(x, C=1, clip_val=1e-5):
    '''
    PARAMS
    ------
    C: compression factor
    '''
    return torch.log(torch.clamp(x, min=clip_val) * C)

def dynamic_range_decompression(x, C=1):
    '''
```
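For intuition, a NumPy rendering of that compression/decompression pair (the repo's version uses torch; this mirror is my own): values below clip_val are floored, so the round trip is only exact above the floor.

```python
import numpy as np

def dynamic_range_compression(x, C=1.0, clip_val=1e-5):
    # log-compress magnitudes, flooring tiny values at clip_val
    return np.log(np.clip(x, clip_val, None) * C)

def dynamic_range_decompression(x, C=1.0):
    return np.exp(x) / C

mag = np.array([1e-7, 0.01, 1.0, 10.0])
rec = dynamic_range_decompression(dynamic_range_compression(mag))
# rec matches mag everywhere except the first entry, which was floored
```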
ESPnet, on the other hand, uses standardization with mean and variance.
It is good to compute normalization parameters from the target dataset
(as in image recognition) to make the normalization flexible across
different datasets. However, the downside is that every frequency level
receives the same level of consideration from the model. I don't think
that is the right thing to do, since in speech different frequency
levels signify different aspects of the signal. Another downside is
that in a multi-speaker model we need to compute mean-variance stats
separately per speaker, which is viable but an additional headache.
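A sketch of what that standardization route looks like, assuming mel spectrograms shaped (frames, n_mels); the helper names are hypothetical, not ESPnet's actual API:

```python
import numpy as np

def compute_stats(spectrograms):
    # accumulate per-frequency-bin mean/std over the whole dataset
    # (in a multi-speaker setup you would call this once per speaker)
    stacked = np.concatenate(spectrograms, axis=0)  # (total_frames, n_mels)
    return stacked.mean(axis=0), stacked.std(axis=0)

def standardize(S, mean, std, eps=1e-8):
    # every bin is scaled to zero mean / unit variance -- this is the
    # "equal consideration per frequency" property discussed above
    return (S - mean) / (std + eps)
```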
Also, I saw that using standardization enables better vocoder models,
especially the new GAN-based ones.
I also started experimenting with standardization and saw that training
seems more stable, but the Griffin-Lim (GL) based results sound worse.
So I guess the better option is to use standardization with a trained
vocoder, and our current normalization flow for GL.
That is all I know, and I am kind of confused here. Please let me know
if you have any take on this issue.
[This is an archived TTS discussion thread from discourse.mozilla.org/t/does-anyone-have-any-reasonable-intutition-about-the-normalization-method-used-for-the-spectrograms-in-tts-training]