Replies: 3 comments (by nmstoker, dkreutz, and erogol)
>>> erogol
[March 9, 2020, 3:07pm]
So there are different TTS libraries out there, and I see they all use
different methods for spectrogram normalization in model training.
Right now in Mozilla TTS, what we do is the following:
`20 * np.log10(np.maximum(min_level, x))` (amp_to_db), then normalize
into [-4, 4] assuming a minimum of -100 dB.
github.com/mozilla/TTS/blob/master/utils/audio.py#L151
```python
def apply_preemphasis(self, x):
    if self.preemphasis == 0:
        raise RuntimeError(' [!] Preemphasis is set 0.0.')
    return scipy.signal.lfilter([1, -self.preemphasis], [1], x)

def apply_inv_preemphasis(self, x):
    if self.preemphasis == 0:
        raise RuntimeError(' [!] Preemphasis is set 0.0.')
    return scipy.signal.lfilter([1], [1, -self.preemphasis], x)

def spectrogram(self, y):
    if self.preemphasis != 0:
        D = self._stft(self.apply_preemphasis(y))
    else:
        D = self._stft(y)
    S = self._amp_to_db(np.abs(D)) - self.ref_level_db
    return self._normalize(S)

def melspectrogram(self, y):
    if self.preemphasis != 0:
        D = self._stft(self.apply_preemphasis(y))
```
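For reference, here is a minimal NumPy sketch of the amp_to_db plus [-4, 4] clipping step described above. The helper names and the exact scaling are my own assumptions for illustration, not the library's verbatim code:

```python
import numpy as np

# Hypothetical stand-ins for self._amp_to_db / self._normalize above,
# assuming a -100 dB floor and a [-4, 4] output range.
def amp_to_db(x, min_level_db=-100.0):
    min_level = 10.0 ** (min_level_db / 20.0)  # amplitude floor for -100 dB
    return 20.0 * np.log10(np.maximum(min_level, x))

def normalize(S, min_level_db=-100.0, max_norm=4.0):
    # map [min_level_db, 0] dB onto [-max_norm, max_norm], then clip
    S_norm = (S - min_level_db) / -min_level_db
    return np.clip(2.0 * max_norm * S_norm - max_norm, -max_norm, max_norm)
```

With these assumptions, -100 dB maps to -4, -50 dB to 0, and 0 dB to 4.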
First of all, does anyone see any problem here?
The only obvious issue above is that the preemphasis operation is hard to
invert when you do batch inference. There is no straightforward
implementation of it in CUDA, since the de-preemphasis operation has a
temporal dependency in itself. You can approximate it with RNN layers,
but that is slow. Hence, I guess it makes sense to drop preemphasis.
Keeping it also makes our model incompatible with the latest vocoder
models.
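To make that temporal dependency concrete, here is a small sketch (the coefficient value is an assumption): preemphasis is an FIR filter and trivially parallel, while its inverse is an IIR recursion where each output sample depends on the previous one.

```python
import numpy as np
from scipy import signal

coef = 0.97  # typical preemphasis coefficient (assumption)
x = np.random.randn(1000)

# Preemphasis: y[n] = x[n] - coef * x[n-1]  (FIR, parallelizable)
y = signal.lfilter([1.0, -coef], [1.0], x)

# De-preemphasis: x[n] = y[n] + coef * x[n-1]  (IIR recursion --
# the per-sample dependency is what makes a fast CUDA version awkward)
x_rec = signal.lfilter([1.0], [1.0, -coef], y)
```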
I see the NVIDIA Tacotron2 implementation does not use any normalization
except the amp_to_db operation:
github.com/NVIDIA/tacotron2/blob/master/audio_processing.py#L78
```python
angles = angles.astype(np.float32)
angles = torch.autograd.Variable(torch.from_numpy(angles))
signal = stft_fn.inverse(magnitudes, angles).squeeze(1)

for i in range(n_iters):
    _, angles = stft_fn.transform(signal)
    signal = stft_fn.inverse(magnitudes, angles).squeeze(1)
return signal

def dynamic_range_compression(x, C=1, clip_val=1e-5):
    '''
    PARAMS
    ------
    C: compression factor
    '''
    return torch.log(torch.clamp(x, min=clip_val) * C)

def dynamic_range_decompression(x, C=1):
    '''
```
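For intuition, a NumPy rendering of that compression/decompression pair (the repo's version uses torch; this mirror is my own): values below clip_val are floored, so the round trip is only exact above the floor.

```python
import numpy as np

def dynamic_range_compression(x, C=1.0, clip_val=1e-5):
    # log-compress magnitudes, flooring tiny values at clip_val
    return np.log(np.clip(x, clip_val, None) * C)

def dynamic_range_decompression(x, C=1.0):
    return np.exp(x) / C

mag = np.array([1e-7, 0.01, 1.0, 10.0])
rec = dynamic_range_decompression(dynamic_range_compression(mag))
# rec matches mag everywhere except the first entry, which was floored
```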
ESPnet, on the other hand, uses standardization with mean and variance.
It is good to compute normalization parameters from the target dataset
(as in image recognition) to make the normalization flexible across
different datasets. However, the downside is that every frequency level
receives the same level of consideration from the model. I don't think
that is the right thing to do, since in speech different frequency
levels signify different aspects of the signal. Another downside is
that in a multi-speaker model we need to compute mean-variance stats
separately per speaker, which is viable but an additional headache.
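A sketch of what that standardization route looks like, assuming mel spectrograms shaped (frames, n_mels); the helper names are hypothetical, not ESPnet's actual API:

```python
import numpy as np

def compute_stats(spectrograms):
    # accumulate per-frequency-bin mean/std over the whole dataset
    # (in a multi-speaker setup you would call this once per speaker)
    stacked = np.concatenate(spectrograms, axis=0)  # (total_frames, n_mels)
    return stacked.mean(axis=0), stacked.std(axis=0)

def standardize(S, mean, std, eps=1e-8):
    # every bin is scaled to zero mean / unit variance -- this is the
    # "equal consideration per frequency" property discussed above
    return (S - mean) / (std + eps)
```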
Also, I saw that using standardization enables better vocoder models,
especially the new GAN-based ones.
I also started experimenting with standardization and saw that training
seems more stable, but the Griffin-Lim (GL) based results sound worse.
So I guess the better option is to use standardization with a trained
vocoder, and our current normalization flow for GL.
That is all I know, and I am kind of confused here. Please let me know
if you have any take on this issue.
[This is an archived TTS discussion thread from discourse.mozilla.org/t/does-anyone-have-any-reasonable-intutition-about-the-normalization-method-used-for-the-spectrograms-in-tts-training]