About the prior loss and MAS algorithm #18
Comments
I think this loss is good for:
@cantabile-kwok
So, in contrast with Glow-TTS, where the analogue of our encoder loss L_enc has a clear probabilistic interpretation (it is one of the terms used to calculate the log-likelihood optimized during training), in Grad-TTS the encoder should just output mu having the two properties mentioned above. You can consider the encoder output to be a Gaussian distribution (leading to a weighted L_2 loss between mu and y), or you can optimize any other distance between mu and y, and it may also work well. This is one of the differences between Glow-TTS and Grad-TTS: in our model the choice of encoder loss L_enc does not affect the diffusion loss L_diff (they are sort of "independent"), while in Glow-TTS there is a single NLL loss, and the analogue of our encoder loss is one of its terms with a clear probabilistic interpretation (i.e. the log of the prior).
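To make the "Gaussian output leads to an L_2 loss" remark concrete, here is a minimal sketch in plain Python. It assumes a unit-variance Gaussian around mu, and the function names `gaussian_nll_unit_var` and `half_squared_error` are hypothetical, not taken from the repository. It only illustrates that the two losses differ by a constant, so they give the same gradient with respect to mu.

```python
import math

# Constant term of the unit-variance Gaussian log-density, per element.
HALF_LOG_2PI = 0.5 * math.log(2.0 * math.pi)

def gaussian_nll_unit_var(mu, y):
    """Mean negative log-likelihood of y under N(mu, I), averaged per element."""
    return sum(0.5 * (yi - mi) ** 2 + HALF_LOG_2PI for mi, yi in zip(mu, y)) / len(y)

def half_squared_error(mu, y):
    """Plain L_2-style distance between mu and y, averaged per element."""
    return sum(0.5 * (yi - mi) ** 2 for mi, yi in zip(mu, y)) / len(y)

# Toy values standing in for encoder outputs (mu) and mel features (y).
mu = [0.0, 1.0, -0.5]
y = [0.2, 0.7, -1.0]

# The gap does not depend on mu, so minimizing either loss moves mu the same way.
gap = gaussian_nll_unit_var(mu, y) - half_squared_error(mu, y)
```

Under this assumption the probabilistic reading is optional: dropping the constant leaves an ordinary distance between mu and y.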
Great work! I've been studying the paper and the code recently, and there is something that confuses me.
In my understanding, the encoder outputs Gaussian distributions with a different `mu` for each phoneme, and the DPM decoder recovers the mel-spec `y` from these Gaussians. Hence `y` is not Gaussian anymore. But I gather from Eq. (14) and the code that when you calculate the prior loss, you are actually calculating the log-likelihood of `y` under the Gaussian distribution with mean `mu`. Also, when applying MAS for duration modeling, you perform the same kind of likelihood computation to get the soft alignment (denoted `log_prior` in the code). So I wonder why this is reasonable. I also compared with the Glow-TTS code. They use `z` to evaluate the Gaussian likelihood with mean `mu`, where `z` is the latent variable obtained from the mel-spec by the normalizing flow. That seems more reasonable to me for now, as `z` is Gaussian by itself.
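For readers following the question, the likelihood computation and the alignment search it feeds can be sketched as below. This is a simplified illustration in plain Python, not the repository's implementation: it assumes unit-variance Gaussians, uses nested lists instead of tensors, and the helper names `log_prior_matrix` and `monotonic_alignment` are hypothetical. It also assumes there are at least as many frames as tokens.

```python
import math

NEG_INF = float("-inf")

def log_prior_matrix(mu, y):
    """log N(y_j; mu_i, I) for every (token i, frame j) pair."""
    const = -0.5 * math.log(2.0 * math.pi)
    return [[sum(const - 0.5 * (yf - mf) ** 2 for mf, yf in zip(m, frame))
             for frame in y] for m in mu]

def monotonic_alignment(lp):
    """Most likely monotonic, non-skipping assignment of frames to tokens."""
    n_tok, n_frm = len(lp), len(lp[0])
    # q[i][j]: best total log-likelihood of a path ending at token i, frame j.
    q = [[NEG_INF] * n_frm for _ in range(n_tok)]
    q[0][0] = lp[0][0]
    for j in range(1, n_frm):
        for i in range(min(j + 1, n_tok)):
            stay = q[i][j - 1]                            # same token as previous frame
            move = q[i - 1][j - 1] if i > 0 else NEG_INF  # advance by one token
            q[i][j] = lp[i][j] + max(stay, move)
    # Backtrack from the last token at the last frame.
    path = [0] * n_frm
    i = n_tok - 1
    for j in range(n_frm - 1, -1, -1):
        path[j] = i
        if i > 0 and (i == j or q[i - 1][j - 1] > q[i][j - 1]):
            i -= 1
    return path

# Two tokens with 1-D means, four frames of 1-D "mel" features (toy values).
mu = [[0.0], [1.0]]
y = [[0.1], [0.0], [0.9], [1.1]]
alignment = monotonic_alignment(log_prior_matrix(mu, y))
durations = [alignment.count(i) for i in range(len(mu))]
```

In this toy case the first two frames are assigned to the first token and the last two to the second, so each token gets a duration of two frames.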