I'm a researcher working on building a TTS model using diffusion, and while looking for an implementation I found this repo.
According to my understanding of the paper, both processes in the decoder's diffusion model, forward and backward diffusion, are supposed to take place on the latent-space vector z (which is provided by the UNet encoder). However, the repo's implementation seems to differ from this understanding.
Could you explain the reasoning behind this?
Usually, the term "latent" in the context of diffusion modeling denotes the space where forward and reverse diffusion are defined, i.e. if the clean image/spectrogram is x_0, then its noisy versions x_t can be called "latents". The paper you mentioned uses the term "latent" in this sense. In Grad-TTS, the score-matching network is parameterized with a UNet, but its encoder does not provide "latents" in this sense. So diffusion in Grad-TTS does not take place in the space of the UNet encoder's outputs; rather, the UNet as a whole (encoder + decoder) maps the noisy object x_t in the "latent" space to the score function at x_t.
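To illustrate the distinction, here is a minimal sketch (not the actual Grad-TTS code): the names `ScoreUNet`, the toy noise schedule, and the tensor shapes are all hypothetical. The point is only that forward diffusion is applied directly to the spectrogram x_0 to get x_t, and the whole UNet then consumes x_t to estimate the score; the encoder's outputs are just intermediate activations, not the space where diffusion is defined.

```python
import torch

# Hypothetical stand-in for the Grad-TTS score network (the whole UNet,
# encoder + decoder); names and layers here are illustrative, not the repo's API.
class ScoreUNet(torch.nn.Module):
    def __init__(self, n_feats=80):
        super().__init__()
        # A single conv keeps this sketch runnable; the real model is a full UNet.
        self.net = torch.nn.Conv1d(n_feats, n_feats, kernel_size=3, padding=1)

    def forward(self, x_t, t):
        # Maps the noisy spectrogram x_t (the "latent") to an estimate of the
        # score function at x_t; in the real model, t conditions the network.
        return self.net(x_t)


x_0 = torch.randn(1, 80, 100)   # clean mel-spectrogram: (batch, n_feats, frames)
t = torch.rand(1)               # diffusion time in (0, 1)
noise = torch.randn_like(x_0)

# Forward diffusion acts directly on x_0, producing the noisy "latent" x_t.
# alpha_t is a placeholder schedule, not the one from the paper.
alpha_t = torch.exp(-0.5 * t)
x_t = alpha_t * x_0 + torch.sqrt(1.0 - alpha_t ** 2) * noise

# The entire UNet consumes x_t; diffusion is never run on the encoder's outputs.
score_estimate = ScoreUNet()(x_t, t)
```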