I'm a researcher working on building a TTS model using diffusion, and while looking for an implementation I found this repo.
According to my understanding of the paper, both processes in the decoder's diffusion model, forward and backward diffusion, are supposed to take place on the latent-space vector z (which is provided by the UNet encoder). However, the repo's implementation seems to differ from this understanding.
Could you explain the reasoning behind this?
Usually, the term "latent" in the context of diffusion modeling denotes the space where forward and reverse diffusion are defined, i.e. if the clean image/spectrogram is x_0, then its noisy versions x_t can be called "latents". The paper you mentioned uses the term "latent" in this sense. In Grad-TTS, the score-matching network is parameterized with a UNet, but its encoder does not provide "latents" in this sense. So diffusion in Grad-TTS does not take place in the space of the UNet encoder's outputs; rather, the UNet as a whole (encoder + decoder) maps the noisy object x_t in the "latent" space to the score function at x_t.
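To illustrate the distinction, here is a minimal sketch (not the actual Grad-TTS code): the names `ScoreUNet`, the toy noise schedule, and the tensor shapes are all hypothetical. The point is only that forward diffusion is applied directly to the spectrogram x_0 to get x_t, and the whole UNet then consumes x_t to estimate the score; the encoder's outputs are just intermediate activations, not the space where diffusion is defined.

```python
import torch

# Hypothetical stand-in for the Grad-TTS score network (the whole UNet,
# encoder + decoder); names and layers here are illustrative, not the repo's API.
class ScoreUNet(torch.nn.Module):
    def __init__(self, n_feats=80):
        super().__init__()
        # A single conv keeps this sketch runnable; the real model is a full UNet.
        self.net = torch.nn.Conv1d(n_feats, n_feats, kernel_size=3, padding=1)

    def forward(self, x_t, t):
        # Maps the noisy spectrogram x_t (the "latent") to an estimate of the
        # score function at x_t; in the real model, t conditions the network.
        return self.net(x_t)


x_0 = torch.randn(1, 80, 100)   # clean mel-spectrogram: (batch, n_feats, frames)
t = torch.rand(1)               # diffusion time in (0, 1)
noise = torch.randn_like(x_0)

# Forward diffusion acts directly on x_0, producing the noisy "latent" x_t.
# alpha_t is a placeholder schedule, not the one from the paper.
alpha_t = torch.exp(-0.5 * t)
x_t = alpha_t * x_0 + torch.sqrt(1.0 - alpha_t ** 2) * noise

# The entire UNet consumes x_t; diffusion is never run on the encoder's outputs.
score_estimate = ScoreUNet()(x_t, t)
```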