Questions on some details in the paper #216

Open
thanhtvt opened this issue Dec 14, 2024 · 2 comments

@thanhtvt

Congratulations on your amazing work! This open-source project is truly a significant contribution to the community. I have a few questions about certain aspects of the paper and would greatly appreciate any clarification from you or anyone else who might have answers:

  1. Equation 5: Since x_1 represents the full-resolution latent and x_0 is the lowest-resolution latent, could you explain the rationale behind applying a downsampling function to x_0? This aspect is a bit unclear to me.
  2. Equation 6: Why do you downsample x_{s_k} by 2^{k+1} and then upsample it afterward? Would there be a specific reason for not directly downsampling x_{s_k} by 2^k instead?
  3. Selection of s_k and e_k: Could you elaborate on how s_k and e_k are chosen? I read your ICLR 2025 rebuttal regarding the time windows, but I'm still unclear about the normalized timestep. Specifically, if your framework comprises four stages, could you specify the range for each time window?

I appreciate any support from you all!!

@feifeiobama
Collaborator

  1. $x_0$ is full-resolution noise (with the same resolution as $x_1$); we apply downsampling to it to obtain the low-resolution noise.
  2. This is to align with inference, which first runs at the lower-resolution pyramid stage and then performs some form of upsampling, resulting in a pixelated latent.
  3. Great question! For $K=4$, we specify the time windows as $[0, \frac{1}{4}], [\frac{1}{7}, \frac{1}{2}], [\frac{1}{3}, \frac{3}{4}], [\frac{3}{5}, 1]$
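
For readers who want to check the time windows from point 3 numerically, here is a minimal sketch using exact fractions. The window values are copied from the reply above; the names `WINDOWS`, `stages_covering`, and `overlaps` are hypothetical helpers introduced only for this illustration and are not part of the project's code. Note that consecutive windows overlap, so they do not partition $[0, 1]$.

```python
from fractions import Fraction as F

# Time windows [s_k, e_k] for K = 4, copied from the reply above.
# The list order follows the reply; which pyramid resolution each
# index corresponds to is not stated there, so none is assumed here.
WINDOWS = [
    (F(0), F(1, 4)),
    (F(1, 7), F(1, 2)),
    (F(1, 3), F(3, 4)),
    (F(3, 5), F(1)),
]


def stages_covering(t):
    """Return the indices of every window whose closed interval contains t."""
    return [k for k, (s, e) in enumerate(WINDOWS) if s <= t <= e]


def overlaps():
    """Return the overlap length between each pair of consecutive windows."""
    return [WINDOWS[k][1] - WINDOWS[k + 1][0] for k in range(len(WINDOWS) - 1)]


if __name__ == "__main__":
    for k, (s, e) in enumerate(WINDOWS):
        print(f"window {k}: [{s}, {e}]  length {e - s}")
    print("overlaps:", [str(o) for o in overlaps()])
    print("t = 1/5 falls in windows", stages_covering(F(1, 5)))
```

A normalized timestep such as $t = \frac{1}{5}$ therefore lies inside two of the listed windows at once, which is consistent with each window being a per-stage range rather than a slice of one shared timeline.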

@thanhtvt
Author

Oh, right, thanks for your clarification. Figure 1b confused me into thinking that $x_0$ is originally the lowest-resolution noise, rather than the full-resolution one being downsampled.
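
To make the resolution bookkeeping in this thread concrete, here is a toy sketch of downsampling a full-resolution grid, and of why downsampling by $2^{k+1}$ and then upsampling by 2 is not the same as downsampling by $2^k$ directly. It assumes average pooling for downsampling and nearest-neighbour upsampling on a square grid whose side is divisible by the factor; the operators actually used in the paper may differ, and `downsample` and `upsample` are hypothetical names, not functions from the repository.

```python
def downsample(x, factor):
    """Average-pool a square 2D grid (list of lists) by an integer factor.

    Assumes the side length of x is divisible by factor.
    """
    n = len(x) // factor
    return [
        [
            sum(
                x[i * factor + di][j * factor + dj]
                for di in range(factor)
                for dj in range(factor)
            ) / factor ** 2
            for j in range(n)
        ]
        for i in range(n)
    ]


def upsample(x, factor=2):
    """Nearest-neighbour upsampling: repeat each value factor x factor times."""
    size = len(x) * factor
    return [[x[i // factor][j // factor] for j in range(size)] for i in range(size)]


if __name__ == "__main__":
    # Toy 8x8 stand-in for a full-resolution latent.
    full = [[float(8 * i + j) for j in range(8)] for i in range(8)]
    k = 1
    direct = downsample(full, 2 ** k)                      # 8x8 -> 4x4
    pixelated = upsample(downsample(full, 2 ** (k + 1)))   # 8x8 -> 2x2 -> 4x4
    print("same shape:", len(direct) == len(pixelated))
    print("same values:", direct == pixelated)
    print("top-left 2x2 of pixelated:", [row[:2] for row in pixelated[:2]])
```

Both results have the same shape, but the down-then-up version is constant within each 2x2 block, which matches the "pixelated latent" described in the reply.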
