About training Resolution #7

lxd941213 · 2024-06-18T07:42:37Z

Hi, great work! I have a question about the CV-VAE training details. Your paper says CV-VAE was trained at “9 × 256 × 256 and 17 × 192 × 192”. If it is trained at such a low resolution, will the quality be worse when running inference at 512 or 768 resolution? Looking forward to your reply, thank you!

ryancll · 2024-06-19T10:12:34Z

I've tested CV-VAE on high-resolution video data, and the reconstruction quality is not as good as a 2D VAE's, especially for high-frequency details such as small human faces. @sijeh Do you have any plans to release a high-resolution version? If not, can we directly fine-tune the model with high-resolution data? (Experiment results related to network capacity would be very instructive to the community.) Thank you!

Tord-Zhang · 2024-06-19T11:11:20Z

> I've tested CV-VAE on high-resolution video data, and the reconstruction quality is not as good as a 2D VAE's, especially for high-frequency details such as small human faces. @sijeh Do you have any plans to release a high-resolution version? If not, can we directly fine-tune the model with high-resolution data? (Experiment results related to network capacity would be very instructive to the community.) Thank you!

I have also tested CV-VAE and tried fine-tuning my UNet on it. While it keeps better temporal consistency, the detail is noticeably worse than with a 2D VAE.

sijeh · 2024-06-19T11:12:47Z

256x256 is sufficient for training a VAE; the SD2.1 VAE was also trained at this resolution. The VAE's loss of high-frequency information (such as fine textures and intense motion) is mainly due to the latent having only 4 channels (z=4). A 3D VAE has a higher compression ratio than a 2D VAE, so it loses more information. We are also currently training the SD3 version of CV-VAE. Since SD3's latent uses 16 channels, it improves significantly over the VAE with z=4 (with the same settings, 31.9 dB vs. 28.9 dB in PSNR and 0.928 vs. 0.885 in SSIM).
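To put the PSNR numbers above in perspective, here is a small sketch that converts a PSNR gap into a ratio of mean squared errors. It assumes the standard PSNR definition, 10 · log10(MAX² / MSE), with both models scored on the same pixel range; the function names are illustrative and not taken from the CV-VAE code.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error and peak value."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_from_psnr_gap(gap_db: float) -> float:
    """How many times smaller the MSE is for the model with the higher PSNR.

    Follows from the PSNR definition: a gap of g dB means MSE_low / MSE_high = 10^(g / 10).
    """
    return 10.0 ** (gap_db / 10.0)


# Figures quoted in the comment above: z=16 vs. z=4 under the same settings.
gap = 31.9 - 28.9
ratio = mse_ratio_from_psnr_gap(gap)
print(f"PSNR gap: {gap:.1f} dB, MSE ratio: {ratio:.2f}x")
```

Under that definition, a gap of about 3 dB corresponds to roughly half the mean squared reconstruction error.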

sijeh · 2024-06-19T11:23:54Z

> I've tested CV-VAE on high-resolution video data, and the reconstruction quality is not as good as a 2D VAE's, especially for high-frequency details such as small human faces. @sijeh Do you have any plans to release a high-resolution version? If not, can we directly fine-tune the model with high-resolution data? (Experiment results related to network capacity would be very instructive to the community.) Thank you!

Fine-tuning at high resolution does not solve this problem. We have already tried further fine-tuning at 320x320x17, but the reconstruction performance did not improve meaningfully. The reconstruction loss mainly comes from the z=4 latent used in SD2.1's VAE, and the 3D VAE has a 4x higher information compression ratio than the 2D VAE. Using a z=16 3D VAE gives a significant improvement.
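The compression argument above can be checked with a rough back-of-the-envelope sketch. It assumes 8x spatial downsampling for both VAEs and 4x temporal downsampling for the 3D VAE (factors not stated in this thread), RGB input, and it ignores any special handling of the first frame. The helper name is hypothetical.

```python
def compression_ratio(t_down: int, s_down: int,
                      in_channels: int = 3, latent_channels: int = 4) -> float:
    """Number of input values represented by one latent value.

    t_down: temporal downsampling factor (1 for a 2D, per-frame VAE)
    s_down: spatial downsampling factor along each of height and width
    """
    return (t_down * s_down * s_down * in_channels) / latent_channels


# Assumed factors: 8x spatial for both, 4x temporal for the 3D VAE.
ratio_2d_z4 = compression_ratio(t_down=1, s_down=8, latent_channels=4)
ratio_3d_z4 = compression_ratio(t_down=4, s_down=8, latent_channels=4)
ratio_3d_z16 = compression_ratio(t_down=4, s_down=8, latent_channels=16)

print(f"2D VAE, z=4:  {ratio_2d_z4:.0f}:1")
print(f"3D VAE, z=4:  {ratio_3d_z4:.0f}:1")
print(f"3D VAE, z=16: {ratio_3d_z16:.0f}:1")
```

Under these assumptions the z=4 3D VAE compresses 4x more than the 2D VAE, while moving to z=16 brings the 3D VAE back to the 2D VAE's ratio, which is consistent with the improvement reported for the 16-channel latent.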

ryancll · 2024-06-19T12:23:49Z

@sijeh Thank you! Very useful information!

radna0 · 2024-07-06T18:48:52Z

Is it possible to get access to the z=16 SD3 version of CV-VAE? @sijeh

lxd941213 changed the title ~~About training~~ About training Resolution Jun 18, 2024
