diff --git a/README.md b/README.md index 1d7a20b..1b9ea37 100644 --- a/README.md +++ b/README.md @@ -2,15 +2,15 @@ A curated list of latest research papers, projects and resources related to DiT/FLUX. Content is automatically updated daily. -> Last Update: 2025-01-08 06:29:06 +> Last Update: 2025-01-09 06:28:39 Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. ## Categories - [Image Editing](#image-editing) (18 papers) - Papers about image editing with Diffusion Transformer or FLUX -- [Image Generation](#image-generation) (96 papers) - Papers focusing on image generation with Diffusion Transformer or FLUX -- [Video Related](#video-related) (59 papers) - Papers about video generation and editing with Diffusion Transformer or FLUX +- [Image Generation](#image-generation) (97 papers) - Papers focusing on image generation with Diffusion Transformer or FLUX +- [Video Related](#video-related) (60 papers) - Papers about video generation and editing with Diffusion Transformer or FLUX @@ -33,112 +33,116 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[EliGen: Entity-Level Controlled Image Generation with Regional Attention](https://arxiv.org/abs/2501.01097v1)** (Published: 2025-01-02) Authors: Hong Zhang, Zhongjie Duan, Xingjun Wang, Yingda Chen, Yu Zhang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2501.01097v1.pdf) - Keywords: text-to-image, diffusion transformer, image inpainting, image generation, Control + Keywords: image generation, text-to-image, diffusion transformer, Control, image inpainting - **[Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG](https://arxiv.org/abs/2412.09614v1)** (Published: 2024-12-12) Authors: Kavana Venkatesh, Yusuf Dalva, Ismini Lourentzou, Pinar Yanardag Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.09614v1.pdf) - Keywords: text-to-image, FLUX, image editing, Control + Keywords: text-to-image, image editing, Control, FLUX - **[FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers](https://arxiv.org/abs/2412.09611v1)** (Published: 2024-12-12) Authors: Yusuf Dalva, Kavana Venkatesh, Pinar Yanardag Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.09611v1.pdf) - Keywords: image generation, Control, FLUX, image editing, rectified flow + Keywords: image generation, image editing, Control, FLUX, rectified flow - **[AMO Sampler: Enhancing Text Rendering with Overshooting](https://arxiv.org/abs/2411.19415v1)** (Published: 2024-11-28) Authors: Xixi Hu, Keyang Xu, Bo Liu, Qiang Liu, Hongliang Fei Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.19415v1.pdf) - Keywords: text-to-image, image generation, FLUX, Control, rectified flow + Keywords: image generation, text-to-image, Control, FLUX, rectified flow - **[Prediction with Action: Visual Policy Learning via Joint Denoising Process](https://arxiv.org/abs/2411.18179v1)** (Published: 2024-11-27) Authors: Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, Jianyu Chen Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.18179v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://sites.google.com/view/pad-paper) - Keywords: Control, image editing, diffusion transformer, image generation + Keywords: diffusion transformer, Control, image editing, 
image generation - **[HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads](https://arxiv.org/abs/2411.15034v1)** (Published: 2024-11-22) Authors: Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Xiaoyu Kong, Jintao Li, Oliver Deussen, Tong-Yee Lee Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.15034v1.pdf) - Keywords: image editing, diffusion transformer, image generation + Keywords: diffusion transformer, image editing, image generation - **[Stable Flow: Vital Layers for Training-Free Image Editing](https://arxiv.org/abs/2411.14430v1)** (Published: 2024-11-21) Authors: Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchinov, Kfir Aberman, Dani Lischinski, Daniel Cohen-Or Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.14430v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://omriavrahami.com/stable-flow) - Keywords: Control, image editing, inversion, diffusion transformer + Keywords: diffusion transformer, image editing, Control, inversion - **[Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method](https://arxiv.org/abs/2411.11135v1)** (Published: 2024-11-17) Authors: Yan Zheng, Zhenxiao Liang, Xiaoyan Cong, Lanqing guo, Yuehao Wang, Peihao Wang, Zhangyang Wang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.11135v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://yanyanzheng96.github.io/oscillation_inversion/}{this) - Keywords: text-to-image, FLUX, image editing, inversion, rectified flow + Keywords: rectified flow, text-to-image, image editing, FLUX, inversion - **[Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing](https://arxiv.org/abs/2411.08196v1)** (Published: 2024-11-12) Authors: Zitao Shuai, Chenwei Wu, Zhengxu Tang, Bowen Song, Liyue Shen Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.08196v1.pdf) - Keywords: Control, image editing, diffusion transformer, image generation + Keywords: diffusion transformer, Control, image editing, image generation - **[Taming Rectified Flow for Inversion and Editing](https://arxiv.org/abs/2411.04746v2)** (Published: 2024-11-07) Authors: Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, Ying Shan Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.04746v2.pdf) | [![GitHub](https://img.shields.io/github/stars/wangjiangshan0725/RF-Solver-Edit?style=social)](https://github.com/wangjiangshan0725/RF-Solver-Edit) - Keywords: diffusion transformer, FLUX, inversion, video editing, video generation, rectified flow + Keywords: rectified flow, video generation, diffusion transformer, video editing, FLUX, inversion - **[DiT4Edit: Diffusion Transformer for Image Editing](https://arxiv.org/abs/2411.03286v2)** (Published: 2024-11-05) Authors: Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, Zeyu Wang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.03286v2.pdf) - Keywords: diffusion transformer, image generation, Control, image editing, inversion + Keywords: image generation, diffusion transformer, image editing, Control, inversion - **[FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion 
Model](https://arxiv.org/abs/2410.13925v1)** (Published: 2024-10-17) Authors: ZiDong Wang, Zeyu Lu, Di Huang, Cai Zhou, Wanli Ouyang, and Lei Bai Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2410.13925v1.pdf) | [![GitHub](https://img.shields.io/github/stars/whlzy/FiT?style=social)](https://github.com/whlzy/FiT) - Keywords: rectified flow, diffusion transformer, image generation + Keywords: diffusion transformer, image generation, rectified flow - **[Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations](https://arxiv.org/abs/2410.10792v1)** (Published: 2024-10-14) Authors: Litu Rout, Yujia Chen, Nataniel Ruiz, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2410.10792v1.pdf) - Keywords: Control, FLUX, image editing, inversion, rectified flow + Keywords: rectified flow, image editing, Control, FLUX, inversion - **[Effective Diffusion Transformer Architecture for Image Super-Resolution](https://arxiv.org/abs/2409.19589v1)** (Published: 2024-09-29) Authors: Kun Cheng, Lei Yu, Zhijun Tu, Xiao He, Liyu Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao, Jie Hu Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2409.19589v1.pdf) - Keywords: image super-resolution, diffusion transformer, image generation + Keywords: diffusion transformer, image super-resolution, image generation - **[PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions](https://arxiv.org/abs/2409.15278v2)** (Published: 2024-09-23) Authors: Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Junlin Xie, Yu Qiao, Peng Gao, Hongsheng Li Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2409.15278v2.pdf) | [![GitHub](https://img.shields.io/github/stars/AFeng-x/PixWizard?style=social)](https://github.com/AFeng-x/PixWizard) - Keywords: text-to-image, Controllable, diffusion transformer, image generation, Control, image editing + Keywords: Controllable, image generation, text-to-image, diffusion transformer, image editing, Control - **[Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing](https://arxiv.org/abs/2408.13335v1)** (Published: 2024-08-23) Authors: Zitao Shuai, Chenwei Wu, Zhengxu Tang, Bowen Song, Liyue Shen Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2408.13335v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://anonymous.com/anonymous/EMS-Benchmark,) - Keywords: text-to-image, Control, image editing, diffusion transformer + Keywords: diffusion transformer, image editing, Control, text-to-image - **[Lazy Diffusion Transformer for Interactive Image Editing](https://arxiv.org/abs/2404.12382v1)** (Published: 2024-04-18) Authors: Yotam Nitzan, Zongze Wu, Richard Zhang, Eli Shechtman, Daniel Cohen-Or, Taesung Park, Michaël Gharbi Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2404.12382v1.pdf) - Keywords: image editing, diffusion transformer + Keywords: diffusion transformer, image editing - **[Lightning-Fast Image Inversion and Editing for Text-to-Image Diffusion Models](https://arxiv.org/abs/2312.12540v4)** (Published: 2023-12-19) Authors: Dvir Samuel, Barak Meiri, Haggai Maron, Yoad Tewel, Nir Darshan, Shai Avidan, Gal Chechik, Rami Ben-Ari Links: 
[![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2312.12540v4.pdf) - Keywords: text-to-image, FLUX, image editing, inversion + Keywords: text-to-image, image editing, FLUX, inversion ### Image Generation -*Showing the latest 50 out of 96 papers* +*Showing the latest 50 out of 97 papers* +- **[Circuit Complexity Bounds for Visual Autoregressive Model](https://arxiv.org/abs/2501.04299v1)** (Published: 2025-01-08) + Authors: Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song + Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2501.04299v1.pdf) + Keywords: diffusion transformer, image generation - **[GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking](https://arxiv.org/abs/2501.02690v1)** (Published: 2025-01-05) Authors: Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2501.02690v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://wkbian.github.io/Projects/GS-DiT/.) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[EliGen: Entity-Level Controlled Image Generation with Regional Attention](https://arxiv.org/abs/2501.01097v1)** (Published: 2025-01-02) Authors: Hong Zhang, Zhongjie Duan, Xingjun Wang, Yingda Chen, Yu Zhang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2501.01097v1.pdf) - Keywords: text-to-image, diffusion transformer, image inpainting, image generation, Control + Keywords: image generation, text-to-image, diffusion transformer, Control, image inpainting - **[Dual Diffusion for Unified Image Generation and Understanding](https://arxiv.org/abs/2501.00289v1)** (Published: 2024-12-31) Authors: Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, Peng Wang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2501.00289v1.pdf) - Keywords: text-to-image, diffusion transformer, image generation + Keywords: diffusion transformer, image generation, text-to-image - **[Open-Sora: Democratizing Efficient Video Production for All](https://arxiv.org/abs/2412.20404v1)** (Published: 2024-12-29) Authors: Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, Yang You Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.20404v1.pdf) | [![GitHub](https://img.shields.io/github/stars/hpcaitech/Open-Sora?style=social)](https://github.com/hpcaitech/Open-Sora) - Keywords: text-to-image, diffusion transformer, image generation, video generation + Keywords: diffusion transformer, image generation, text-to-image, video generation - **[UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation](https://arxiv.org/abs/2412.18928v1)** (Published: 2024-12-25) Authors: Lunhao Duan, Shanshan Zhao, Wenjun Yan, Yinglun Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Mingming Gong, Gui-Song Xia Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.18928v1.pdf) - Keywords: text-to-image, Controllable, diffusion transformer, image generation, Control + Keywords: Controllable, image generation, text-to-image, diffusion transformer, Control - **[1.58-bit FLUX](https://arxiv.org/abs/2412.18653v1)** 
(Published: 2024-12-24) Authors: Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, Liang-Chieh Chen Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.18653v1.pdf) - Keywords: text-to-image, FLUX, image generation + Keywords: text-to-image, image generation, FLUX - **[DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation](https://arxiv.org/abs/2412.18597v1)** (Published: 2024-12-24) Authors: Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.18597v1.pdf) - Keywords: Control, video editing, diffusion transformer, video generation + Keywords: diffusion transformer, video editing, Control, video generation - **[Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers](https://arxiv.org/abs/2412.16822v1)** (Published: 2024-12-22) Authors: Haoran You, Connelly Barnes, Yuqian Zhou, Yan Kang, Zhenbang Du, Wei Zhou, Lingzhi Zhang, Yotam Nitzan, Xiaoyang Liu, Zhe Lin, Eli Shechtman, Sohrab Amirghodsi, Yingyan Celine Lin Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.16822v1.pdf) - Keywords: text-to-image, diffusion transformer, image generation + Keywords: diffusion transformer, image generation, text-to-image - **[CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up](https://arxiv.org/abs/2412.16112v1)** (Published: 2024-12-20) Authors: Songhua Liu, Zhenxiong Tan, Xinchao Wang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.16112v1.pdf) | [![GitHub](https://img.shields.io/github/stars/Huage001/CLEAR?style=social)](https://github.com/Huage001/CLEAR) @@ -146,7 +150,7 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[Efficient Scaling of Diffusion Transformers for Text-to-Image Generation](https://arxiv.org/abs/2412.12391v1)** (Published: 2024-12-16) Authors: Hao Li, Shamit Lal, Zhiheng Li, Yusheng Xie, Ying Wang, Yang Zou, Orchid Majumder, R. Manmatha, Zhuowen Tu, Stefano Ermon, Stefano Soatto, Ashwin Swaminathan Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.12391v1.pdf) - Keywords: text-to-image, Control, diffusion transformer, image generation + Keywords: diffusion transformer, Control, image generation, text-to-image - **[Causal Diffusion Transformers for Generative Modeling](https://arxiv.org/abs/2412.12095v2)** (Published: 2024-12-16) Authors: Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, Haoqi Fan Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.12095v2.pdf) @@ -154,19 +158,19 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. 
- **[Video Diffusion Transformers are In-Context Learners](https://arxiv.org/abs/2412.10783v2)** (Published: 2024-12-14) Authors: Zhengcong Fei, Di Qiu, Changqian Yu, Debang Li, Mingyuan Fan, Xiang Wen Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.10783v2.pdf) | [![GitHub](https://img.shields.io/github/stars/feizc/Video-In-Context?style=social)](https://github.com/feizc/Video-In-Context) - Keywords: Control, Controllable, diffusion transformer, video generation + Keywords: diffusion transformer, Controllable, Control, video generation - **[MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion](https://arxiv.org/abs/2412.09828v1)** (Published: 2024-12-13) Authors: Xunnong Xu, Mengying Cao Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.09828v1.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG](https://arxiv.org/abs/2412.09614v1)** (Published: 2024-12-12) Authors: Kavana Venkatesh, Yusuf Dalva, Ismini Lourentzou, Pinar Yanardag Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.09614v1.pdf) - Keywords: text-to-image, FLUX, image editing, Control + Keywords: text-to-image, image editing, Control, FLUX - **[FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers](https://arxiv.org/abs/2412.09611v1)** (Published: 2024-12-12) Authors: Yusuf Dalva, Kavana Venkatesh, Pinar Yanardag Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.09611v1.pdf) - Keywords: image generation, Control, FLUX, image editing, rectified flow + Keywords: image generation, image editing, Control, FLUX, rectified flow - **[Multimodal Latent Language Modeling with Next-Token Diffusion](https://arxiv.org/abs/2412.08635v1)** (Published: 2024-12-11) Authors: Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.08635v1.pdf) @@ -174,51 +178,51 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. 
- **[FlexDiT: Dynamic Token Density Control for Diffusion Transformer](https://arxiv.org/abs/2412.06028v1)** (Published: 2024-12-08) Authors: Shuning Chang, Pichao Wang, Jiasheng Tang, Yi Yang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.06028v1.pdf) - Keywords: text-to-image, diffusion transformer, image generation, Control, video generation + Keywords: image generation, text-to-image, video generation, diffusion transformer, Control - **[MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation](https://arxiv.org/abs/2412.05848v1)** (Published: 2024-12-08) Authors: Shuwei Shi, Biao Gong, Xi Chen, Dandan Zheng, Shuai Tan, Zizheng Yang, Yuyuan Li, Jingwen He, Kecheng Zheng, Jingdong Chen, Ming Yang, Yinqiang Zheng Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.05848v1.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[Self-Guidance: Boosting Flow and Diffusion Generation on Their Own](https://arxiv.org/abs/2412.05827v1)** (Published: 2024-12-08) Authors: Tiancheng Li, Weijian Luo, Zhiyang Chen, Liyuan Ma, Guo-Jun Qi Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.05827v1.pdf) - Keywords: text-to-image, diffusion transformer, image generation, FLUX, video generation + Keywords: image generation, text-to-image, video generation, diffusion transformer, FLUX - **[Language-Guided Image Tokenization for Generation](https://arxiv.org/abs/2412.05796v1)** (Published: 2024-12-08) Authors: Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, Xiuye Gu Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.05796v1.pdf) - Keywords: text-to-image, diffusion transformer, image generation + Keywords: diffusion transformer, image generation, text-to-image - **[Mind the Time: Temporally-Controlled Multi-Event Video Generation](https://arxiv.org/abs/2412.05263v1)** (Published: 2024-12-06) Authors: Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, Sergey Tulyakov Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.05263v1.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation](https://arxiv.org/abs/2412.03859v1)** (Published: 2024-12-05) Authors: Hui Zhang, Dexiang Hong, Tingwei Gao, Yitong Wang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.03859v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://creatilayout.github.io.) 
- Keywords: Control, Controllable, diffusion transformer, image generation + Keywords: diffusion transformer, Controllable, Control, image generation - **[Navigation World Models](https://arxiv.org/abs/2412.03572v1)** (Published: 2024-12-04) Authors: Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.03572v1.pdf) - Keywords: Control, Controllable, diffusion transformer, video generation + Keywords: diffusion transformer, Controllable, Control, video generation - **[Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention](https://arxiv.org/abs/2412.03520v2)** (Published: 2024-12-04) Authors: Hannan Lu, Xiaohe Wu, Shudong Wang, Xiameng Qin, Xinyu Zhang, Junyu Han, Wangmeng Zuo, Ji Tao Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.03520v2.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://luhannan.github.io/CogDrivingPage/.) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[Panoptic Diffusion Models: co-generation of images and segmentation maps](https://arxiv.org/abs/2412.02929v1)** (Published: 2024-12-04) Authors: Yinghan Long, Kaushik Roy Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.02929v1.pdf) - Keywords: Control, diffusion transformer, image generation + Keywords: diffusion transformer, Control, image generation - **[Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis](https://arxiv.org/abs/2412.02168v2)** (Published: 2024-12-03) Authors: Yu Yuan, Xijun Wang, Yichen Sheng, Prateek Chennuri, Xingguang Zhang, Stanley Chan Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.02168v2.pdf) - Keywords: text-to-image, FLUX, image generation, Control + Keywords: text-to-image, Control, image generation, FLUX - **[World-consistent Video Diffusion with Explicit 3D Modeling](https://arxiv.org/abs/2412.01821v1)** (Published: 2024-12-02) Authors: Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.01821v1.pdf) - Keywords: Control, diffusion transformer, image generation, video generation + Keywords: diffusion transformer, Control, image generation, video generation - **[CPA: Camera-pose-awareness Diffusion Transformer for Video Generation](https://arxiv.org/abs/2412.01429v1)** (Published: 2024-12-02) Authors: Yuelei Wang, Jian Zhang, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.01429v1.pdf) - Keywords: Control, Controllable, diffusion transformer, video generation + Keywords: diffusion transformer, Controllable, Control, video generation - **[TinyFusion: Diffusion Transformers Learned Shallow](https://arxiv.org/abs/2412.01199v1)** (Published: 2024-12-02) Authors: Gongfan Fang, Kunjun Li, Xinyin Ma, Xinchao Wang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.01199v1.pdf) | [![GitHub](https://img.shields.io/github/stars/VainF/TinyFusion?style=social)](https://github.com/VainF/TinyFusion) @@ -226,19 +230,19 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. 
- **[AMO Sampler: Enhancing Text Rendering with Overshooting](https://arxiv.org/abs/2411.19415v1)** (Published: 2024-11-28) Authors: Xixi Hu, Keyang Xu, Bo Liu, Qiang Liu, Hongliang Fei Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.19415v1.pdf) - Keywords: text-to-image, image generation, FLUX, Control, rectified flow + Keywords: image generation, text-to-image, Control, FLUX, rectified flow - **[AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers](https://arxiv.org/abs/2411.18673v2)** (Published: 2024-11-27) Authors: Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.18673v2.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[Prediction with Action: Visual Policy Learning via Joint Denoising Process](https://arxiv.org/abs/2411.18179v1)** (Published: 2024-11-27) Authors: Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, Jianyu Chen Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.18179v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://sites.google.com/view/pad-paper) - Keywords: Control, image editing, diffusion transformer, image generation + Keywords: diffusion transformer, Control, image editing, image generation - **[Type-R: Automatically Retouching Typos for Text-to-Image Generation](https://arxiv.org/abs/2411.18159v1)** (Published: 2024-11-27) Authors: Wataru Shimoda, Naoto Inoue, Daichi Haraguchi, Hayato Mitani, Seichi Uchida, Kota Yamaguchi Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.18159v1.pdf) - Keywords: text-to-image, FLUX, image generation + Keywords: text-to-image, image generation, FLUX - **[Accelerating Vision Diffusion Transformers with Skip Branches](https://arxiv.org/abs/2411.17616v2)** (Published: 2024-11-26) Authors: Guanjie Chen, Xinyu Zhao, Yucheng Zhou, Tianlong Chen, Yu Cheng Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.17616v2.pdf) | [![GitHub](https://img.shields.io/github/stars/OpenSparseLLMs/Skip-DiT.git?style=social)](https://github.com/OpenSparseLLMs/Skip-DiT.git) @@ -246,23 +250,23 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. 
- **[Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440v2)** (Published: 2024-11-26) Authors: Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.17440v2.pdf) - Keywords: Control, Controllable, diffusion transformer, video generation + Keywords: diffusion transformer, Controllable, Control, video generation - **[OminiControl: Minimal and Universal Control for Diffusion Transformer](https://arxiv.org/abs/2411.15098v3)** (Published: 2024-11-22) Authors: Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, Xinchao Wang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.15098v3.pdf) - Keywords: Control, diffusion transformer + Keywords: diffusion transformer, Control - **[HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads](https://arxiv.org/abs/2411.15034v1)** (Published: 2024-11-22) Authors: Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Xiaoyu Kong, Jintao Li, Oliver Deussen, Tong-Yee Lee Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.15034v1.pdf) - Keywords: image editing, diffusion transformer, image generation + Keywords: diffusion transformer, image editing, image generation - **[Stable Flow: Vital Layers for Training-Free Image Editing](https://arxiv.org/abs/2411.14430v1)** (Published: 2024-11-21) Authors: Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchinov, Kfir Aberman, Dani Lischinski, Daniel Cohen-Or Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.14430v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://omriavrahami.com/stable-flow) - Keywords: Control, image editing, inversion, diffusion transformer + Keywords: diffusion transformer, image editing, Control, inversion - **[Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method](https://arxiv.org/abs/2411.11135v1)** (Published: 2024-11-17) Authors: Yan Zheng, Zhenxiao Liang, Xiaoyan Cong, Lanqing guo, Yuehao Wang, Peihao Wang, Zhangyang Wang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.11135v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://yanyanzheng96.github.io/oscillation_inversion/}{this) - Keywords: text-to-image, FLUX, image editing, inversion, rectified flow + Keywords: rectified flow, text-to-image, image editing, FLUX, inversion - **[SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers](https://arxiv.org/abs/2411.10510v1)** (Published: 2024-11-15) Authors: Joseph Liu, Joshua Geddes, Ziyu Guo, Haomiao Jiang, Mahesh Kumar Nandwana Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.10510v1.pdf) @@ -270,31 +274,31 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. 
- **[Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing](https://arxiv.org/abs/2411.08196v1)** (Published: 2024-11-12) Authors: Zitao Shuai, Chenwei Wu, Zhengxu Tang, Bowen Song, Liyue Shen Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.08196v1.pdf) - Keywords: Control, image editing, diffusion transformer, image generation + Keywords: diffusion transformer, Control, image editing, image generation - **[DiT4Edit: Diffusion Transformer for Image Editing](https://arxiv.org/abs/2411.03286v2)** (Published: 2024-11-05) Authors: Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, Zeyu Wang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.03286v2.pdf) - Keywords: diffusion transformer, image generation, Control, image editing, inversion + Keywords: image generation, diffusion transformer, image editing, Control, inversion - **[Adaptive Caching for Faster Video Generation with Diffusion Transformers](https://arxiv.org/abs/2411.02397v2)** (Published: 2024-11-04) Authors: Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S. Ryoo, Tian Xie Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.02397v2.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[Training-free Regional Prompting for Diffusion Transformers](https://arxiv.org/abs/2411.02395v1)** (Published: 2024-11-04) Authors: Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.02395v1.pdf) | [![GitHub](https://img.shields.io/github/stars/antonioo-c/Regional-Prompting-FLUX?style=social)](https://github.com/antonioo-c/Regional-Prompting-FLUX) - Keywords: text-to-image, FLUX, diffusion transformer, image generation + Keywords: diffusion transformer, text-to-image, image generation, FLUX - **[GameGen-X: Interactive Open-world Game Video Generation](https://arxiv.org/abs/2411.00769v3)** (Published: 2024-11-01) Authors: Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, Hao Chen Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.00769v3.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[In-Context LoRA for Diffusion Transformers](https://arxiv.org/abs/2410.23775v3)** (Published: 2024-10-31) Authors: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, Jingren Zhou Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2410.23775v3.pdf) | [![GitHub](https://img.shields.io/github/stars/ali-vilab/In-Context-LoRA?style=social)](https://github.com/ali-vilab/In-Context-LoRA) - Keywords: text-to-image, diffusion transformer, image generation + Keywords: diffusion transformer, image generation, text-to-image - **[Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models](https://arxiv.org/abs/2410.22775v1)** (Published: 2024-10-30) Authors: Arash Marioriyad, Parham Rezaei, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2410.22775v1.pdf) - Keywords: text-to-image, FLUX, 
image generation + Keywords: text-to-image, image generation, FLUX - **[Diff-Instruct*: Towards Human-Preferred One-step Text-to-image Generative Models](https://arxiv.org/abs/2410.20898v2)** (Published: 2024-10-28) Authors: Weijian Luo, Colin Zhang, Debing Zhang, Zhengyang Geng Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2410.20898v2.pdf) | [![GitHub](https://img.shields.io/github/stars/pkulwj1994/diff_instruct_star?style=social)](https://github.com/pkulwj1994/diff_instruct_star) @@ -302,16 +306,16 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation](https://arxiv.org/abs/2410.20474v2)** (Published: 2024-10-27) Authors: Phillip Y. Lee, Taehoon Yoon, Minhyuk Sung Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2410.20474v2.pdf) - Keywords: text-to-image, Control, diffusion transformer, image generation -- **[FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model](https://arxiv.org/abs/2410.13925v1)** (Published: 2024-10-17) - Authors: ZiDong Wang, Zeyu Lu, Di Huang, Cai Zhou, Wanli Ouyang, and Lei Bai - Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2410.13925v1.pdf) | [![GitHub](https://img.shields.io/github/stars/whlzy/FiT?style=social)](https://github.com/whlzy/FiT) - Keywords: rectified flow, diffusion transformer, image generation + Keywords: diffusion transformer, Control, image generation, text-to-image ### Video Related -*Showing the latest 50 out of 59 papers* +*Showing the latest 50 out of 60 papers* +- **[ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning](https://arxiv.org/abs/2501.04698v1)** (Published: 2025-01-08) + Authors: Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai + Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2501.04698v1.pdf) + Keywords: diffusion transformer, video generation - **[Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers](https://arxiv.org/abs/2501.03931v1)** (Published: 2025-01-07) Authors: Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, Jiaya Jia Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2501.03931v1.pdf) | [![GitHub](https://img.shields.io/github/stars/dvlab-research/MagicMirror?style=social)](https://github.com/dvlab-research/MagicMirror) @@ -323,11 +327,11 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking](https://arxiv.org/abs/2501.02690v1)** (Published: 2025-01-05) Authors: Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2501.02690v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://wkbian.github.io/Projects/GS-DiT/.) 
- Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[Open-Sora: Democratizing Efficient Video Production for All](https://arxiv.org/abs/2412.20404v1)** (Published: 2024-12-29) Authors: Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, Yang You Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.20404v1.pdf) | [![GitHub](https://img.shields.io/github/stars/hpcaitech/Open-Sora?style=social)](https://github.com/hpcaitech/Open-Sora) - Keywords: text-to-image, diffusion transformer, image generation, video generation + Keywords: diffusion transformer, image generation, text-to-image, video generation - **[Accelerating Diffusion Transformers with Dual Feature Caching](https://arxiv.org/abs/2412.18911v1)** (Published: 2024-12-25) Authors: Chang Zou, Evelyn Zhang, Runlin Guo, Haohang Xu, Conghui He, Xuming Hu, Linfeng Zhang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.18911v1.pdf) | [![GitHub](https://img.shields.io/github/stars/Shenyi-Z/DuCa?style=social)](https://github.com/Shenyi-Z/DuCa) @@ -335,7 +339,7 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation](https://arxiv.org/abs/2412.18597v1)** (Published: 2024-12-24) Authors: Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.18597v1.pdf) - Keywords: Control, video editing, diffusion transformer, video generation + Keywords: diffusion transformer, video editing, Control, video generation - **[FFA Sora, video generation as fundus fluorescein angiography simulator](https://arxiv.org/abs/2412.17346v1)** (Published: 2024-12-23) Authors: Xinyuan Wu, Lili Wang, Ruoyu Chen, Bowen Liu, Weiyi Zhang, Xi Yang, Yifan Feng, Mingguang He, Danli Shi Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.17346v1.pdf) @@ -343,7 +347,7 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[Video Diffusion Transformers are In-Context Learners](https://arxiv.org/abs/2412.10783v2)** (Published: 2024-12-14) Authors: Zhengcong Fei, Di Qiu, Changqian Yu, Debang Li, Mingyuan Fan, Xiang Wen Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.10783v2.pdf) | [![GitHub](https://img.shields.io/github/stars/feizc/Video-In-Context?style=social)](https://github.com/feizc/Video-In-Context) - Keywords: Control, Controllable, diffusion transformer, video generation + Keywords: diffusion transformer, Controllable, Control, video generation - **[LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity](https://arxiv.org/abs/2412.09856v1)** (Published: 2024-12-13) Authors: Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Niraj K. Jha, Xiaoliang Dai Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.09856v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://lineargen.github.io/.) 
@@ -351,7 +355,7 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion](https://arxiv.org/abs/2412.09828v1)** (Published: 2024-12-13) Authors: Xunnong Xu, Mengying Cao Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.09828v1.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[From Slow Bidirectional to Fast Autoregressive Video Diffusion Models](https://arxiv.org/abs/2412.07772v2)** (Published: 2024-12-10) Authors: Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, Xun Huang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.07772v2.pdf) @@ -367,27 +371,27 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[FlexDiT: Dynamic Token Density Control for Diffusion Transformer](https://arxiv.org/abs/2412.06028v1)** (Published: 2024-12-08) Authors: Shuning Chang, Pichao Wang, Jiasheng Tang, Yi Yang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.06028v1.pdf) - Keywords: text-to-image, diffusion transformer, image generation, Control, video generation + Keywords: image generation, text-to-image, video generation, diffusion transformer, Control - **[MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation](https://arxiv.org/abs/2412.05848v1)** (Published: 2024-12-08) Authors: Shuwei Shi, Biao Gong, Xi Chen, Dandan Zheng, Shuai Tan, Zizheng Yang, Yuyuan Li, Jingwen He, Kecheng Zheng, Jingdong Chen, Ming Yang, Yinqiang Zheng Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.05848v1.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[Self-Guidance: Boosting Flow and Diffusion Generation on Their Own](https://arxiv.org/abs/2412.05827v1)** (Published: 2024-12-08) Authors: Tiancheng Li, Weijian Luo, Zhiyang Chen, Liyuan Ma, Guo-Jun Qi Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.05827v1.pdf) - Keywords: text-to-image, diffusion transformer, image generation, FLUX, video generation + Keywords: image generation, text-to-image, video generation, diffusion transformer, FLUX - **[Mind the Time: Temporally-Controlled Multi-Event Video Generation](https://arxiv.org/abs/2412.05263v1)** (Published: 2024-12-06) Authors: Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, Sergey Tulyakov Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.05263v1.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[Navigation World Models](https://arxiv.org/abs/2412.03572v1)** (Published: 2024-12-04) Authors: Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.03572v1.pdf) - Keywords: Control, Controllable, diffusion transformer, video generation + Keywords: diffusion transformer, Controllable, Control, video generation - **[Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic 
Attention](https://arxiv.org/abs/2412.03520v2)** (Published: 2024-12-04) Authors: Hannan Lu, Xiaohe Wu, Shudong Wang, Xiameng Qin, Xinyu Zhang, Junyu Han, Wangmeng Zuo, Ji Tao Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.03520v2.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://luhannan.github.io/CogDrivingPage/.) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text](https://arxiv.org/abs/2412.15220v1)** (Published: 2024-12-03) Authors: Haohe Liu, Gael Le Lan, Xinhao Mei, Zhaoheng Ni, Anurag Kumar, Varun Nagaraja, Wenwu Wang, Mark D. Plumbley, Yangyang Shi, Vikas Chandra Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.15220v1.pdf) @@ -395,11 +399,11 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[World-consistent Video Diffusion with Explicit 3D Modeling](https://arxiv.org/abs/2412.01821v1)** (Published: 2024-12-02) Authors: Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.01821v1.pdf) - Keywords: Control, diffusion transformer, image generation, video generation + Keywords: diffusion transformer, Control, image generation, video generation - **[CPA: Camera-pose-awareness Diffusion Transformer for Video Generation](https://arxiv.org/abs/2412.01429v1)** (Published: 2024-12-02) Authors: Yuelei Wang, Jian Zhang, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.01429v1.pdf) - Keywords: Control, Controllable, diffusion transformer, video generation + Keywords: diffusion transformer, Controllable, Control, video generation - **[OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation](https://arxiv.org/abs/2412.00115v3)** (Published: 2024-11-28) Authors: Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, Siyu Zhu Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.00115v3.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://fudan-generative-vision.github.io/OpenHumanVid) @@ -407,7 +411,7 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers](https://arxiv.org/abs/2411.18673v2)** (Published: 2024-11-27) Authors: Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. 
Lindell, Sergey Tulyakov Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.18673v2.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[Accelerating Vision Diffusion Transformers with Skip Branches](https://arxiv.org/abs/2411.17616v2)** (Published: 2024-11-26) Authors: Guanjie Chen, Xinyu Zhao, Yucheng Zhou, Tianlong Chen, Yu Cheng Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.17616v2.pdf) | [![GitHub](https://img.shields.io/github/stars/OpenSparseLLMs/Skip-DiT.git?style=social)](https://github.com/OpenSparseLLMs/Skip-DiT.git) @@ -415,7 +419,7 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440v2)** (Published: 2024-11-26) Authors: Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.17440v2.pdf) - Keywords: Control, Controllable, diffusion transformer, video generation + Keywords: diffusion transformer, Controllable, Control, video generation - **[LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis](https://arxiv.org/abs/2411.16748v1)** (Published: 2024-11-24) Authors: Haojie Zhang, Zhihao Liang, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Chenxing Li, Jianhua Tao, Yaling Liang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.16748v1.pdf) @@ -431,15 +435,15 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[Taming Rectified Flow for Inversion and Editing](https://arxiv.org/abs/2411.04746v2)** (Published: 2024-11-07) Authors: Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, Ying Shan Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.04746v2.pdf) | [![GitHub](https://img.shields.io/github/stars/wangjiangshan0725/RF-Solver-Edit?style=social)](https://github.com/wangjiangshan0725/RF-Solver-Edit) - Keywords: diffusion transformer, FLUX, inversion, video editing, video generation, rectified flow + Keywords: rectified flow, video generation, diffusion transformer, video editing, FLUX, inversion - **[Adaptive Caching for Faster Video Generation with Diffusion Transformers](https://arxiv.org/abs/2411.02397v2)** (Published: 2024-11-04) Authors: Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S. 
Ryoo, Tian Xie Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.02397v2.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[GameGen-X: Interactive Open-world Game Video Generation](https://arxiv.org/abs/2411.00769v3)** (Published: 2024-11-01) Authors: Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, Hao Chen Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.00769v3.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation](https://arxiv.org/abs/2410.20502v1)** (Published: 2024-10-27) Authors: Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, Furu Wei Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2410.20502v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](http://aka.ms/arlon}.) @@ -447,11 +451,11 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[Boosting Camera Motion Control for Video Diffusion Transformers](https://arxiv.org/abs/2410.10802v1)** (Published: 2024-10-14) Authors: Soon Yau Cheong, Duygu Ceylan, Armin Mustafa, Andrew Gilbert, Chun-Hao Paul Huang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2410.10802v1.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[Scaling Laws For Diffusion Transformers](https://arxiv.org/abs/2410.08184v1)** (Published: 2024-10-10) Authors: Zhengyang Liang, Hao He, Ceyuan Yang, Bo Dai Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2410.08184v1.pdf) - Keywords: text-to-image, diffusion transformer, image generation, video generation + Keywords: diffusion transformer, image generation, text-to-image, video generation - **[Pyramidal Flow Matching for Efficient Video Generative Modeling](https://arxiv.org/abs/2410.05954v1)** (Published: 2024-10-08) Authors: Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, Zhouchen Lin Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2410.05954v1.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://pyramid-flow.github.io.) @@ -463,7 +467,7 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[LoVA: Long-form Video-to-Audio Generation](https://arxiv.org/abs/2409.15157v2)** (Published: 2024-09-23) Authors: Xin Cheng, Xihua Wang, Yihan Wu, Yuyue Wang, Ruihua Song Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2409.15157v2.pdf) - Keywords: video editing, diffusion transformer + Keywords: diffusion transformer, video editing - **[Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task](https://arxiv.org/abs/2409.04005v2)** (Published: 2024-09-06) Authors: Jing Wang, Ao Ma, Jiasong Feng, Dawei Leng, Yuhui Yin, Xiaodan Liang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2409.04005v2.pdf) | [![Project](https://img.shields.io/badge/-Project-blue)](https://360cvgroup.github.io/Qihoo-T2X/.) 
@@ -471,7 +475,7 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[DiVE: DiT-based Video Generation with Enhanced Control](https://arxiv.org/abs/2409.01595v1)** (Published: 2024-09-03) Authors: Junpeng Jiang, Gangyi Hong, Lijun Zhou, Enhui Ma, Hengtong Hu, Xia Zhou, Jie Xiang, Fan Liu, Kaicheng Yu, Haiyang Sun, Kun Zhan, Peng Jia, Miao Zhang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2409.01595v1.pdf) - Keywords: Control, Controllable, diffusion transformer, video generation + Keywords: diffusion transformer, Controllable, Control, video generation - **[VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers](https://arxiv.org/abs/2408.17131v1)** (Published: 2024-08-30) Authors: Juncan Deng, Shuaiting Li, Zeyu Wang, Hong Gu, Kedong Xu, Kejie Huang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2408.17131v1.pdf) @@ -487,11 +491,11 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[Tora: Trajectory-oriented Diffusion Transformer for Video Generation](https://arxiv.org/abs/2407.21705v3)** (Published: 2024-07-31) Authors: Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2407.21705v3.pdf) | [![GitHub](https://img.shields.io/github/stars/alibaba/Tora?style=social)](https://github.com/alibaba/Tora) - Keywords: Control, Controllable, diffusion transformer, video generation + Keywords: diffusion transformer, Controllable, Control, video generation - **[MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls](https://arxiv.org/abs/2407.21136v3)** (Published: 2024-07-30) Authors: Yuxuan Bian, Ailing Zeng, Xuan Ju, Xian Liu, Zhaoyang Zhang, Wei Liu, Qiang Xu Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2407.21136v3.pdf) - Keywords: Control, diffusion transformer, video generation + Keywords: diffusion transformer, Control, video generation - **[Diffusion Transformer Captures Spatial-Temporal Dependencies: A Theory for Gaussian Process Data](https://arxiv.org/abs/2407.16134v1)** (Published: 2024-07-23) Authors: Hengyu Fu, Zehao Dou, Jiawei Guo, Mengdi Wang, Minshuo Chen Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2407.16134v1.pdf) @@ -503,15 +507,11 @@ Thanks to [@longxiang-ai](https://github.com/longxiang-ai) for the template. - **[VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control](https://arxiv.org/abs/2407.12781v2)** (Published: 2024-07-17) Authors: Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David B. 
Lindell, Sergey Tulyakov Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2407.12781v2.pdf) - Keywords: Control, Controllable, diffusion transformer, video generation + Keywords: diffusion transformer, Controllable, Control, video generation - **[OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation](https://arxiv.org/abs/2407.02371v2)** (Published: 2024-07-02) Authors: Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, Ying Tai Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2407.02371v2.pdf) Keywords: diffusion transformer, video generation -- **[Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers](https://arxiv.org/abs/2406.17343v2)** (Published: 2024-06-25) - Authors: Lei Chen, Yuan Meng, Chen Tang, Xinzhu Ma, Jingyan Jiang, Xin Wang, Zhi Wang, Wenwu Zhu - Links: [![PDF](https://img.shields.io/badge/PDF-arXiv-b31b1b.svg)](https://arxiv.org/pdf/2406.17343v2.pdf) | [![GitHub](https://img.shields.io/github/stars/Juanerx/Q-DiT?style=social)](https://github.com/Juanerx/Q-DiT) - Keywords: diffusion transformer, video generation diff --git a/data/papers_2025-01-09.json b/data/papers_2025-01-09.json new file mode 100644 index 0000000..b084912 --- /dev/null +++ b/data/papers_2025-01-09.json @@ -0,0 +1,3977 @@ +[ + { + "title": "ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning", + "authors": [ + "Yuzhou Huang", + "Ziyang Yuan", + "Quande Liu", + "Qiulin Wang", + "Xintao Wang", + "Ruimao Zhang", + "Pengfei Wan", + "Di Zhang", + "Kun Gai" + ], + "abstract": "Text-to-video generation has made remarkable advancements through diffusion models. However, Multi-Concept Video Customization (MCVC) remains a significant challenge. We identify two key challenges in this task: 1) the identity decoupling problem, where directly adopting existing customization methods inevitably mix attributes when handling multiple concepts simultaneously, and 2) the scarcity of high-quality video-entity pairs, which is crucial for training such a model that represents and decouples various concepts well. To address these challenges, we introduce ConceptMaster, an innovative framework that effectively tackles the critical issues of identity decoupling while maintaining concept fidelity in customized videos. Specifically, we introduce a novel strategy of learning decoupled multi-concept embeddings that are injected into the diffusion models in a standalone manner, which effectively guarantees the quality of customized videos with multiple identities, even for highly similar visual concepts. To further overcome the scarcity of high-quality MCVC data, we carefully establish a data construction pipeline, which enables systematic collection of precise multi-concept video-entity data across diverse concepts. A comprehensive benchmark is designed to validate the effectiveness of our model from three critical dimensions: concept fidelity, identity decoupling ability, and video generation quality across six different concept composition scenarios. 
Extensive experiments demonstrate that our ConceptMaster significantly outperforms previous approaches for this task, paving the way for generating personalized and semantically accurate videos across multiple concepts.", + "arxiv_url": "http://arxiv.org/abs/2501.04698v1", + "pdf_url": "http://arxiv.org/pdf/2501.04698v1", + "published_date": "2025-01-08", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Circuit Complexity Bounds for Visual Autoregressive Model", + "authors": [ + "Yekun Ke", + "Xiaoyu Li", + "Yingyu Liang", + "Zhenmei Shi", + "Zhao Song" + ], + "abstract": "Understanding the expressive ability of a specific model is essential for grasping its capacity limitations. Recently, several studies have established circuit complexity bounds for Transformer architecture. Besides, the Visual AutoRegressive (VAR) model has risen to be a prominent method in the field of image generation, outperforming previous techniques, such as Diffusion Transformers, in generating high-quality images. We investigate the circuit complexity of the VAR model and establish a bound in this study. Our primary result demonstrates that the VAR model is equivalent to a simulation by a uniform $\\mathsf{TC}^0$ threshold circuit with hidden dimension $d \\leq O(n)$ and $\\mathrm{poly}(n)$ precision. This is the first study to rigorously highlight the limitations in the expressive power of VAR models despite their impressive performance. We believe our findings will offer valuable insights into the inherent constraints of these models and guide the development of more efficient and expressive architectures in the future.", + "arxiv_url": "http://arxiv.org/abs/2501.04299v1", + "pdf_url": "http://arxiv.org/pdf/2501.04299v1", + "published_date": "2025-01-08", + "categories": [ + "stat.ML", + "cs.AI", + "cs.CC", + "cs.CL", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers", + "authors": [ + "Yuechen Zhang", + "Yaoyang Liu", + "Bin Xia", + "Bohao Peng", + "Zexin Yan", + "Eric Lo", + "Jiaya Jia" + ], + "abstract": "We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that Magic Mirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while requiring minimal parameters added. 
The code and model will be made publicly available at: https://github.com/dvlab-research/MagicMirror/", + "arxiv_url": "http://arxiv.org/abs/2501.03931v1", + "pdf_url": "http://arxiv.org/pdf/2501.03931v1", + "published_date": "2025-01-07", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/dvlab-research/MagicMirror", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "TransPixar: Advancing Text-to-Video Generation with Transparency", + "authors": [ + "Luozhou Wang", + "Yijun Li", + "Zhifei Chen", + "Jui-Hsien Wang", + "Zhifei Zhang", + "He Zhang", + "Zhe Lin", + "Yingcong Chen" + ], + "abstract": "Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.", + "arxiv_url": "http://arxiv.org/abs/2501.03006v1", + "pdf_url": "http://arxiv.org/pdf/2501.03006v1", + "published_date": "2025-01-06", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking", + "authors": [ + "Weikang Bian", + "Zhaoyang Huang", + "Xiaoyu Shi", + "Yijin Li", + "Fu-Yun Wang", + "Hongsheng Li" + ], + "abstract": "4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed as GS-DiT. To boost the training of the GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. 
Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization capabilities and extends the 4D controllability of Gaussian splatting to video generation beyond just camera poses. It supports advanced cinematic effects through the manipulation of the Gaussian field and camera intrinsics, making it a powerful tool for creative video production. Demos are available at https://wkbian.github.io/Projects/GS-DiT/.", + "arxiv_url": "http://arxiv.org/abs/2501.02690v1", + "pdf_url": "http://arxiv.org/pdf/2501.02690v1", + "published_date": "2025-01-05", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "EliGen: Entity-Level Controlled Image Generation with Regional Attention", + "authors": [ + "Hong Zhang", + "Zhongjie Duan", + "Xingjun Wang", + "Yingda Chen", + "Yu Zhang" + ], + "abstract": "Recent advancements in diffusion models have significantly advanced text-to-image generation, yet global text prompts alone remain insufficient for achieving fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-Level controlled Image Generation. We introduce regional attention, a mechanism for diffusion transformers that requires no additional parameters, seamlessly integrating entity prompts and arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both positional control precision and image quality. Additionally, we propose an inpainting fusion pipeline, extending EliGen to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with community models such as IP-Adapter and MLLM, unlocking new creative possibilities. The source code, dataset, and model will be released publicly.", + "arxiv_url": "http://arxiv.org/abs/2501.01097v1", + "pdf_url": "http://arxiv.org/pdf/2501.01097v1", + "published_date": "2025-01-02", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "image generation", + "text-to-image", + "diffusion transformer", + "Control", + "image inpainting" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Dual Diffusion for Unified Image Generation and Understanding", + "authors": [ + "Zijie Li", + "Henry Li", + "Yichun Shi", + "Amir Barati Farimani", + "Yuval Kluger", + "Linjie Yang", + "Peng Wang" + ], + "abstract": "Diffusion models have gained tremendous success in text-to-image generation, yet still lag behind with visual understanding tasks, an area dominated by autoregressive vision-language models. We propose a large-scale and fully end-to-end diffusion model for multi-modal understanding and generation that significantly improves on existing diffusion-based multimodal models, and is the first of its kind to support the full suite of vision-language modeling capabilities. 
Inspired by the multimodal diffusion transformer (MM-DiT) and recent advances in discrete diffusion language modeling, we leverage a cross-modal maximum likelihood estimation framework that simultaneously trains the conditional likelihoods of both images and text jointly under a single loss function, which is back-propagated through both branches of the diffusion transformer. The resulting model is highly flexible and capable of a wide range of tasks including image generation, captioning, and visual question answering. Our model attained competitive performance compared to recent unified image understanding and generation models, demonstrating the potential of multimodal diffusion modeling as a promising alternative to autoregressive next-token prediction models.", + "arxiv_url": "http://arxiv.org/abs/2501.00289v1", + "pdf_url": "http://arxiv.org/pdf/2501.00289v1", + "published_date": "2024-12-31", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Open-Sora: Democratizing Efficient Video Production for All", + "authors": [ + "Zangwei Zheng", + "Xiangyu Peng", + "Tianji Yang", + "Chenhui Shen", + "Shenggui Li", + "Hongxin Liu", + "Yukun Zhou", + "Tianyi Li", + "Yang You" + ], + "abstract": "Vision and language are the two foundational senses for humans, and they build up our cognitive ability and intelligence. While significant breakthroughs have been made in AI language ability, artificial visual intelligence, especially the ability to generate and simulate the world we see, is far lagging behind. To facilitate the development and accessibility of artificial visual intelligence, we created Open-Sora, an open-source video generation model designed to produce high-fidelity video content. Open-Sora supports a wide spectrum of visual generation tasks, including text-to-image generation, text-to-video generation, and image-to-video generation. The model leverages advanced deep learning architectures and training/inference techniques to enable flexible video synthesis, which could generate video content of up to 15 seconds, up to 720p resolution, and arbitrary aspect ratios. Specifically, we introduce Spatial-Temporal Diffusion Transformer (STDiT), an efficient diffusion framework for videos that decouples spatial and temporal attention. We also introduce a highly compressive 3D autoencoder to make representations compact and further accelerate training with an ad hoc training strategy. Through this initiative, we aim to foster innovation, creativity, and inclusivity within the community of AI content creation. By embracing the open-source principle, Open-Sora democratizes full access to all the training/inference/data preparation codes as well as model weights. 
All resources are publicly available at: https://github.com/hpcaitech/Open-Sora.", + "arxiv_url": "http://arxiv.org/abs/2412.20404v1", + "pdf_url": "http://arxiv.org/pdf/2412.20404v1", + "published_date": "2024-12-29", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/hpcaitech/Open-Sora", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation", + "authors": [ + "Lunhao Duan", + "Shanshan Zhao", + "Wenjun Yan", + "Yinglun Li", + "Qing-Guo Chen", + "Zhao Xu", + "Weihua Luo", + "Kaifu Zhang", + "Mingming Gong", + "Gui-Song Xia" + ], + "abstract": "Recently, text-to-image generation models have achieved remarkable advancements, particularly with diffusion models facilitating high-quality image synthesis from textual descriptions. However, these models often struggle with achieving precise control over pixel-level layouts, object appearances, and global styles when using text prompts alone. To mitigate this issue, previous works introduce conditional images as auxiliary inputs for image generation, enhancing control but typically necessitating specialized models tailored to different types of reference inputs. In this paper, we explore a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture, to enable flexible and controllable generation across diverse conditions without the need for multiple specialized models. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions, injecting this information into the image generation process through a cross-attention mechanism enhanced by Rotary Position Embedding. Experimental results across a variety of tasks, including pixel-level spatial control, subject-driven image generation, and style-image-based image synthesis, demonstrate the effectiveness of our UNIC-Adapter in unified controllable image generation.", + "arxiv_url": "http://arxiv.org/abs/2412.18928v1", + "pdf_url": "http://arxiv.org/pdf/2412.18928v1", + "published_date": "2024-12-25", + "categories": [ + "cs.CV", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "Controllable", + "image generation", + "text-to-image", + "diffusion transformer", + "Control" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Accelerating Diffusion Transformers with Dual Feature Caching", + "authors": [ + "Chang Zou", + "Evelyn Zhang", + "Runlin Guo", + "Haohang Xu", + "Conghui He", + "Xuming Hu", + "Linfeng Zhang" + ], + "abstract": "Diffusion Transformers (DiT) have become the dominant methods in image and video generation yet still suffer substantial computational costs. As an effective approach for DiT acceleration, feature caching methods are designed to cache the features of DiT in previous timesteps and reuse them in the next timesteps, allowing us to skip the computation in the next timesteps. However, on the one hand, aggressively reusing all the features cached in previous timesteps leads to a severe drop in generation quality. 
On the other hand, conservatively caching only the features in the redundant layers or tokens but still computing the important ones successfully preserves the generation quality but results in reductions in acceleration ratios. Observing such a tradeoff between generation quality and acceleration performance, this paper begins by quantitatively studying the accumulated error from cached features. Surprisingly, we find that aggressive caching does not introduce significantly more caching errors in the caching step, and the conservative feature caching can fix the error introduced by aggressive caching. Thereby, we propose a dual caching strategy that adopts aggressive and conservative caching iteratively, leading to significant acceleration and high generation quality at the same time. Besides, we further introduce a V-caching strategy for token-wise conservative caching, which is compatible with flash attention and requires no training and calibration data. Our codes have been released in Github: \\textbf{Code: \\href{https://github.com/Shenyi-Z/DuCa}{\\texttt{\\textcolor{cyan}{https://github.com/Shenyi-Z/DuCa}}}}", + "arxiv_url": "http://arxiv.org/abs/2412.18911v1", + "pdf_url": "http://arxiv.org/pdf/2412.18911v1", + "published_date": "2024-12-25", + "categories": [ + "cs.LG", + "cs.AI", + "cs.CV" + ], + "github_url": "https://github.com/Shenyi-Z/DuCa", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "1.58-bit FLUX", + "authors": [ + "Chenglin Yang", + "Celong Liu", + "Xueqing Deng", + "Dongwon Kim", + "Xing Mei", + "Xiaohui Shen", + "Liang-Chieh Chen" + ], + "abstract": "We present 1.58-bit FLUX, the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 x 1024 images. Notably, our quantization method operates without access to image data, relying solely on self-supervision from the FLUX.1-dev model. Additionally, we develop a custom kernel optimized for 1.58-bit operations, achieving a 7.7x reduction in model storage, a 5.1x reduction in inference memory, and improved inference latency. Extensive evaluations on the GenEval and T2I Compbench benchmarks demonstrate the effectiveness of 1.58-bit FLUX in maintaining generation quality while significantly enhancing computational efficiency.", + "arxiv_url": "http://arxiv.org/abs/2412.18653v1", + "pdf_url": "http://arxiv.org/pdf/2412.18653v1", + "published_date": "2024-12-24", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "text-to-image", + "image generation", + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation", + "authors": [ + "Minghong Cai", + "Xiaodong Cun", + "Xiaoyu Li", + "Wenze Liu", + "Zhaoyang Zhang", + "Yong Zhang", + "Ying Shan", + "Xiangyu Yue" + ], + "abstract": "Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer MM-DiT architecture. However, the current video generation models predominantly focus on single-prompt, struggling to generate coherent scenes with multiple sequential prompts that better reflect real-world dynamic scenarios. 
While some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, a training-free multi-prompt video generation method under MM-DiT architectures for the first time. Our key idea is to take the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models, enabling mask-guided precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts without additional training. Besides, we also present MPVBench, a new benchmark specially designed for multi-prompt video generation to evaluate the performance of multi-prompt generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.", + "arxiv_url": "http://arxiv.org/abs/2412.18597v1", + "pdf_url": "http://arxiv.org/pdf/2412.18597v1", + "published_date": "2024-12-24", + "categories": [ + "cs.CV", + "cs.AI", + "cs.MM" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video editing", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "FFA Sora, video generation as fundus fluorescein angiography simulator", + "authors": [ + "Xinyuan Wu", + "Lili Wang", + "Ruoyu Chen", + "Bowen Liu", + "Weiyi Zhang", + "Xi Yang", + "Yifan Feng", + "Mingguang He", + "Danli Shi" + ], + "abstract": "Fundus fluorescein angiography (FFA) is critical for diagnosing retinal vascular diseases, but beginners often struggle with image interpretation. This study develops FFA Sora, a text-to-video model that converts FFA reports into dynamic videos via a Wavelet-Flow Variational Autoencoder (WF-VAE) and a diffusion transformer (DiT). Trained on an anonymized dataset, FFA Sora accurately simulates disease features from the input text, as confirmed by objective metrics: Frechet Video Distance (FVD) = 329.78, Learned Perceptual Image Patch Similarity (LPIPS) = 0.48, and Visual-question-answering Score (VQAScore) = 0.61. Specific evaluations showed acceptable alignment between the generated videos and textual prompts, with BERTScore of 0.35. Additionally, the model demonstrated strong privacy-preserving performance in retrieval evaluations, achieving an average Recall@K of 0.073. Human assessments indicated satisfactory visual quality, with an average score of 1.570(scale: 1 = best, 5 = worst). 
This model addresses privacy concerns associated with sharing large-scale FFA data and enhances medical education.", + "arxiv_url": "http://arxiv.org/abs/2412.17346v1", + "pdf_url": "http://arxiv.org/pdf/2412.17346v1", + "published_date": "2024-12-23", + "categories": [ + "cs.CV", + "cs.AI" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers", + "authors": [ + "Haoran You", + "Connelly Barnes", + "Yuqian Zhou", + "Yan Kang", + "Zhenbang Du", + "Wei Zhou", + "Lingzhi Zhang", + "Yotam Nitzan", + "Xiaoyang Liu", + "Zhe Lin", + "Eli Shechtman", + "Sohrab Amirghodsi", + "Yingyan Celine Lin" + ], + "abstract": "Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency, making them difficult to deploy on resource-constrained devices. One key efficiency bottleneck is that existing DiTs apply equal computation across all regions of an image. However, not all image tokens are equally important, and certain localized areas require more computation, such as objects. To address this, we propose DiffRatio-MoD, a dynamic DiT inference framework with differentiable compression ratios, which automatically learns to dynamically route computation across layers and timesteps for each image token, resulting in Mixture-of-Depths (MoD) efficient DiT models. Specifically, DiffRatio-MoD integrates three features: (1) A token-level routing scheme where each DiT layer includes a router that is jointly fine-tuned with model weights to predict token importance scores. In this way, unimportant tokens bypass the entire layer's computation; (2) A layer-wise differentiable ratio mechanism where different DiT layers automatically learn varying compression ratios from a zero initialization, resulting in large compression ratios in redundant layers while others remain less compressed or even uncompressed; (3) A timestep-wise differentiable ratio mechanism where each denoising timestep learns its own compression ratio. The resulting pattern shows higher ratios for noisier timesteps and lower ratios as the image becomes clearer. Extensive experiments on both text-to-image and inpainting tasks show that DiffRatio-MoD effectively captures dynamism across token, layer, and timestep axes, achieving superior trade-offs between generation quality and efficiency compared to prior works.", + "arxiv_url": "http://arxiv.org/abs/2412.16822v1", + "pdf_url": "http://arxiv.org/pdf/2412.16822v1", + "published_date": "2024-12-22", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up", + "authors": [ + "Songhua Liu", + "Zhenxiong Tan", + "Xinchao Wang" + ], + "abstract": "Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when generating high-resolution images. To address this issue, we aim at a linear attention mechanism in this paper that reduces the complexity of pre-trained DiTs to linear. 
We begin our exploration with a comprehensive summary of existing efficient attention mechanisms and identify four key factors crucial for successful linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Based on these insights, we introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token, and thus achieves linear complexity. Our experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. Simultaneously, it reduces attention computations by 99.5% and accelerates generation by 6.3 times for generating 8K-resolution images. Furthermore, we investigate favorable properties in the distilled attention layers, such as zero-shot generalization cross various models and plugins, and improved support for multi-GPU parallel inference. Models and codes are available here: https://github.com/Huage001/CLEAR.", + "arxiv_url": "http://arxiv.org/abs/2412.16112v1", + "pdf_url": "http://arxiv.org/pdf/2412.16112v1", + "published_date": "2024-12-20", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/Huage001/CLEAR", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Efficient Scaling of Diffusion Transformers for Text-to-Image Generation", + "authors": [ + "Hao Li", + "Shamit Lal", + "Zhiheng Li", + "Yusheng Xie", + "Ying Wang", + "Yang Zou", + "Orchid Majumder", + "R. Manmatha", + "Zhuowen Tu", + "Stefano Ermon", + "Stefano Soatto", + "Ashwin Swaminathan" + ], + "abstract": "We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B upto 8B parameters on datasets up to 600M images. We find that U-ViT, a pure self-attention based DiT model provides a simpler design and scales more effectively in comparison with cross-attention based DiT variants, which allows straightforward expansion for extra conditions and other modalities. We identify a 2.3B U-ViT model can get better performance than SDXL UNet and other DiT variants in controlled setting. On the data scaling side, we investigate how increasing dataset size and enhanced long caption improve the text-image alignment performance and the learning efficiency.", + "arxiv_url": "http://arxiv.org/abs/2412.12391v1", + "pdf_url": "http://arxiv.org/pdf/2412.12391v1", + "published_date": "2024-12-16", + "categories": [ + "cs.CV", + "cs.CL", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Causal Diffusion Transformers for Generative Modeling", + "authors": [ + "Chaorui Deng", + "Deyao Zhu", + "Kunchang Li", + "Shi Guang", + "Haoqi Fan" + ], + "abstract": "We introduce Causal Diffusion as the autoregressive (AR) counterpart of Diffusion models. It is a next-token(s) forecasting framework that is friendly to both discrete and continuous modalities and compatible with existing next-token prediction models like LLaMA and GPT. 
While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization to a diffusion model can substantially improve its performance and enables a smooth transition between AR and diffusion generation modes. Hence, we propose CausalFusion - a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels, leading to state-of-the-art results on the ImageNet generation benchmark while also enjoying the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase CausalFusion's ability for zero-shot in-context image manipulations. We hope that this work could provide the community with a fresh perspective on training multimodal models over discrete and continuous data.", + "arxiv_url": "http://arxiv.org/abs/2412.12095v2", + "pdf_url": "http://arxiv.org/pdf/2412.12095v2", + "published_date": "2024-12-16", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Video Diffusion Transformers are In-Context Learners", + "authors": [ + "Zhengcong Fei", + "Di Qiu", + "Changqian Yu", + "Debang Li", + "Mingyuan Fan", + "Xiang Wen" + ], + "abstract": "This paper investigates a solution for enabling in-context capabilities of video diffusion transformers, with minimal tuning required for activation. Specifically, we propose a simple pipeline to leverage in-context generation: ($\\textbf{i}$) concatenate videos along spacial or time dimension, ($\\textbf{ii}$) jointly caption multi-scene video clips from one source, and ($\\textbf{iii}$) apply task-specific fine-tuning using carefully curated small datasets. Through a series of diverse controllable tasks, we demonstrate qualitatively that existing advanced text-to-video models can effectively perform in-context generation. Notably, it allows for the creation of consistent multi-scene videos exceeding 30 seconds in duration, without additional computational overhead. Importantly, this method requires no modifications to the original models, results in high-fidelity video outputs that better align with prompt specifications and maintain role consistency. Our framework presents a valuable tool for the research community and offers critical insights for advancing product-level controllable video generation systems. The data, code, and model weights are publicly available at: \\url{https://github.com/feizc/Video-In-Context}.", + "arxiv_url": "http://arxiv.org/abs/2412.10783v2", + "pdf_url": "http://arxiv.org/pdf/2412.10783v2", + "published_date": "2024-12-14", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/feizc/Video-In-Context", + "keywords": [ + "diffusion transformer", + "Controllable", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity", + "authors": [ + "Hongjie Wang", + "Chih-Yao Ma", + "Yen-Cheng Liu", + "Ji Hou", + "Tao Xu", + "Jialiang Wang", + "Felix Juefei-Xu", + "Yaqiao Luo", + "Peizhao Zhang", + "Tingbo Hou", + "Peter Vajda", + "Niraj K. 
Jha", + "Xiaoliang Dai" + ], + "abstract": "Text-to-video generation enhances content creation but is highly computationally intensive: The computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10-20 seconds length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally-dominant and quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and improves the consistency of generated videos significantly. Experimental results show that LinGen outperforms DiT (with a 75.6% win rate) in video quality with up to 15$\\times$ (11.5$\\times$) FLOPs (latency) reduction. Furthermore, both automatic metrics and human evaluation demonstrate our LinGen-4B yields comparable video quality to state-of-the-art models (with a 50.5%, 52.1%, 49.1% win rate with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples in our project website: https://lineargen.github.io/.", + "arxiv_url": "http://arxiv.org/abs/2412.09856v1", + "pdf_url": "http://arxiv.org/pdf/2412.09856v1", + "published_date": "2024-12-13", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG", + "eess.IV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion", + "authors": [ + "Xunnong Xu", + "Mengying Cao" + ], + "abstract": "Diffusion transformers enable flexible generative modeling for video. However, it is still technically challenging and computationally expensive to generate high-resolution videos with rich semantics and complex motion. Similar to languages, video data are also auto-regressive by nature, so it is counter-intuitive to use attention mechanism with bi-directional dependency in the model. Here we propose a Multi-Scale Causal (MSC) framework to address these problems. Specifically, we introduce multiple resolutions in the spatial dimension and high-low frequencies in the temporal dimension to realize efficient attention calculation. Furthermore, attention blocks on multiple scales are combined in a controlled way to allow causal conditioning on noisy image frames for diffusion training, based on the idea that noise destroys information at different rates on different resolutions. We theoretically show that our approach can greatly reduce the computational complexity and enhance the efficiency of training. 
The causal attention diffusion framework can also be used for auto-regressive long video generation, without violating the natural order of frame sequences.", + "arxiv_url": "http://arxiv.org/abs/2412.09828v1", + "pdf_url": "http://arxiv.org/pdf/2412.09828v1", + "published_date": "2024-12-13", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG", + "authors": [ + "Kavana Venkatesh", + "Yusuf Dalva", + "Ismini Lourentzou", + "Pinar Yanardag" + ], + "abstract": "We introduce a novel approach to enhance the capabilities of text-to-image models by incorporating a graph-based RAG. Our system dynamically retrieves detailed character information and relational data from the knowledge graph, enabling the generation of visually accurate and contextually rich images. This capability significantly improves upon the limitations of existing T2I models, which often struggle with the accurate depiction of complex or culturally specific subjects due to dataset constraints. Furthermore, we propose a novel self-correcting mechanism for text-to-image models to ensure consistency and fidelity in visual outputs, leveraging the rich context from the graph to guide corrections. Our qualitative and quantitative experiments demonstrate that Context Canvas significantly enhances the capabilities of popular models such as Flux, Stable Diffusion, and DALL-E, and improves the functionality of ControlNet for fine-grained image editing tasks. To our knowledge, Context Canvas represents the first application of graph-based RAG in enhancing T2I models, representing a significant advancement for producing high-fidelity, context-aware multi-faceted images.", + "arxiv_url": "http://arxiv.org/abs/2412.09614v1", + "pdf_url": "http://arxiv.org/pdf/2412.09614v1", + "published_date": "2024-12-12", + "categories": [ + "cs.CV", + "cs.CL" + ], + "github_url": "", + "keywords": [ + "text-to-image", + "image editing", + "Control", + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers", + "authors": [ + "Yusuf Dalva", + "Kavana Venkatesh", + "Pinar Yanardag" + ], + "abstract": "Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual generation, rectified flow models often struggle with disentangled editing of images. This limitation prevents the ability to perform precise, attribute-specific modifications without affecting unrelated aspects of the image. In this paper, we introduce FluxSpace, a domain-agnostic image editing method leveraging a representation space with the ability to control the semantics of images generated by rectified flow transformers, such as Flux. By leveraging the representations learned by the transformer blocks within the rectified flow models, we propose a set of semantically interpretable representations that enable a wide range of image editing tasks, from fine-grained image editing to artistic creation. 
This work offers a scalable and effective image editing approach, along with its disentanglement capabilities.", + "arxiv_url": "http://arxiv.org/abs/2412.09611v1", + "pdf_url": "http://arxiv.org/pdf/2412.09611v1", + "published_date": "2024-12-12", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "image generation", + "image editing", + "Control", + "FLUX", + "rectified flow" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Multimodal Latent Language Modeling with Next-Token Diffusion", + "authors": [ + "Yutao Sun", + "Hangbo Bao", + "Wenhui Wang", + "Zhiliang Peng", + "Li Dong", + "Shaohan Huang", + "Jianyong Wang", + "Furu Wei" + ], + "abstract": "Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop $\\sigma$-VAE to address the challenges of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models in the setting of scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10x fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.", + "arxiv_url": "http://arxiv.org/abs/2412.08635v1", + "pdf_url": "http://arxiv.org/pdf/2412.08635v1", + "published_date": "2024-12-11", + "categories": [ + "cs.CL", + "cs.CV", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "From Slow Bidirectional to Fast Autoregressive Video Diffusion Models", + "authors": [ + "Tianwei Yin", + "Qiang Zhang", + "Richard Zhang", + "William T. Freeman", + "Fredo Durand", + "Eli Shechtman", + "Xun Huang" + ], + "abstract": "Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence, including the future. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. To further reduce latency, we extend distribution matching distillation (DMD) to videos, distilling 50-step diffusion model into a 4-step generator. To enable stable and high-quality distillation, we introduce a student initialization scheme based on teacher's ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. 
This approach effectively mitigates error accumulation in autoregressive generation, allowing long-duration video synthesis despite training on short clips. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models. It enables fast streaming generation of high-quality videos at 9.4 FPS on a single GPU thanks to KV caching. Our approach also enables streaming video-to-video translation, image-to-video, and dynamic prompting in a zero-shot manner. We will release the code based on an open-source model in the future.", + "arxiv_url": "http://arxiv.org/abs/2412.07772v2", + "pdf_url": "http://arxiv.org/pdf/2412.07772v2", + "published_date": "2024-12-10", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "STIV: Scalable Text and Image Conditioned Video Generation", + "authors": [ + "Zongyu Lin", + "Wei Liu", + "Chen Chen", + "Jiasen Lu", + "Wenze Hu", + "Tsu-Jui Fu", + "Jesse Allardice", + "Zhengfeng Lai", + "Liangchen Song", + "Bowen Zhang", + "Cha Chen", + "Yiran Fei", + "Yifan Jiang", + "Lezhi Li", + "Yizhou Sun", + "Kai-Wei Chang", + "Yinfei Yang" + ], + "abstract": "The field of video generation has made remarkable advancements, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method, named STIV. Our framework integrates image condition into a Diffusion Transformer (DiT) through frame replacement, while incorporating text conditioning via a joint image-text conditional classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, multi-view generation, and long video generation, etc. With comprehensive ablation studies on T2I, T2V, and TI2V, STIV demonstrate strong performance, despite its simple design. An 8.7B model with 512 resolution achieves 83.1 on VBench T2V, surpassing both leading open and closed-source models like CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on VBench I2V task at 512 resolution. 
By providing a transparent and extensible recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress toward more versatile and reliable video generation solutions.", + "arxiv_url": "http://arxiv.org/abs/2412.07730v1", + "pdf_url": "http://arxiv.org/pdf/2412.07730v1", + "published_date": "2024-12-10", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG", + "cs.MM" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer", + "authors": [ + "Jinyi Hu", + "Shengding Hu", + "Yuxuan Song", + "Yufei Huang", + "Mingxuan Wang", + "Hao Zhou", + "Zhiyuan Liu", + "Wei-Ying Ma", + "Maosong Sun" + ], + "abstract": "The recent surge of interest in comprehensive multimodal models has necessitated the unification of diverse modalities. However, the unification suffers from disparate methodologies. Continuous visual generation necessitates the full-sequence diffusion-based approach, despite its divergence from the autoregressive modeling in the text domain. We posit that autoregressive modeling, i.e., predicting the future based on past deterministic experience, remains crucial in developing both a visual generation model and a potential unified multimodal model. In this paper, we explore an interpolation between the autoregressive modeling and full-parameters diffusion to model visual information. At its core, we present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer, where the block size of diffusion, i.e., the size of autoregressive units, can be flexibly adjusted to interpolate between token-wise autoregression and full-sequence diffusion. ACDiT is easy to implement, as simple as creating a Skip-Causal Attention Mask (SCAM) during training. During inference, the process iterates between diffusion denoising and autoregressive decoding that can make full use of KV-Cache. We verify the effectiveness of ACDiT on image and video generation tasks. We also demonstrate that benefitted from autoregressive modeling, ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective. The analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT to be used in long-horizon visual generation tasks. These strengths make it promising as the backbone of future unified models.", + "arxiv_url": "http://arxiv.org/abs/2412.07720v1", + "pdf_url": "http://arxiv.org/pdf/2412.07720v1", + "published_date": "2024-12-10", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "FlexDiT: Dynamic Token Density Control for Diffusion Transformer", + "authors": [ + "Shuning Chang", + "Pichao Wang", + "Jiasheng Tang", + "Yi Yang" + ], + "abstract": "Diffusion Transformers (DiT) deliver impressive generative performance but face prohibitive computational demands due to both the quadratic complexity of token-based self-attention and the need for extensive sampling steps. While recent research has focused on accelerating sampling, the structural inefficiencies of DiT remain underexplored. We propose FlexDiT, a framework that dynamically adapts token density across both spatial and temporal dimensions to achieve computational efficiency without compromising generation quality. 
Spatially, FlexDiT employs a three-segment architecture that allocates token density based on feature requirements at each layer: Poolingformer in the bottom layers for efficient global feature extraction, Sparse-Dense Token Modules (SDTM) in the middle layers to balance global context with local detail, and dense tokens in the top layers to refine high-frequency details. Temporally, FlexDiT dynamically modulates token density across denoising stages, progressively increasing token count as finer details emerge in later timesteps. This synergy between FlexDiT's spatially adaptive architecture and its temporal pruning strategy enables a unified framework that balances efficiency and fidelity throughout the generation process. Our experiments demonstrate FlexDiT's effectiveness, achieving a 55% reduction in FLOPs and a 175% improvement in inference speed on DiT-XL with only a 0.09 increase in FID score on 512$\\times$512 ImageNet images, a 56% reduction in FLOPs across video generation datasets including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD, and a 69% improvement in inference speed on PixArt-$\\alpha$ on text-to-image generation task with a 0.24 FID score decrease. FlexDiT provides a scalable solution for high-quality diffusion-based generation compatible with further sampling optimization techniques.", + "arxiv_url": "http://arxiv.org/abs/2412.06028v1", + "pdf_url": "http://arxiv.org/pdf/2412.06028v1", + "published_date": "2024-12-08", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "image generation", + "text-to-image", + "video generation", + "diffusion transformer", + "Control" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation", + "authors": [ + "Shuwei Shi", + "Biao Gong", + "Xi Chen", + "Dandan Zheng", + "Shuai Tan", + "Zizheng Yang", + "Yuyuan Li", + "Jingwen He", + "Kecheng Zheng", + "Jingdong Chen", + "Ming Yang", + "Yinqiang Zheng" + ], + "abstract": "The image-to-video (I2V) generation is conditioned on the static image, which has been enhanced recently by the motion intensity as an additional control signal. These motion-aware models are appealing to generate diverse motion patterns, yet there lacks a reliable motion estimator for training such models on large-scale video set in the wild. Traditional metrics, e.g., SSIM or optical flow, are hard to generalize to arbitrary videos, while, it is very tough for human annotators to label the abstract motion intensity neither. Furthermore, the motion intensity shall reveal both local object motion and global camera movement, which has not been studied before. This paper addresses the challenge with a new motion estimator, capable of measuring the decoupled motion intensities of objects and cameras in video. We leverage the contrastive learning on randomly paired videos and distinguish the video with greater motion intensity. Such a paradigm is friendly for annotation and easy to scale up to achieve stable performance on motion estimation. We then present a new I2V model, named MotionStone, developed with the decoupled motion estimator. Experimental results demonstrate the stability of the proposed motion estimator and the state-of-the-art performance of MotionStone on I2V generation. 
These advantages warrant the decoupled motion estimator to serve as a general plug-in enhancer for both data processing and video generation training.", + "arxiv_url": "http://arxiv.org/abs/2412.05848v1", + "pdf_url": "http://arxiv.org/pdf/2412.05848v1", + "published_date": "2024-12-08", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Self-Guidance: Boosting Flow and Diffusion Generation on Their Own", + "authors": [ + "Tiancheng Li", + "Weijian Luo", + "Zhiyang Chen", + "Liyuan Ma", + "Guo-Jun Qi" + ], + "abstract": "Proper guidance strategies are essential to get optimal generation results without re-training diffusion and flow-based text-to-image models. However, existing guidances either require specific training or strong inductive biases of neural network architectures, potentially limiting their applications. To address these issues, in this paper, we introduce Self-Guidance (SG), a strong diffusion guidance that neither needs specific training nor requires certain forms of neural network architectures. Different from previous approaches, the Self-Guidance calculates the guidance vectors by measuring the difference between the velocities of two successive diffusion timesteps. Therefore, SG can be readily applied for both conditional and unconditional models with flexible network architectures. We conduct intensive experiments on both text-to-image generation and text-to-video generations across flexible architectures including UNet-based models and diffusion transformer-based models. On current state-of-the-art diffusion models such as Stable Diffusion 3.5 and FLUX, SG significantly boosts the image generation performance in terms of FID, and Human Preference Scores. Moreover, we find that SG has a surprisingly positive effect on the generation of high-quality human bodies such as hands, faces, and arms, showing strong potential to overcome traditional challenges on human body generations with minimal effort. We will release our implementation of SG on SD 3.5 and FLUX models along with this paper.", + "arxiv_url": "http://arxiv.org/abs/2412.05827v1", + "pdf_url": "http://arxiv.org/pdf/2412.05827v1", + "published_date": "2024-12-08", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "image generation", + "text-to-image", + "video generation", + "diffusion transformer", + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Language-Guided Image Tokenization for Generation", + "authors": [ + "Kaiwen Zha", + "Lijun Yu", + "Alireza Fathi", + "David A. Ross", + "Cordelia Schmid", + "Dina Katabi", + "Xiuye Gu" + ], + "abstract": "Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics. 
By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenization process to focus on encoding fine-grained visual details into latent tokens, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet-256 and -512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5x inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively. Furthermore, we demonstrate TexTok's superiority on the text-to-image generation task, effectively utilizing the off-the-shelf text captions in tokenization.", + "arxiv_url": "http://arxiv.org/abs/2412.05796v1", + "pdf_url": "http://arxiv.org/pdf/2412.05796v1", + "published_date": "2024-12-08", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Mind the Time: Temporally-Controlled Multi-Event Video Generation", + "authors": [ + "Ziyi Wu", + "Aliaksandr Siarohin", + "Willi Menapace", + "Ivan Skorokhodov", + "Yuwei Fang", + "Varnith Chordia", + "Igor Gilitschenski", + "Sergey Tulyakov" + ], + "abstract": "Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. 
Extensive experiments demonstrate that MinT outperforms existing open-source models by a large margin.", + "arxiv_url": "http://arxiv.org/abs/2412.05263v1", + "pdf_url": "http://arxiv.org/pdf/2412.05263v1", + "published_date": "2024-12-06", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation", + "authors": [ + "Hui Zhang", + "Dexiang Hong", + "Tingwei Gao", + "Yitong Wang", + "Jie Shao", + "Xinglong Wu", + "Zuxuan Wu", + "Yu-Gang Jiang" + ], + "abstract": "Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and limited effort has been devoted to Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate the competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse them at a later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box and a detailed description. We further construct the LayoutSAM-Eval benchmark as a comprehensive tool for evaluating the L2I generation quality. Finally, we introduce the Layout Designer, which taps into the potential of large language models in layout planning, transforming them into experts in layout generation and optimization. Our code, model, and dataset will be available at https://creatilayout.github.io.", + "arxiv_url": "http://arxiv.org/abs/2412.03859v1", + "pdf_url": "http://arxiv.org/pdf/2412.03859v1", + "published_date": "2024-12-05", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Controllable", + "Control", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Navigation World Models", + "authors": [ + "Amir Bar", + "Gaoyue Zhou", + "Danny Tran", + "Trevor Darrell", + "Yann LeCun" + ], + "abstract": "Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. 
In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.", + "arxiv_url": "http://arxiv.org/abs/2412.03572v1", + "pdf_url": "http://arxiv.org/pdf/2412.03572v1", + "published_date": "2024-12-04", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG", + "cs.RO" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Controllable", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention", + "authors": [ + "Hannan Lu", + "Xiaohe Wu", + "Shudong Wang", + "Xiameng Qin", + "Xinyu Zhang", + "Junyu Han", + "Wangmeng Zuo", + "Ji Tao" + ], + "abstract": "Generating multi-view videos for autonomous driving training has recently gained much attention, with the challenge of addressing both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms for spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly when handling fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored for CogDriving, i.e., Micro-Controller, which uses only 1.1% of the parameters of the standard ControlNet, enabling precise control over Bird's-Eye-View layouts. To enhance the generation of object instances crucial for autonomous driving, we propose a re-weighted learning objective, dynamically adjusting the learning weights for object instances during training. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos. The project can be found at https://luhannan.github.io/CogDrivingPage/.", + "arxiv_url": "http://arxiv.org/abs/2412.03520v2", + "pdf_url": "http://arxiv.org/pdf/2412.03520v2", + "published_date": "2024-12-04", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "MaterialPicker: Multi-Modal Material Generation with Diffusion Transformers", + "authors": [ + "Xiaohe Ma", + "Valentin Deschaintre", + "Miloš Hašan", + "Fujun Luan", + "Kun Zhou", + "Hongzhi Wu", + "Yiwei Hu" + ], + "abstract": "High-quality material generation is key for virtual environment authoring and inverse rendering. We propose MaterialPicker, a multi-modal material generator leveraging a Diffusion Transformer (DiT) architecture, improving and simplifying the creation of high-quality materials from text prompts and/or photographs. 
Our method can generate a material based on an image crop of a material sample, even if the captured surface is distorted, viewed at an angle or partially occluded, as is often the case in photographs of natural scenes. We further allow the user to specify a text prompt to provide additional guidance for the generation. We finetune a pre-trained DiT-based video generator into a material generator, where each material map is treated as a frame in a video sequence. We evaluate our approach both quantitatively and qualitatively and show that it enables more diverse material generation and better distortion correction than previous work.", + "arxiv_url": "http://arxiv.org/abs/2412.03225v2", + "pdf_url": "http://arxiv.org/pdf/2412.03225v2", + "published_date": "2024-12-04", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Panoptic Diffusion Models: co-generation of images and segmentation maps", + "authors": [ + "Yinghan Long", + "Kaushik Roy" + ], + "abstract": "Recently, diffusion models have demonstrated impressive capabilities in text-guided and image-conditioned image generation. However, existing diffusion models cannot simultaneously generate a segmentation map of objects and a corresponding image from the prompt. Previous attempts either generate segmentation maps based on the images or provide maps as input conditions to control image generation, limiting their functionality to given inputs. Incorporating an inherent understanding of the scene layouts can improve the creativity and realism of diffusion models. To address this limitation, we present Panoptic Diffusion Model (PDM), the first model designed to generate both images and panoptic segmentation maps concurrently. PDM bridges the gap between image and text by constructing segmentation layouts that provide detailed, built-in guidance throughout the generation process. This ensures the inclusion of categories mentioned in text prompts and enriches the diversity of segments within the background. We demonstrate the effectiveness of PDM across two architectures: a unified diffusion transformer and a two-stream transformer with a pretrained backbone. To facilitate co-generation with fewer sampling steps, we incorporate a fast diffusion solver into PDM. Additionally, when ground-truth maps are available, PDM can function as a text-guided image-to-image generation model. Finally, we propose a novel metric for evaluating the quality of generated maps and show that PDM achieves state-of-the-art results in image generation with implicit scene control.", + "arxiv_url": "http://arxiv.org/abs/2412.02929v1", + "pdf_url": "http://arxiv.org/pdf/2412.02929v1", + "published_date": "2024-12-04", + "categories": [ + "cs.CV", + "cs.AI" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text", + "authors": [ + "Haohe Liu", + "Gael Le Lan", + "Xinhao Mei", + "Zhaoheng Ni", + "Anurag Kumar", + "Varun Nagaraja", + "Wenwu Wang", + "Mark D. Plumbley", + "Yangyang Shi", + "Vikas Chandra" + ], + "abstract": "Video and audio are closely correlated modalities that humans naturally perceive together. 
While recent advancements have enabled the generation of audio or video from text, producing both modalities simultaneously still typically relies on either a cascaded process or multi-modal contrastive encoders. These approaches, however, often lead to suboptimal results due to inherent information losses during inference and conditioning. In this paper, we introduce SyncFlow, a system that is capable of simultaneously generating temporally synchronized audio and video from text. The core of SyncFlow is the proposed dual-diffusion-transformer (d-DiT) architecture, which enables joint video and audio modelling with proper information fusion. To efficiently manage the computational cost of joint audio and video modelling, SyncFlow utilizes a multi-stage training strategy that separates video and audio learning before joint fine-tuning. Our empirical evaluations demonstrate that SyncFlow produces audio and video outputs that are more correlated than those of baseline methods, with significantly enhanced audio quality and audio-visual correspondence. Moreover, we demonstrate strong zero-shot capabilities of SyncFlow, including zero-shot video-to-audio generation and adaptation to novel video resolutions without further training.", + "arxiv_url": "http://arxiv.org/abs/2412.15220v1", + "pdf_url": "http://arxiv.org/pdf/2412.15220v1", + "published_date": "2024-12-03", + "categories": [ + "cs.MM", + "cs.SD", + "eess.AS" + ], + "github_url": "", + "keywords": [ + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis", + "authors": [ + "Yu Yuan", + "Xijun Wang", + "Yichen Sheng", + "Prateek Chennuri", + "Xingguang Zhang", + "Stanley Chan" + ], + "abstract": "Image generation today can produce somewhat realistic images from text prompts. However, if one asks the generator to synthesize a particular camera setting such as creating different fields of view using a 24mm lens versus a 70mm lens, the generator will not be able to interpret and generate scene-consistent images. This limitation not only hinders the adoption of generative tools in photography applications but also exemplifies a broader issue of bridging the gap between data-driven models and the physical world. In this paper, we introduce the concept of Generative Photography, a framework designed to control camera intrinsic settings during content generation. The core innovations of this work are the concepts of Dimensionality Lifting and Contrastive Camera Learning, which achieve continuous and consistent transitions for different camera settings. 
Experimental results show that our method produces significantly more scene-consistent photorealistic images than state-of-the-art models such as Stable Diffusion 3 and FLUX.", + "arxiv_url": "http://arxiv.org/abs/2412.02168v2", + "pdf_url": "http://arxiv.org/pdf/2412.02168v2", + "published_date": "2024-12-03", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "text-to-image", + "Control", + "image generation", + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "World-consistent Video Diffusion with Explicit 3D Modeling", + "authors": [ + "Qihang Zhang", + "Shuangfei Zhai", + "Miguel Angel Bautista", + "Kevin Miao", + "Alexander Toshev", + "Joshua Susskind", + "Jiatao Gu" + ], + "abstract": "Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.", + "arxiv_url": "http://arxiv.org/abs/2412.01821v1", + "pdf_url": "http://arxiv.org/pdf/2412.01821v1", + "published_date": "2024-12-02", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "image generation", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "CPA: Camera-pose-awareness Diffusion Transformer for Video Generation", + "authors": [ + "Yuelei Wang", + "Jian Zhang", + "Pengtao Jiang", + "Hao Zhang", + "Jinwei Chen", + "Bo Li" + ], + "abstract": "Despite the significant advancements made by Diffusion Transformer (DiT)-based methods in video generation, there remains a notable gap with controllable camera pose perspectives. Existing works such as OpenSora do NOT adhere precisely to anticipated trajectories and physical interactions, thereby limiting the flexibility in downstream applications. To alleviate this issue, we introduce CPA, a unified camera-pose-awareness text-to-video generation approach that elaborates the camera movement and integrates the textual, visual, and spatial conditions. Specifically, we deploy the Sparse Motion Encoding (SME) module to transform camera pose information into a spatial-temporal embedding and activate the Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block. Our plug-in architecture accommodates the original DiT parameters, facilitating diverse types of camera poses and flexible object movement. 
Extensive qualitative and quantitative experiments demonstrate that our method outperforms LDM-based methods for long video generation while achieving optimal performance in trajectory consistency and object consistency.", + "arxiv_url": "http://arxiv.org/abs/2412.01429v1", + "pdf_url": "http://arxiv.org/pdf/2412.01429v1", + "published_date": "2024-12-02", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Controllable", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "TinyFusion: Diffusion Transformers Learned Shallow", + "authors": [ + "Gongfan Fang", + "Kunjun Li", + "Xinyin Ma", + "Xinchao Wang" + ], + "abstract": "Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization, resulting in considerable inference overhead in real-world applications. In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. The core principle of our approach is to create a pruned model with high recoverability, allowing it to regain strong performance after fine-tuning. To accomplish this, we introduce a differentiable sampling technique to make pruning learnable, paired with a co-optimized parameter to simulate future fine-tuning. While prior works focus on minimizing loss or error after pruning, our method explicitly models and optimizes the post-fine-tuning performance of pruned models. Experimental results indicate that this learnable paradigm offers substantial benefits for layer pruning of diffusion transformers, surpassing existing importance-based and error-based methods. Additionally, TinyFusion exhibits strong generalization across diverse architectures, such as DiTs, MARs, and SiTs. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2$\\times$ speedup with an FID score of 2.86, outperforming competitors with comparable efficiency. Code is available at https://github.com/VainF/TinyFusion.", + "arxiv_url": "http://arxiv.org/abs/2412.01199v1", + "pdf_url": "http://arxiv.org/pdf/2412.01199v1", + "published_date": "2024-12-02", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG" + ], + "github_url": "https://github.com/VainF/TinyFusion", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks", + "authors": [ + "Jiahao Cui", + "Hui Li", + "Yun Zhan", + "Hanlin Shang", + "Kaihui Cheng", + "Yuqi Ma", + "Shan Mu", + "Hang Zhou", + "Jingdong Wang", + "Siyu Zhu" + ], + "abstract": "Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. 
To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: https://fudan-generative-vision.github.io/hallo3/.", + "arxiv_url": "http://arxiv.org/abs/2412.00733v3", + "pdf_url": "http://arxiv.org/pdf/2412.00733v3", + "published_date": "2024-12-01", + "categories": [ + "cs.CV", + "cs.GR", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "AMO Sampler: Enhancing Text Rendering with Overshooting", + "authors": [ + "Xixi Hu", + "Keyang Xu", + "Bo Liu", + "Qiang Liu", + "Hongliang Fei" + ], + "abstract": "Achieving precise alignment between textual instructions and generated images in text-to-image generation is a significant challenge, particularly in rendering written text within images. State-of-the-art models like Stable Diffusion 3 (SD3), Flux, and AuraFlow still struggle with accurate text depiction, resulting in misspelled or inconsistent text. We introduce a training-free method with minimal computational overhead that significantly enhances text rendering quality. Specifically, we introduce an overshooting sampler for pretrained rectified flow (RF) models, by alternating between over-simulating the learned ordinary differential equation (ODE) and reintroducing noise. Compared to the Euler sampler, the overshooting sampler effectively introduces an extra Langevin dynamics term that can help correct the compounding error from successive Euler steps and therefore improve the text rendering. However, when the overshooting strength is high, we observe over-smoothing artifacts on the generated images. To address this issue, we propose an Attention Modulated Overshooting sampler (AMO), which adaptively controls the strength of overshooting for each image patch according to its attention score with the text content. AMO demonstrates a 32.3% and 35.9% improvement in text rendering accuracy on SD3 and Flux without compromising overall image quality or increasing inference cost.", + "arxiv_url": "http://arxiv.org/abs/2411.19415v1", + "pdf_url": "http://arxiv.org/pdf/2411.19415v1", + "published_date": "2024-11-28", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "image generation", + "text-to-image", + "Control", + "FLUX", + "rectified flow" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation", + "authors": [ + "Hui Li", + "Mingwang Xu", + "Yun Zhan", + "Shan Mu", + "Jiaye Li", + "Kaihui Cheng", + "Yuxuan Chen", + "Tan Chen", + "Mao Ye", + "Jingdong Wang", + "Siyu Zhu" + ], + "abstract": "Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models. 
However, a significant lack of high-quality, human-centric video datasets presents a challenge to progress in this field. To bridge this gap, we introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset characterized by precise and detailed captions that encompass both human appearance and motion states, along with supplementary human motion conditions, including skeleton sequences and speech audio. To validate the efficacy of this dataset and the associated training strategies, we propose an extension of existing classical diffusion transformer architectures and conduct further pretraining of our models on the proposed dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos while preserving performance in general video generation tasks. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs. Based on these insights and corresponding methodologies, the straightforward extended network trained on the proposed dataset demonstrates a clear improvement in the generation of human-centric videos. Project page https://fudan-generative-vision.github.io/OpenHumanVid", + "arxiv_url": "http://arxiv.org/abs/2412.00115v3", + "pdf_url": "http://arxiv.org/pdf/2412.00115v3", + "published_date": "2024-11-28", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers", + "authors": [ + "Sherwin Bahmani", + "Ivan Skorokhodov", + "Guocheng Qian", + "Aliaksandr Siarohin", + "Willi Menapace", + "Andrea Tagliasacchi", + "David B. Lindell", + "Sergey Tulyakov" + ], + "abstract": "Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers. In this work, we analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos is low-frequency in nature. This motivates us to adjust train and test pose conditioning schedules, accelerating training convergence while improving visual and motion quality. Then, by probing the representations of an unconditional video diffusion transformer, we observe that they implicitly perform camera pose estimation under the hood, and only a sub-portion of their layers contain the camera information. This suggested limiting the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to a 4x reduction in training parameters, improved training speed, and 10% higher visual quality. Finally, we complement the typical dataset for camera control learning with a curated dataset of 20K diverse dynamic videos with stationary cameras. This helps the model disambiguate camera and scene motion, and improves the dynamics of generated pose-conditioned videos. 
We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture, the new state-of-the-art model for generative video modeling with camera control.", + "arxiv_url": "http://arxiv.org/abs/2411.18673v2", + "pdf_url": "http://arxiv.org/pdf/2411.18673v2", + "published_date": "2024-11-27", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Prediction with Action: Visual Policy Learning via Joint Denoising Process", + "authors": [ + "Yanjiang Guo", + "Yucheng Hu", + "Jianke Zhang", + "Yen-Jen Wang", + "Xiaoyu Chen", + "Chaochao Lu", + "Jianyu Chen" + ], + "abstract": "Diffusion models have demonstrated remarkable capabilities in image generation tasks, including image editing and video creation, reflecting a good understanding of the physical world. In another line of work, diffusion models have also shown promise in robotic control tasks by denoising actions, known as diffusion policy. Although the diffusion generative model and diffusion policy exhibit distinct capabilities--image prediction and robotic action, respectively--they technically follow a similar denoising process. In robotic tasks, the ability to predict future images and generate actions is highly correlated since they share the same underlying dynamics of the physical world. Building on this insight, we introduce PAD, a novel visual policy learning framework that unifies image Prediction and robot Action within a joint Denoising process. Specifically, PAD utilizes Diffusion Transformers (DiT) to seamlessly integrate images and robot states, enabling the simultaneous prediction of future images and robot actions. Additionally, PAD supports co-training on both robotic demonstrations and large-scale video datasets and can be easily extended to other robotic modalities, such as depth images. PAD outperforms previous methods, achieving a significant 26.3% relative improvement on the full Metaworld benchmark, by utilizing a single text-conditioned visual policy within a data-efficient imitation learning setting. Furthermore, PAD demonstrates superior generalization to unseen tasks in real-world robot manipulation settings with a 28.0% success rate increase compared to the strongest baseline. Project page at https://sites.google.com/view/pad-paper", + "arxiv_url": "http://arxiv.org/abs/2411.18179v1", + "pdf_url": "http://arxiv.org/pdf/2411.18179v1", + "published_date": "2024-11-27", + "categories": [ + "cs.RO", + "cs.AI" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "image editing", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Type-R: Automatically Retouching Typos for Text-to-Image Generation", + "authors": [ + "Wataru Shimoda", + "Naoto Inoue", + "Daichi Haraguchi", + "Hayato Mitani", + "Seichi Uchida", + "Kota Yamaguchi" + ], + "abstract": "While recent text-to-image models can generate photorealistic images from text prompts that reflect detailed instructions, they still face significant challenges in accurately rendering words in the image. In this paper, we propose to retouch erroneous text renderings in the post-processing pipeline. Our approach, called Type-R, identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and finally corrects typos in the rendered words. 
Through extensive experiments, we show that Type-R, in combination with the latest text-to-image models such as Stable Diffusion or Flux, achieves the highest text rendering accuracy while maintaining image quality and also outperforms text-focused generation baselines in terms of balancing text accuracy and image quality.", + "arxiv_url": "http://arxiv.org/abs/2411.18159v1", + "pdf_url": "http://arxiv.org/pdf/2411.18159v1", + "published_date": "2024-11-27", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "text-to-image", + "image generation", + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Accelerating Vision Diffusion Transformers with Skip Branches", + "authors": [ + "Guanjie Chen", + "Xinyu Zhao", + "Yucheng Zhou", + "Tianlong Chen", + "Yu Cheng" + ], + "abstract": "Diffusion Transformers (DiT), an emerging image and video generation model architecture, has demonstrated great potential because of its high generation quality and scalability properties. Despite the impressive performance, its practical deployment is constrained by computational complexity and redundancy in the sequential denoising process. While feature caching across timesteps has proven effective in accelerating diffusion models, its application to DiT is limited by fundamental architectural differences from U-Net-based approaches. Through empirical analysis of DiT feature dynamics, we identify that significant feature variation between DiT blocks presents a key challenge for feature reusability. To address this, we convert standard DiT into Skip-DiT with skip branches to enhance feature smoothness. Further, we introduce Skip-Cache, which utilizes the skip branches to cache DiT features across timesteps at inference time. We validate the effectiveness of our proposal on different DiT backbones for video and image generation, showing that skip branches help preserve generation quality and achieve higher speedups. Experimental results indicate that Skip-DiT achieves a 1.5x speedup almost for free and a 2.2x speedup with only a minor reduction in quantitative metrics. Code is available at https://github.com/OpenSparseLLMs/Skip-DiT.git.", + "arxiv_url": "http://arxiv.org/abs/2411.17616v2", + "pdf_url": "http://arxiv.org/pdf/2411.17616v2", + "published_date": "2024-11-26", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/OpenSparseLLMs/Skip-DiT.git", + "keywords": [ + "diffusion transformer", + "image generation", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Identity-Preserving Text-to-Video Generation by Frequency Decomposition", + "authors": [ + "Shenghai Yuan", + "Jinfa Huang", + "Xianyi He", + "Yunyuan Ge", + "Yujun Shi", + "Liuhan Chen", + "Jiebo Luo", + "Li Yuan" + ], + "abstract": "Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving DiT-based control scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video. 
Inspired by prior findings in frequency analysis of diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features and high-frequency intrinsic features. First, from a low-frequency perspective, we introduce a global facial extractor, which encodes reference images and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into transformer blocks, enhancing the model's ability to preserve fine-grained features. We propose a hierarchical training strategy to leverage frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our ConsisID generates high-quality, identity-preserving videos, making strides towards more effective IPT2V.", + "arxiv_url": "http://arxiv.org/abs/2411.17440v2", + "pdf_url": "http://arxiv.org/pdf/2411.17440v2", + "published_date": "2024-11-26", + "categories": [ + "cs.CV", + "cs.MM" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Controllable", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis", + "authors": [ + "Haojie Zhang", + "Zhihao Liang", + "Ruibo Fu", + "Zhengqi Wen", + "Xuefei Liu", + "Chenxing Li", + "Jianhua Tao", + "Yaling Liang" + ], + "abstract": "Portrait image animation using audio has rapidly advanced, enabling the creation of increasingly realistic and expressive animated faces. The challenges of this multimodality-guided video generation task involve fusing various modalities while ensuring consistency in timing and portrait. We further seek to produce vivid talking heads. To address these challenges, we present LetsTalk (LatEnt Diffusion TranSformer for Talking Video Synthesis), a diffusion transformer that incorporates modular temporal and spatial attention mechanisms to merge multimodality and enhance spatial-temporal consistency. To handle multimodal conditions, we first summarize three fusion schemes, ranging from shallow to deep fusion compactness, and thoroughly explore their impact and applicability. Then we propose a suitable solution according to the modality differences of image, audio, and video generation. For portrait, we utilize a deep fusion scheme (Symbiotic Fusion) to ensure portrait consistency. For audio, we implement a shallow fusion scheme (Direct Fusion) to achieve audio-animation alignment while preserving diversity. 
Our extensive experiments demonstrate that our approach generates temporally coherent and realistic videos with enhanced diversity and liveliness.", + "arxiv_url": "http://arxiv.org/abs/2411.16748v1", + "pdf_url": "http://arxiv.org/pdf/2411.16748v1", + "published_date": "2024-11-24", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "OminiControl: Minimal and Universal Control for Diffusion Transformer", + "authors": [ + "Zhenxiong Tan", + "Songhua Liu", + "Xingyi Yang", + "Qiaochu Xue", + "Xinchao Wang" + ], + "abstract": "In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone and process them with its flexible multi-modal attention processors. Unlike existing methods, which rely heavily on additional encoder modules with complex architectures, OminiControl (1) effectively and efficiently incorporates injected image conditions with only ~0.1% additional parameters, and (2) addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions such as edges, depth, and more. Remarkably, these capabilities are achieved by training on images generated by the DiT itself, which is particularly beneficial for subject-driven generation. Extensive evaluations demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted models in both subject-driven and spatially-aligned conditional generation. Additionally, we release our training dataset, Subjects200K, a diverse collection of over 200,000 identity-consistent images, along with an efficient data synthesis pipeline to advance research in subject-consistent generation.", + "arxiv_url": "http://arxiv.org/abs/2411.15098v3", + "pdf_url": "http://arxiv.org/pdf/2411.15098v3", + "published_date": "2024-11-22", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads", + "authors": [ + "Yu Xu", + "Fan Tang", + "Juan Cao", + "Yuxin Zhang", + "Xiaoyu Kong", + "Jintao Li", + "Oliver Deussen", + "Tong-Yee Lee" + ], + "abstract": "Diffusion Transformers (DiTs) have exhibited robust capabilities in image generation tasks. However, accurate text-guided image editing for multimodal DiTs (MM-DiTs) still poses a significant challenge. Unlike UNet-based structures that could utilize self/cross-attention maps for semantic editing, MM-DiTs inherently lack support for explicit and consistent incorporated text guidance, resulting in semantic misalignment between the edited results and texts. In this study, we disclose the sensitivity of different attention heads to different image semantics within MM-DiTs and introduce HeadRouter, a training-free image editing framework that edits the source image by adaptively routing the text guidance to different attention heads in MM-DiTs. Furthermore, we present a dual-token refinement module to refine text/image token representations for precise semantic guidance and accurate region expression. 
Experimental results on multiple benchmarks demonstrate HeadRouter's performance in terms of editing fidelity and image quality.", + "arxiv_url": "http://arxiv.org/abs/2411.15034v1", + "pdf_url": "http://arxiv.org/pdf/2411.15034v1", + "published_date": "2024-11-22", + "categories": [ + "cs.CV", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image editing", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Stable Flow: Vital Layers for Training-Free Image Editing", + "authors": [ + "Omri Avrahami", + "Or Patashnik", + "Ohad Fried", + "Egor Nemchinov", + "Kfir Aberman", + "Dani Lischinski", + "Daniel Cohen-Or" + ], + "abstract": "Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT), and employed flow-matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike the UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, we propose an automatic method to identify \"vital layers\" within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications. The project page is available at https://omriavrahami.com/stable-flow", + "arxiv_url": "http://arxiv.org/abs/2411.14430v1", + "pdf_url": "http://arxiv.org/pdf/2411.14430v1", + "published_date": "2024-11-21", + "categories": [ + "cs.CV", + "cs.GR", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image editing", + "Control", + "inversion" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "TaQ-DiT: Time-aware Quantization for Diffusion Transformers", + "authors": [ + "Xinyan Liu", + "Huihong Shi", + "Yang Xu", + "Zhongfeng Wang" + ], + "abstract": "Transformer-based diffusion models, dubbed Diffusion Transformers (DiTs), have achieved state-of-the-art performance in image and video generation tasks. However, their large model size and slow inference speed limit their practical applications, calling for model compression methods such as quantization. Unfortunately, existing DiT quantization methods overlook (1) the impact of reconstruction and (2) the varying quantization sensitivities across different layers, which hinder their achievable performance. To tackle these issues, we propose innovative time-aware quantization for DiTs (TaQ-DiT). Specifically, (1) we observe a non-convergence issue when reconstructing weights and activations separately during quantization and introduce a joint reconstruction method to resolve this problem. (2) We discover that Post-GELU activations are particularly sensitive to quantization due to their significant variability across different denoising steps as well as extreme asymmetries and variations within each step. 
To address this, we propose time-variance-aware transformations to facilitate more effective quantization. Experimental results show that when quantizing DiTs' weights to 4-bit and activations to 8-bit (W4A8), our method significantly surpasses previous quantization methods.", + "arxiv_url": "http://arxiv.org/abs/2411.14172v1", + "pdf_url": "http://arxiv.org/pdf/2411.14172v1", + "published_date": "2024-11-21", + "categories": [ + "eess.IV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "PoM: Efficient Image and Video Generation with the Polynomial Mixer", + "authors": [ + "David Picard", + "Nicolas Dufour" + ], + "abstract": "Diffusion models based on Multi-Head Attention (MHA) have become ubiquitous to generate high quality images and videos. However, encoding an image or a video as a sequence of patches results in costly attention patterns, as the requirements both in terms of memory and compute grow quadratically. To alleviate this problem, we propose a drop-in replacement for MHA called the Polynomial Mixer (PoM) that has the benefit of encoding the entire sequence into an explicit state. PoM has a linear complexity with respect to the number of tokens. This explicit state also allows us to generate frames in a sequential fashion, minimizing memory and compute requirement, while still being able to train in parallel. We show the Polynomial Mixer is a universal sequence-to-sequence approximator, just like regular MHA. We adapt several Diffusion Transformers (DiT) for generating images and videos with PoM replacing MHA, and we obtain high quality samples while using less computational resources. The code is available at https://github.com/davidpicard/HoMM.", + "arxiv_url": "http://arxiv.org/abs/2411.12663v1", + "pdf_url": "http://arxiv.org/pdf/2411.12663v1", + "published_date": "2024-11-19", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG" + ], + "github_url": "https://github.com/davidpicard/HoMM", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method", + "authors": [ + "Yan Zheng", + "Zhenxiao Liang", + "Xiaoyan Cong", + "Lanqing guo", + "Yuehao Wang", + "Peihao Wang", + "Zhangyang Wang" + ], + "abstract": "We explore the oscillatory behavior observed in inversion methods applied to large-scale text-to-image diffusion models, with a focus on the \"Flux\" model. By employing a fixed-point-inspired iterative approach to invert real-world images, we observe that the solution does not achieve convergence, instead oscillating between distinct clusters. Through both toy experiments and real-world diffusion models, we demonstrate that these oscillating clusters exhibit notable semantic coherence. We offer theoretical insights, showing that this behavior arises from oscillatory dynamics in rectified flow models. Building on this understanding, we introduce a simple and fast distribution transfer technique that facilitates image enhancement, stroke-based recoloring, as well as visual prompt-guided image editing. Furthermore, we provide quantitative results demonstrating the effectiveness of our method for tasks such as image enhancement, makeup transfer, reconstruction quality, and guided sampling quality. 
Higher-quality examples of videos and images are available at \\href{https://yanyanzheng96.github.io/oscillation_inversion/}{this link}.", + "arxiv_url": "http://arxiv.org/abs/2411.11135v1", + "pdf_url": "http://arxiv.org/pdf/2411.11135v1", + "published_date": "2024-11-17", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "rectified flow", + "text-to-image", + "image editing", + "FLUX", + "inversion" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers", + "authors": [ + "Joseph Liu", + "Joshua Geddes", + "Ziyu Guo", + "Haomiao Jiang", + "Mahesh Kumar Nandwana" + ], + "abstract": "Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource-intensive attention and feed-forward modules. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer-wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves 8% to 71% speed up while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden the accessibility of powerful DiT models.", + "arxiv_url": "http://arxiv.org/abs/2411.10510v1", + "pdf_url": "http://arxiv.org/pdf/2411.10510v1", + "published_date": "2024-11-15", + "categories": [ + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing", + "authors": [ + "Zitao Shuai", + "Chenwei Wu", + "Zhengxu Tang", + "Bowen Song", + "Liyue Shen" + ], + "abstract": "Diffusion Transformers (DiTs) have recently achieved remarkable success in text-guided image generation. In image editing, DiTs project text and image inputs to a joint latent space, from which they decode and synthesize new images. However, it remains largely unexplored how multimodal information collectively forms this joint space and how they guide the semantics of the synthesized images. In this paper, we investigate the latent space of DiT models and uncover two key properties: First, DiT's latent space is inherently semantically disentangled, where different semantic attributes can be controlled by specific editing directions. Second, consistent semantic editing requires utilizing the entire joint latent space, as neither encoded image nor text alone contains enough semantic information. We show that these editing directions can be obtained directly from text prompts, enabling precise semantic control without additional training or mask annotations. Based on these insights, we propose a simple yet effective Encode-Identify-Manipulate (EIM) framework for zero-shot fine-grained image editing. 
Specifically, we first encode both the given source image and the text prompt that describes the image, to obtain the joint latent embedding. Then, using our proposed Hessian Score Distillation Sampling (HSDS) method, we identify editing directions that control specific target attributes while preserving other image features. These directions are guided by text prompts and used to manipulate the latent embeddings. Moreover, we propose a new metric to quantify the disentanglement degree of the latent space of diffusion models. Extensive experiment results on our new curated benchmark dataset and analysis demonstrate DiT's disentanglement properties and effectiveness of the EIM framework.", + "arxiv_url": "http://arxiv.org/abs/2411.08196v1", + "pdf_url": "http://arxiv.org/pdf/2411.08196v1", + "published_date": "2024-11-12", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "image editing", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Taming Rectified Flow for Inversion and Editing", + "authors": [ + "Jiangshan Wang", + "Junfu Pu", + "Zhongang Qi", + "Jiayi Guo", + "Yue Ma", + "Nisha Huang", + "Yuxin Chen", + "Xiu Li", + "Ying Shan" + ], + "abstract": "Rectified-flow-based diffusion transformers like FLUX and OpenSora have demonstrated outstanding performance in the field of image and video generation. Despite their robust generative capabilities, these models often struggle with inversion inaccuracies, which could further limit their effectiveness in downstream tasks such as image and video editing. To address this issue, we propose RF-Solver, a novel training-free sampler that effectively enhances inversion precision by mitigating the errors in the ODE-solving process of rectified flow. Specifically, we derive the exact formulation of the rectified flow ODE and apply the high-order Taylor expansion to estimate its nonlinear components, significantly enhancing the precision of ODE solutions at each timestep. Building upon RF-Solver, we further propose RF-Edit, a general feature-sharing-based framework for image and video editing. By incorporating self-attention features from the inversion process into the editing process, RF-Edit effectively preserves the structural information of the source image or video while achieving high-quality editing results. Our approach is compatible with any pre-trained rectified-flow-based models for image and video tasks, requiring no additional training or optimization. Extensive experiments across generation, inversion, and editing tasks in both image and video modalities demonstrate the superiority and versatility of our method. 
The source code is available at https://github.com/wangjiangshan0725/RF-Solver-Edit.", + "arxiv_url": "http://arxiv.org/abs/2411.04746v2", + "pdf_url": "http://arxiv.org/pdf/2411.04746v2", + "published_date": "2024-11-07", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/wangjiangshan0725/RF-Solver-Edit", + "keywords": [ + "rectified flow", + "video generation", + "diffusion transformer", + "video editing", + "FLUX", + "inversion" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "DiT4Edit: Diffusion Transformer for Image Editing", + "authors": [ + "Kunyu Feng", + "Yue Ma", + "Bingyuan Wang", + "Chenyang Qi", + "Haozhe Chen", + "Qifeng Chen", + "Zeyu Wang" + ], + "abstract": "Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) demonstrate superior capabilities to effectively capture the long-range dependencies among patches, leading to higher-quality image generation. In this paper, we propose DiT4Edit, the first Diffusion Transformer-based image editing framework. Specifically, DiT4Edit uses the DPM-Solver inversion algorithm to obtain the inverted latents, reducing the number of steps compared to the DDIM inversion algorithm commonly used in UNet-based frameworks. Additionally, we design unified attention control and patches merging, tailored for transformer computation streams. This integration allows our framework to generate higher-quality edited images faster. Our design leverages the advantages of DiT, enabling it to surpass UNet structures in image editing, especially in high-resolution and arbitrary-size images. Extensive experiments demonstrate the strong performance of DiT4Edit across various editing scenarios, highlighting the potential of Diffusion Transformers in supporting image editing.", + "arxiv_url": "http://arxiv.org/abs/2411.03286v2", + "pdf_url": "http://arxiv.org/pdf/2411.03286v2", + "published_date": "2024-11-05", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "image generation", + "diffusion transformer", + "image editing", + "Control", + "inversion" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Adaptive Caching for Faster Video Generation with Diffusion Transformers", + "authors": [ + "Kumara Kahatapitiya", + "Haozhe Liu", + "Sen He", + "Ding Liu", + "Menglin Jia", + "Chenyang Zhang", + "Michael S. Ryoo", + "Tian Xie" + ], + "abstract": "Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that \"not all videos are created equal\": meaning, some videos require fewer denoising steps to attain a reasonable quality than others. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. 
Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing the generation quality, across multiple video DiT baselines.", + "arxiv_url": "http://arxiv.org/abs/2411.02397v2", + "pdf_url": "http://arxiv.org/pdf/2411.02397v2", + "published_date": "2024-11-04", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Training-free Regional Prompting for Diffusion Transformers", + "authors": [ + "Anthony Chen", + "Jianjin Xu", + "Wenzhao Zheng", + "Gaole Dai", + "Yida Wang", + "Renrui Zhang", + "Haofan Wang", + "Shanghang Zhang" + ], + "abstract": "Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which equips DiT with fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.", + "arxiv_url": "http://arxiv.org/abs/2411.02395v1", + "pdf_url": "http://arxiv.org/pdf/2411.02395v1", + "published_date": "2024-11-04", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/antonioo-c/Regional-Prompting-FLUX", + "keywords": [ + "diffusion transformer", + "text-to-image", + "image generation", + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "GameGen-X: Interactive Open-world Game Video Generation", + "authors": [ + "Haoxuan Che", + "Xuanhua He", + "Quande Liu", + "Cheng Jin", + "Hao Chen" + ], + "abstract": "We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos. This model facilitates high-quality, open-domain generation by simulating an extensive array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation. To realize this vision, we first collected and built an Open-World Video Game Dataset from scratch. It is the first and largest dataset for open-world game video generation and control, which comprises over a million diverse gameplay video clips sampled from over 150 games with informative captions from GPT-4o. GameGen-X undergoes a two-stage training process, consisting of foundation model pre-training and instruction tuning. Firstly, the model was pre-trained via text-to-video generation and video continuation, endowing it with the capability for long-sequence, high-quality open-domain game video generation. 
Further, to achieve interactive controllability, we designed InstructNet to incorporate game-related multi-modal control signal experts. This allows the model to adjust latent representations based on user inputs, unifying character interaction and scene content control for the first time in video generation. During instruction tuning, only the InstructNet is updated while the pre-trained foundation model is frozen, enabling the integration of interactive controllability without loss of diversity and quality of generated video content.", + "arxiv_url": "http://arxiv.org/abs/2411.00769v3", + "pdf_url": "http://arxiv.org/pdf/2411.00769v3", + "published_date": "2024-11-01", + "categories": [ + "cs.CV", + "cs.AI" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "In-Context LoRA for Diffusion Transformers", + "authors": [ + "Lianghua Huang", + "Wei Wang", + "Zhi-Fan Wu", + "Yupeng Shi", + "Huanzhang Dou", + "Chen Liang", + "Yutong Feng", + "Yu Liu", + "Jingren Zhou" + ], + "abstract": "Recent research arXiv:2410.15027 has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through diverse task experiments, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., 20~100 samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at https://github.com/ali-vilab/In-Context-LoRA", + "arxiv_url": "http://arxiv.org/abs/2410.23775v3", + "pdf_url": "http://arxiv.org/pdf/2410.23775v3", + "published_date": "2024-10-31", + "categories": [ + "cs.CV", + "cs.GR" + ], + "github_url": "https://github.com/ali-vilab/In-Context-LoRA", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models", + "authors": [ + "Arash Marioriyad", + "Parham Rezaei", + "Mahdieh Soleymani Baghshah", + "Mohammad Hossein Rohban" + ], + "abstract": "Text-to-image (T2I) generative models, such as Stable Diffusion and DALL-E, have shown remarkable proficiency in producing high-quality, realistic, and natural images from textual descriptions. 
However, these models sometimes fail to accurately capture all the details specified in the input prompts, particularly concerning entities, attributes, and spatial relationships. This issue becomes more pronounced when the prompt contains novel or complex compositions, leading to what are known as compositional generation failure modes. Recently, a new open-source diffusion-based T2I model, FLUX, has been introduced, demonstrating strong performance in high-quality image generation. Additionally, autoregressive T2I models like LlamaGen have claimed competitive visual quality performance compared to diffusion-based models. In this study, we evaluate the compositional generation capabilities of these newly introduced models against established models using the T2I-CompBench benchmark. Our findings reveal that LlamaGen, as a vanilla autoregressive model, is not yet on par with state-of-the-art diffusion models for compositional generation tasks under the same criteria, such as model size and inference time. On the other hand, the open-source diffusion-based model FLUX exhibits compositional generation capabilities comparable to the state-of-the-art closed-source model DALL-E3.", + "arxiv_url": "http://arxiv.org/abs/2410.22775v1", + "pdf_url": "http://arxiv.org/pdf/2410.22775v1", + "published_date": "2024-10-30", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "text-to-image", + "image generation", + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Diff-Instruct*: Towards Human-Preferred One-step Text-to-image Generative Models", + "authors": [ + "Weijian Luo", + "Colin Zhang", + "Debing Zhang", + "Zhengyang Geng" + ], + "abstract": "In this paper, we introduce the Diff-Instruct* (DI*), an image data-free approach for building one-step text-to-image generative models that align with human preference while maintaining the ability to generate highly realistic images. We frame human preference alignment as online reinforcement learning using human feedback (RLHF), where the goal is to maximize the reward function while regularizing the generator distribution to remain close to a reference diffusion process. Unlike traditional RLHF approaches, which rely on the KL divergence for regularization, we introduce a novel score-based divergence regularization, which leads to significantly better performances. Although the direct calculation of this preference alignment objective remains intractable, we demonstrate that we can efficiently compute its gradient by deriving an equivalent yet tractable loss function. Remarkably, we used Diff-Instruct* to train a Stable Diffusion-XL-based 1-step model, the 2.6B DI*-SDXL-1step text-to-image model, which can generate images of a resolution of 1024x1024 with only 1 generation step. DI*-SDXL-1step model uses only 1.88% inference time and 29.30% GPU memory cost to outperform 12B FLUX-dev-50step significantly in PickScore, ImageReward, and CLIPScore on Parti prompt benchmark and HPSv2.1 on Human Preference Score benchmark, establishing a new state-of-the-art benchmark of human-preferred 1-step text-to-image generative models. Besides the strong quantitative performances, extensive qualitative comparisons also confirm the advantages of DI* in terms of maintaining diversity, improving image layouts, and enhancing aesthetic colors. 
We have released our industry-ready model on the homepage: \\url{https://github.com/pkulwj1994/diff_instruct_star}.", + "arxiv_url": "http://arxiv.org/abs/2410.20898v2", + "pdf_url": "http://arxiv.org/pdf/2410.20898v2", + "published_date": "2024-10-28", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG", + "cs.MM" + ], + "github_url": "https://github.com/pkulwj1994/diff_instruct_star", + "keywords": [ + "text-to-image", + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation", + "authors": [ + "Zongyi Li", + "Shujie Hu", + "Shujie Liu", + "Long Zhou", + "Jeongsoo Choi", + "Lingwei Meng", + "Xun Guo", + "Jinyu Li", + "Hefei Ling", + "Furu Wei" + ], + "abstract": "Text-to-video models have recently undergone rapid and substantial advancements. Nevertheless, due to limitations in data and computational resources, achieving efficient generation of long videos with rich motion dynamics remains a significant challenge. To generate high-quality, dynamic, and temporally consistent long videos, this paper presents ARLON, a novel framework that boosts diffusion Transformers with autoregressive models for long video generation, by integrating the coarse spatial and long-range temporal information provided by the AR model to guide the DiT model. Specifically, ARLON incorporates several key innovations: 1) A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens, bridging the AR and DiT models and balancing the learning complexity and information density; 2) An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model, ensuring effective guidance during video generation; 3) To enhance the tolerance capability of noise introduced from the AR inference, the DiT model is trained with coarser visual latent tokens incorporated with an uncertainty sampling module. Experimental results demonstrate that ARLON significantly outperforms the baseline OpenSora-V1.2 on eight out of eleven metrics selected from VBench, with notable improvements in dynamic degree and aesthetic quality, while delivering competitive results on the remaining three and simultaneously accelerating the generation process. In addition, ARLON achieves state-of-the-art performance in long video generation. Detailed analyses of the improvements in inference efficiency are presented, alongside a practical application that demonstrates the generation of long videos using progressive text prompts. See demos of ARLON at \\url{http://aka.ms/arlon}.", + "arxiv_url": "http://arxiv.org/abs/2410.20502v1", + "pdf_url": "http://arxiv.org/pdf/2410.20502v1", + "published_date": "2024-10-27", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation", + "authors": [ + "Phillip Y. Lee", + "Taehoon Yoon", + "Minhyuk Sung" + ], + "abstract": "We introduce GrounDiT, a novel training-free spatial grounding technique for text-to-image generation using Diffusion Transformers (DiT). Spatial grounding with bounding boxes has gained attention for its simplicity and versatility, allowing for enhanced user control in image generation. 
However, prior training-free approaches often rely on updating the noisy image during the reverse diffusion process via backpropagation from custom loss functions, which frequently struggle to provide precise control over individual bounding boxes. In this work, we leverage the flexibility of the Transformer architecture, demonstrating that DiT can generate noisy patches corresponding to each bounding box, fully encoding the target object and allowing for fine-grained control over each region. Our approach builds on an intriguing property of DiT, which we refer to as semantic sharing. Due to semantic sharing, when a smaller patch is jointly denoised alongside a generatable-size image, the two become semantic clones. Each patch is denoised in its own branch of the generation process and then transplanted into the corresponding region of the original noisy image at each timestep, resulting in robust spatial grounding for each bounding box. In our experiments on the HRS and DrawBench benchmarks, we achieve state-of-the-art performance compared to previous training-free approaches.", + "arxiv_url": "http://arxiv.org/abs/2410.20474v2", + "pdf_url": "http://arxiv.org/pdf/2410.20474v2", + "published_date": "2024-10-27", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model", + "authors": [ + "ZiDong Wang", + "Zeyu Lu", + "Di Huang", + "Cai Zhou", + "Wanli Ouyang", + "Lei Bai" + ], + "abstract": "\\textit{Nature is infinitely resolution-free}. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To address this limitation, we conceptualize images as sequences of tokens with dynamic sizes, rather than traditional methods that perceive images as fixed-resolution grids. This perspective enables a flexible training strategy that seamlessly accommodates various aspect ratios during both training and inference, thus promoting resolution generalization and eliminating biases introduced by image cropping. On this basis, we present the \\textbf{Flexible Vision Transformer} (FiT), a transformer architecture specifically designed for generating images with \\textit{unrestricted resolutions and aspect ratios}. We further upgrade the FiT to FiTv2 with several innovative designs, including the Query-Key vector normalization, the AdaLN-LoRA module, a rectified flow scheduler, and a Logit-Normal sampler. Enhanced by a meticulously adjusted network structure, FiTv2 exhibits $2\\times$ the convergence speed of FiT. When incorporating advanced training-free extrapolation techniques, FiTv2 demonstrates remarkable adaptability in both resolution extrapolation and diverse resolution generation. Additionally, our exploration of the scalability of the FiTv2 model reveals that larger models exhibit better computational efficiency. Furthermore, we introduce an efficient post-training strategy to adapt a pre-trained model for high-resolution generation. Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions. 
We have released all the codes and models at \\url{https://github.com/whlzy/FiT} to promote the exploration of diffusion transformer models for arbitrary-resolution image generation.", + "arxiv_url": "http://arxiv.org/abs/2410.13925v1", + "pdf_url": "http://arxiv.org/pdf/2410.13925v1", + "published_date": "2024-10-17", + "categories": [ + "cs.LG" + ], + "github_url": "https://github.com/whlzy/FiT", + "keywords": [ + "diffusion transformer", + "image generation", + "rectified flow" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Boosting Camera Motion Control for Video Diffusion Transformers", + "authors": [ + "Soon Yau Cheong", + "Duygu Ceylan", + "Armin Mustafa", + "Andrew Gilbert", + "Chun-Hao Paul Huang" + ], + "abstract": "Recent advancements in diffusion models have significantly enhanced the quality of video generation. However, fine-grained control over camera pose remains a challenge. While U-Net-based models have shown promising results for camera control, transformer-based diffusion models (DiT) - the preferred architecture for large-scale video generation - suffer from severe degradation in camera motion accuracy. In this paper, we investigate the underlying causes of this issue and propose solutions tailored to DiT architectures. Our study reveals that camera control performance depends heavily on the choice of conditioning method rather than on the camera pose representation, as is commonly believed. To address the persistent motion degradation in DiT, we introduce Camera Motion Guidance (CMG), based on classifier-free guidance, which boosts camera control by over 400%. Additionally, we present a sparse camera control pipeline, significantly simplifying the process of specifying camera poses for long videos. Our method universally applies to both U-Net and DiT models, offering improved camera control for video generation tasks.", + "arxiv_url": "http://arxiv.org/abs/2410.10802v1", + "pdf_url": "http://arxiv.org/pdf/2410.10802v1", + "published_date": "2024-10-14", + "categories": [ + "cs.CV", + "cs.AI" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations", + "authors": [ + "Litu Rout", + "Yujia Chen", + "Nataniel Ruiz", + "Constantine Caramanis", + "Sanjay Shakkottai", + "Wen-Sheng Chu" + ], + "abstract": "Generative models transform random noise into images; their inversion aims to transform images back to structured noise for recovery and editing. This paper addresses two key tasks: (i) inversion and (ii) editing of a real image using stochastic equivalents of rectified flow models (such as Flux). Although Diffusion Models (DMs) have recently dominated the field of generative modeling for images, their inversion presents faithfulness and editability challenges due to nonlinearities in drift and diffusion. Existing state-of-the-art DM inversion approaches rely on training of additional parameters or test-time optimization of latent variables; both are expensive in practice. Rectified Flows (RFs) offer a promising alternative to diffusion models, yet their inversion has been underexplored. We propose RF inversion using dynamic optimal control derived via a linear quadratic regulator. We prove that the resulting vector field is equivalent to a rectified stochastic differential equation. Additionally, we extend our framework to design a stochastic sampler for Flux. 
Our inversion method allows for state-of-the-art performance in zero-shot inversion and editing, outperforming prior works in stroke-to-image synthesis and semantic image editing, with large-scale human evaluations confirming user preference.", + "arxiv_url": "http://arxiv.org/abs/2410.10792v1", + "pdf_url": "http://arxiv.org/pdf/2410.10792v1", + "published_date": "2024-10-14", + "categories": [ + "cs.LG", + "cs.CV", + "stat.ML" + ], + "github_url": "", + "keywords": [ + "rectified flow", + "image editing", + "Control", + "FLUX", + "inversion" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Scaling Laws For Diffusion Transformers", + "authors": [ + "Zhengyang Liang", + "Hao He", + "Ceyuan Yang", + "Bo Dai" + ], + "abstract": "Diffusion transformers (DiT) have already achieved appealing synthesis and scaling properties in content recreation, e.g., image and video generation. However, scaling laws of DiT are less explored, which usually offer precise predictions regarding optimal model size and data requirements given a specific compute budget. Therefore, experiments across a broad range of compute budgets, from 1e17 to 6e18 FLOPs are conducted to confirm the existence of scaling laws in DiT for the first time. Concretely, the loss of pretraining DiT also follows a power-law relationship with the involved compute. Based on the scaling law, we can not only determine the optimal model size and required data but also accurately predict the text-to-image generation loss given a model with 1B parameters and a compute budget of 1e21 FLOPs. Additionally, we also demonstrate that the trend of pre-training loss matches the generation performances (e.g., FID), even across various datasets, which complements the mapping from compute to synthesis quality and thus provides a predictable benchmark that assesses model performance and data quality at a reduced cost.", + "arxiv_url": "http://arxiv.org/abs/2410.08184v1", + "pdf_url": "http://arxiv.org/pdf/2410.08184v1", + "published_date": "2024-10-10", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation", + "authors": [ + "Xinchen Zhang", + "Ling Yang", + "Guohao Li", + "Yaqi Cai", + "Jiake Xie", + "Yong Tang", + "Yujiu Yang", + "Mengdi Wang", + "Bin Cui" + ], + "abstract": "Advanced diffusion models like RPG, Stable Diffusion 3 and FLUX have made notable strides in compositional text-to-image generation. However, these methods typically exhibit distinct strengths for compositional generation, with some excelling in handling attribute binding and others in spatial relationships. This disparity highlights the need for an approach that can leverage the complementary strengths of various models to comprehensively improve the composition capability. To this end, we introduce IterComp, a novel framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation. Specifically, we curate a gallery of six powerful open-source diffusion models and evaluate their three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. 
Based on these metrics, we develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models. Then, we propose an iterative feedback learning method to enhance compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and reward models over multiple iterations. Theoretical proof demonstrates the effectiveness and extensive experiments show our significant superiority over previous SOTA methods (e.g., Omost and FLUX), particularly in multi-category object composition and complex semantic alignment. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation. Code: https://github.com/YangLing0818/IterComp", + "arxiv_url": "http://arxiv.org/abs/2410.07171v1", + "pdf_url": "http://arxiv.org/pdf/2410.07171v1", + "published_date": "2024-10-09", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/YangLing0818/IterComp", + "keywords": [ + "text-to-image", + "image generation", + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Pyramidal Flow Matching for Efficient Video Generative Modeling", + "authors": [ + "Yang Jin", + "Zhicheng Sun", + "Ningyuan Li", + "Kun Xu", + "Kun Xu", + "Hao Jiang", + "Nan Zhuang", + "Quzhe Huang", + "Yang Song", + "Yadong Mu", + "Zhouchen Lin" + ], + "abstract": "Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models will be open-sourced at https://pyramid-flow.github.io.", + "arxiv_url": "http://arxiv.org/abs/2410.05954v1", + "pdf_url": "http://arxiv.org/pdf/2410.05954v1", + "published_date": "2024-10-08", + "categories": [ + "cs.CV", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Accelerating Diffusion Transformers with Token-wise Feature Caching", + "authors": [ + "Chang Zou", + "Xuyang Liu", + "Ting Liu", + "Siteng Huang", + "Linfeng Zhang" + ], + "abstract": "Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching the features in previous timesteps and reusing them in the following timesteps. 
However, previous caching methods ignore that different tokens exhibit different sensitivities to feature caching, and feature caching on some tokens may lead to 10$\\times$ more destruction to the overall generation quality compared with other tokens. In this paper, we introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching, and further enabling us to apply different caching ratios to neural layers of different types and depths. Extensive experiments on PixArt-$\\alpha$, OpenSora, and DiT demonstrate our effectiveness in both image and video generation with no requirements for training. For instance, 2.36$\\times$ and 1.93$\\times$ acceleration are achieved on OpenSora and PixArt-$\\alpha$ with almost no drop in generation quality.", + "arxiv_url": "http://arxiv.org/abs/2410.05317v3", + "pdf_url": "http://arxiv.org/pdf/2410.05317v3", + "published_date": "2024-10-05", + "categories": [ + "cs.LG", + "cs.AI", + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Dynamic Diffusion Transformer", + "authors": [ + "Wangbo Zhao", + "Yizeng Han", + "Jiasheng Tang", + "Kai Wang", + "Yibing Song", + "Gao Huang", + "Fan Wang", + "Yang You" + ], + "abstract": "Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To address this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. Extensive experiments on various datasets and different-sized models verify the superiority of DyDiT. Notably, with <3% additional fine-tuning iterations, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73$\\times$, and achieves a competitive FID score of 2.07 on ImageNet. The code is publicly available at https://github.com/NUS-HPC-AI-Lab/Dynamic-Diffusion-Transformer.", + "arxiv_url": "http://arxiv.org/abs/2410.03456v2", + "pdf_url": "http://arxiv.org/pdf/2410.03456v2", + "published_date": "2024-10-04", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing", + "authors": [ + "Haotian Sun", + "Tao Lei", + "Bowen Zhang", + "Yanghao Li", + "Haoshuo Huang", + "Ruoming Pang", + "Bo Dai", + "Nan Du" + ], + "abstract": "Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generations, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. 
EC-DIT learns to adaptively optimize the compute allocated to understand the input texts and generate the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters and achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional MoE models. Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% and still maintain competitive inference speed with intuitive interpretability.", + "arxiv_url": "http://arxiv.org/abs/2410.02098v2", + "pdf_url": "http://arxiv.org/pdf/2410.02098v2", + "published_date": "2024-10-02", + "categories": [ + "cs.CV", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Effective Diffusion Transformer Architecture for Image Super-Resolution", + "authors": [ + "Kun Cheng", + "Lei Yu", + "Zhijun Tu", + "Xiao He", + "Liyu Chen", + "Yong Guo", + "Mingrui Zhu", + "Nannan Wang", + "Xinbo Gao", + "Jie Hu" + ], + "abstract": "Recent advances indicate that diffusion models hold great promise in image super-resolution. While the latest methods are primarily based on latent diffusion models with convolutional neural networks, there are few attempts to explore transformers, which have demonstrated remarkable performance in image generation. In this work, we design an effective diffusion transformer for image super-resolution (DiT-SR) that achieves the visual quality of prior-based methods, but through a training-from-scratch manner. In practice, DiT-SR leverages an overall U-shaped architecture, and adopts a uniform isotropic design for all the transformer blocks across different stages. The former facilitates multi-scale hierarchical feature extraction, while the latter reallocates the computational resources to critical layers to further enhance performance. Moreover, we thoroughly analyze the limitation of the widely used AdaLN, and present a frequency-adaptive time-step conditioning module, enhancing the model's capacity to process distinct frequency information at different time steps. 
Extensive experiments demonstrate that DiT-SR outperforms the existing training-from-scratch diffusion-based SR methods significantly, and even beats some of the prior-based methods on pretrained Stable Diffusion, proving the superiority of the diffusion transformer in image super-resolution.", + "arxiv_url": "http://arxiv.org/abs/2409.19589v1", + "pdf_url": "http://arxiv.org/pdf/2409.19589v1", + "published_date": "2024-09-29", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image super-resolution", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions", + "authors": [ + "Weifeng Lin", + "Xinyu Wei", + "Renrui Zhang", + "Le Zhuo", + "Shitian Zhao", + "Siyuan Huang", + "Junlin Xie", + "Yu Qiao", + "Peng Gao", + "Hongsheng Li" + ], + "abstract": "This paper presents a versatile image-to-image visual assistant, PixWizard, designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we comprehensively include a large set of diverse vision tasks such as text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any-resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to facilitate effective fusion of information from the input image. Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions. The code and related resources are available at https://github.com/AFeng-x/PixWizard", + "arxiv_url": "http://arxiv.org/abs/2409.15278v2", + "pdf_url": "http://arxiv.org/pdf/2409.15278v2", + "published_date": "2024-09-23", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/AFeng-x/PixWizard", + "keywords": [ + "Controllable", + "image generation", + "text-to-image", + "diffusion transformer", + "image editing", + "Control" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "LoVA: Long-form Video-to-Audio Generation", + "authors": [ + "Xin Cheng", + "Xihua Wang", + "Yihan Wu", + "Yuyue Wang", + "Ruihua Song" + ], + "abstract": "Video-to-audio (V2A) generation is important for video editing and post-processing, enabling the creation of semantics-aligned audio for silent video. However, most existing methods focus on generating short-form audio for short video segments (less than 10 seconds), while giving little attention to the scenario of long-form video inputs. For current UNet-based diffusion V2A models, an inevitable problem when handling long-form audio generation is the inconsistencies within the final concatenated audio. In this paper, we first highlight the importance of the long-form V2A problem. 
Besides, we propose LoVA, a novel model for Long-form Video-to-Audio generation. Based on the Diffusion Transformer (DiT) architecture, LoVA proves to be more effective at generating long-form audio compared to existing autoregressive models and UNet-based diffusion models. Extensive objective and subjective experiments demonstrate that LoVA achieves comparable performance on 10-second V2A benchmark and outperforms all other baselines on a benchmark with long-form video input.", + "arxiv_url": "http://arxiv.org/abs/2409.15157v2", + "pdf_url": "http://arxiv.org/pdf/2409.15157v2", + "published_date": "2024-09-23", + "categories": [ + "cs.SD", + "cs.MM", + "eess.AS" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video editing" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "EDGE-Rec: Efficient and Data-Guided Edge Diffusion For Recommender Systems Graphs", + "authors": [ + "Utkarsh Priyam", + "Hemit Shah", + "Edoardo Botta" + ], + "abstract": "Most recommender systems research focuses on binary historical user-item interaction encodings to predict future interactions. User features, item features, and interaction strengths remain largely under-utilized in this space or only indirectly utilized, despite proving largely effective in large-scale production recommendation systems. We propose a new attention mechanism, loosely based on the principles of collaborative filtering, called Row-Column Separable Attention RCSA to take advantage of real-valued interaction weights as well as user and item features directly. Building on this mechanism, we additionally propose a novel Graph Diffusion Transformer GDiT architecture which is trained to iteratively denoise the weighted interaction matrix of the user-item interaction graph directly. The weighted interaction matrix is built from the bipartite structure of the user-item interaction graph and corresponding edge weights derived from user-item rating interactions. Inspired by the recent progress in text-conditioned image generation, our method directly produces user-item rating predictions on the same scale as the original ratings by conditioning the denoising process on user and item features with a principled approach.", + "arxiv_url": "http://arxiv.org/abs/2409.14689v1", + "pdf_url": "http://arxiv.org/pdf/2409.14689v1", + "published_date": "2024-09-23", + "categories": [ + "cs.IR", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task", + "authors": [ + "Jing Wang", + "Ao Ma", + "Jiasong Feng", + "Dawei Leng", + "Yuhui Yin", + "Xiaodan Liang" + ], + "abstract": "The global self-attention mechanism in diffusion transformers involves redundant computation due to the sparse and redundant nature of visual information, and the attention map of tokens within a spatial window shows significant similarity. To address this redundancy, we propose the Proxy-Tokenized Diffusion Transformer (PT-DiT), which employs sparse representative token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, within each transformer block, we compute an averaging token from each spatial-temporal window to serve as a proxy token for that region. 
The global semantics are captured through the self-attention of these proxy tokens and then injected into all latent tokens via cross-attention. Simultaneously, we introduce window and shift window attention to address the limitations in detail modeling caused by the sparse attention mechanism. Building on the well-designed PT-DiT, we further develop the Qihoo-T2X family, which includes a variety of models for T2I, T2V, and T2MV tasks. Experimental results show that PT-DiT achieves competitive performance while reducing the computational complexity in both image and video generation tasks (e.g., a 49% reduction compared to DiT and a 34% reduction compared to PixArt-$\\alpha$). The visual exhibition and source code of Qihoo-T2X are available at https://360cvgroup.github.io/Qihoo-T2X/.", + "arxiv_url": "http://arxiv.org/abs/2409.04005v2", + "pdf_url": "http://arxiv.org/pdf/2409.04005v2", + "published_date": "2024-09-06", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "DiVE: DiT-based Video Generation with Enhanced Control", + "authors": [ + "Junpeng Jiang", + "Gangyi Hong", + "Lijun Zhou", + "Enhui Ma", + "Hengtong Hu", + "Xia Zhou", + "Jie Xiang", + "Fan Liu", + "Kaicheng Yu", + "Haiyang Sun", + "Kun Zhan", + "Peng Jia", + "Miao Zhang" + ], + "abstract": "Generating high-fidelity, temporally consistent videos in autonomous driving scenarios faces a significant challenge, e.g. problematic maneuvers in corner cases. Although recent video generation works built on top of Diffusion Transformers (DiT) have been proposed to tackle this problem, works that explore the potential of multi-view video generation scenarios are still missing. Noticeably, we propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos which precisely match the given bird's-eye view layout control. Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee the cross-view consistency, where joint cross-attention modules and ControlNet-Transformer are integrated to further improve the precision of control. To demonstrate our advantages, we extensively investigate the qualitative comparisons on the nuScenes dataset, particularly in some of the most challenging corner cases. In summary, our proposed method proves effective in producing long, controllable, and highly consistent videos under difficult conditions.", + "arxiv_url": "http://arxiv.org/abs/2409.01595v1", + "pdf_url": "http://arxiv.org/pdf/2409.01595v1", + "published_date": "2024-09-03", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Controllable", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers", + "authors": [ + "Juncan Deng", + "Shuaiting Li", + "Zeyu Wang", + "Hong Gu", + "Kedong Xu", + "Kejie Huang" + ], + "abstract": "Diffusion Transformer models (DiTs) have transitioned the network architecture from traditional UNets to transformers, demonstrating exceptional capabilities in image generation. Although DiTs have been widely applied to high-definition video generation tasks, their large parameter size hinders inference on edge devices. 
Vector quantization (VQ) can decompose model weight into a codebook and assignments, allowing extreme weight quantization and significantly reducing memory usage. In this paper, we propose VQ4DiT, a fast post-training vector quantization method for DiTs. We found that traditional VQ methods calibrate only the codebook without calibrating the assignments. This leads to weight sub-vectors being incorrectly assigned to the same assignment, providing inconsistent gradients to the codebook and resulting in a suboptimal result. To address this challenge, VQ4DiT calculates the candidate assignment set for each weight sub-vector based on Euclidean distance and reconstructs the sub-vector based on the weighted average. Then, using the zero-data and block-wise calibration method, the optimal assignment from the set is efficiently selected while calibrating the codebook. VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU within 20 minutes to 5 hours depending on the different quantization settings. Experiments show that VQ4DiT establishes a new state-of-the-art in model size and performance trade-offs, quantizing weights to 2-bit precision while retaining acceptable image generation quality.", + "arxiv_url": "http://arxiv.org/abs/2408.17131v1", + "pdf_url": "http://arxiv.org/pdf/2408.17131v1", + "published_date": "2024-08-30", + "categories": [ + "cs.CV", + "cs.AI", + "I.2; I.4" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Alfie: Democratising RGBA Image Generation With No $$$", + "authors": [ + "Fabio Quattrini", + "Vittorio Pippi", + "Silvia Cascianelli", + "Rita Cucchiara" + ], + "abstract": "Designs and artworks are ubiquitous across various creative fields, requiring graphic design skills and dedicated software to create compositions that include many graphical elements, such as logos, icons, symbols, and art scenes, which are integral to visual storytelling. Automating the generation of such visual elements improves graphic designers' productivity, democratizes and innovates the creative industry, and helps generate more realistic synthetic data for related tasks. These illustration elements are mostly RGBA images with irregular shapes and cutouts, facilitating blending and scene composition. However, most image generation models are incapable of generating such images and achieving this capability requires expensive computational resources, specific training recipes, or post-processing solutions. In this work, we propose a fully-automated approach for obtaining RGBA illustrations by modifying the inference-time behavior of a pre-trained Diffusion Transformer model, exploiting the prompt-guided controllability and visual quality offered by such models with no additional computational cost. We force the generation of entire subjects without sharp croppings, whose background is easily removed for seamless integration into design projects or artistic scenes. We show with a user study that, in most cases, users prefer our solution over generating and then matting an image, and we show that our generated illustrations yield good results when used as inputs for composite scene generation pipelines. 
We release the code at https://github.com/aimagelab/Alfie.", + "arxiv_url": "http://arxiv.org/abs/2408.14826v1", + "pdf_url": "http://arxiv.org/pdf/2408.14826v1", + "published_date": "2024-08-27", + "categories": [ + "cs.CV", + "cs.MM" + ], + "github_url": "https://github.com/aimagelab/Alfie", + "keywords": [ + "diffusion transformer", + "Control", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing", + "authors": [ + "Zitao Shuai", + "Chenwei Wu", + "Zhengxu Tang", + "Bowen Song", + "Liyue Shen" + ], + "abstract": "Diffusion Transformers (DiTs) have achieved remarkable success in diverse and high-quality text-to-image (T2I) generation. However, how text and image latents individually and jointly contribute to the semantics of generated images remains largely unexplored. Through our investigation of DiT's latent space, we have uncovered key findings that unlock the potential for zero-shot fine-grained semantic editing: (1) Both the text and image spaces in DiTs are inherently decomposable. (2) These spaces collectively form a disentangled semantic representation space, enabling precise and fine-grained semantic control. (3) Effective image editing requires the combined use of both text and image latent spaces. Leveraging these insights, we propose a simple and effective Extract-Manipulate-Sample (EMS) framework for zero-shot fine-grained image editing. Our approach first utilizes a multi-modal Large Language Model to convert input images and editing targets into text descriptions. We then linearly manipulate text embeddings based on the desired editing degree and employ constrained score distillation sampling to manipulate image embeddings. We quantify the disentanglement degree of the latent space of diffusion models by proposing a new metric. To evaluate fine-grained editing performance, we introduce a comprehensive benchmark incorporating human annotations, manual evaluation, and automatic metrics. We have conducted extensive experiments and in-depth analysis to thoroughly uncover the semantic disentanglement properties of the diffusion transformer, as well as the effectiveness of our proposed method. Our annotated benchmark dataset is publicly available at https://anonymous.com/anonymous/EMS-Benchmark, facilitating reproducible research in this domain.", + "arxiv_url": "http://arxiv.org/abs/2408.13335v1", + "pdf_url": "http://arxiv.org/pdf/2408.13335v1", + "published_date": "2024-08-23", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image editing", + "Control", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations", + "authors": [ + "Can Qin", + "Congying Xia", + "Krithika Ramakrishnan", + "Michael Ryoo", + "Lifu Tu", + "Yihao Feng", + "Manli Shu", + "Honglu Zhou", + "Anas Awadalla", + "Jun Wang", + "Senthil Purushwalkam", + "Le Xue", + "Yingbo Zhou", + "Huan Wang", + "Silvio Savarese", + "Juan Carlos Niebles", + "Zeyuan Chen", + "Ran Xu", + "Caiming Xiong" + ], + "abstract": "We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). 
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We have devised a data processing pipeline from the very beginning and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports over 14-second 720p video generation in an end-to-end way and demonstrates competitive performance against state-of-the-art T2V models.", + "arxiv_url": "http://arxiv.org/abs/2408.12590v2", + "pdf_url": "http://arxiv.org/pdf/2408.12590v2", + "published_date": "2024-08-22", + "categories": [ + "cs.CV", + "cs.AI" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "MedDiT: A Knowledge-Controlled Diffusion Transformer Framework for Dynamic Medical Image Generation in Virtual Simulated Patient", + "authors": [ + "Yanzeng Li", + "Cheng Zeng", + "Jinchao Zhang", + "Jie Zhou", + "Lei Zou" + ], + "abstract": "Medical education relies heavily on Simulated Patients (SPs) to provide a safe environment for students to practice clinical skills, including medical image analysis. However, the high cost of recruiting qualified SPs and the lack of diverse medical imaging datasets have presented significant challenges. To address these issues, this paper introduces MedDiT, a novel knowledge-controlled conversational framework that can dynamically generate plausible medical images aligned with simulated patient symptoms, enabling diverse diagnostic skill training. Specifically, MedDiT integrates various patient Knowledge Graphs (KGs), which describe the attributes and symptoms of patients, to dynamically prompt Large Language Models' (LLMs) behavior and control the patient characteristics, mitigating hallucination during medical conversation. Additionally, a well-tuned Diffusion Transformer (DiT) model is incorporated to generate medical images according to the specified patient attributes in the KG. In this paper, we present the capabilities of MedDiT through a practical demonstration, showcasing its ability to act in diverse simulated patient cases and generate the corresponding medical images. This can provide an abundant and interactive learning experience for students, advancing medical education by offering an immersive simulation platform for future healthcare professionals. 
The work sheds light on the feasibility of incorporating advanced technologies like LLM, KG, and DiT in education applications, highlighting their potential to address the challenges faced in simulated patient-based medical education.", + "arxiv_url": "http://arxiv.org/abs/2408.12236v1", + "pdf_url": "http://arxiv.org/pdf/2408.12236v1", + "published_date": "2024-08-22", + "categories": [ + "cs.AI" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer", + "authors": [ + "Zhuoyi Yang", + "Jiayan Teng", + "Wendi Zheng", + "Ming Ding", + "Shiyu Huang", + "Jiazheng Xu", + "Yuanming Yang", + "Wenyi Hong", + "Xiaohan Zhang", + "Guanyu Feng", + "Da Yin", + "Xiaotao Gu", + "Yuxuan Zhang", + "Weihan Wang", + "Yean Cheng", + "Ting Liu", + "Bin Xu", + "Yuxiao Dong", + "Jie Tang" + ], + "abstract": "We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer, which can generate 10-second continuous videos aligned with the text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels. Previous video generation models often had limited movement and short durations, and it was difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, to improve both compression rate and video fidelity. Second, to improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. Third, by employing a progressive training and multi-resolution frame pack technique, CogVideoX is adept at producing coherent, long-duration videos of different shapes, characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, greatly contributing to the generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across multiple machine metrics and human evaluations. The model weights of the 3D Causal VAE, the video captioning model, and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.", + "arxiv_url": "http://arxiv.org/abs/2408.06072v2", + "pdf_url": "http://arxiv.org/pdf/2408.06072v2", + "published_date": "2024-08-12", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/THUDM/CogVideo", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion", + "authors": [ + "Xingguang Yan", + "Han-Hung Lee", + "Ziyu Wan", + "Angel X. Chang" + ], + "abstract": "We introduce a new approach for generating realistic 3D models with UV maps through a representation termed \"Object Images.\" This approach encapsulates surface geometry, appearance, and patch structures within a 64x64 pixel image, effectively converting complex 3D shapes into a more manageable 2D format. By doing so, we address the challenges of both geometric and semantic irregularity inherent in polygonal meshes. This method allows us to use image generation models, such as Diffusion Transformers, directly for 3D shape generation. 
Evaluated on the ABO dataset, our generated shapes with patch structures achieve point cloud FID comparable to recent 3D generative models, while naturally supporting PBR material generation.", + "arxiv_url": "http://arxiv.org/abs/2408.03178v1", + "pdf_url": "http://arxiv.org/pdf/2408.03178v1", + "published_date": "2024-08-06", + "categories": [ + "cs.CV", + "cs.GR", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Tora: Trajectory-oriented Diffusion Transformer for Video Generation", + "authors": [ + "Zhenghao Zhang", + "Junchao Liao", + "Menghao Li", + "Zuozhuo Dai", + "Bingxue Qiu", + "Siyu Zhu", + "Long Qin", + "Weizhi Wang" + ], + "abstract": "Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that concurrently integrates textual, visual, and trajectory conditions, thereby enabling scalable video generation with effective motion guidance. Specifically, Tora consists of a Trajectory Extractor(TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser(MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos that accurately follow designated trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video content's dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora's excellence in achieving high motion fidelity, while also meticulously simulating the intricate movement of the physical world. Code is available at: https://github.com/alibaba/Tora.", + "arxiv_url": "http://arxiv.org/abs/2407.21705v3", + "pdf_url": "http://arxiv.org/pdf/2407.21705v3", + "published_date": "2024-07-31", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/alibaba/Tora", + "keywords": [ + "diffusion transformer", + "Controllable", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls", + "authors": [ + "Yuxuan Bian", + "Ailing Zeng", + "Xuan Ju", + "Xian Liu", + "Zhaoyang Zhang", + "Wei Liu", + "Qiang Xu" + ], + "abstract": "Whole-body multimodal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to achieve various generation tasks with different condition modalities presents two main challenges: motion distribution drifts across different tasks (e.g., co-speech gestures and text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). Additionally, inconsistent motion formats across different tasks and datasets hinder effective training toward multimodal motion generation. In this paper, we propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. 
Our framework employs a coarse-to-fine training strategy, starting with the first stage of text-to-motion semantic pre-training, followed by the second stage of multimodal low-level control adaptation to handle conditions of varying granularities. To effectively learn and transfer motion knowledge across different distributions, we design MC-Attn for parallel modeling of static and dynamic human topology graphs. To overcome the motion format inconsistency of existing benchmarks, we introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format. Extensive experiments show that MotionCraft achieves state-of-the-art performance on various standard motion generation tasks.", + "arxiv_url": "http://arxiv.org/abs/2407.21136v3", + "pdf_url": "http://arxiv.org/pdf/2407.21136v3", + "published_date": "2024-07-30", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Diffusion Transformer Captures Spatial-Temporal Dependencies: A Theory for Gaussian Process Data", + "authors": [ + "Hengyu Fu", + "Zehao Dou", + "Jiawei Guo", + "Mengdi Wang", + "Minshuo Chen" + ], + "abstract": "Diffusion Transformer, the backbone of Sora for video generation, successfully scales the capacity of diffusion models, pioneering new avenues for high-fidelity sequential data generation. Unlike static data such as images, sequential data consists of consecutive data frames indexed by time, exhibiting rich spatial and temporal dependencies. These dependencies represent the underlying dynamic model and are critical to validate the generated data. In this paper, we make the first theoretical step towards bridging diffusion transformers for capturing spatial-temporal dependencies. Specifically, we establish score approximation and distribution estimation guarantees of diffusion transformers for learning Gaussian process data with covariance functions of various decay patterns. We highlight how the spatial-temporal dependencies are captured and affect learning efficiency. Our study proposes a novel transformer approximation theory, where the transformer acts to unroll an algorithm. We support our theoretical results by numerical experiments, providing strong evidence that spatial-temporal dependencies are captured within attention layers, aligning with our approximation theory.", + "arxiv_url": "http://arxiv.org/abs/2407.16134v1", + "pdf_url": "http://arxiv.org/pdf/2407.16134v1", + "published_date": "2024-07-23", + "categories": [ + "cs.LG", + "math.ST", + "stat.ML", + "stat.TH" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Anchored Diffusion for Video Face Reenactment", + "authors": [ + "Idan Kligvasser", + "Regev Cohen", + "George Leifman", + "Ehud Rivlin", + "Michael Elad" + ], + "abstract": "Video generation has drawn significant interest recently, pushing the development of large-scale models capable of producing realistic videos with coherent motion. Due to memory constraints, these models typically generate short video segments that are then combined into long videos. The merging process poses a significant challenge, as it requires ensuring smooth transitions and overall consistency. In this paper, we introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. 
We extend Diffusion Transformers (DiTs) to incorporate temporal information, creating our sequence-DiT (sDiT) model for generating short video segments. Unlike previous works, we train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance, increasing flexibility and allowing it to capture both short and long-term relationships. Furthermore, during inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame, ensuring consistency regardless of temporal distance. To demonstrate our method, we focus on face reenactment, the task of creating a video from a source image that replicates the facial expressions and movements from a driving video. Through comprehensive experiments, we show our approach outperforms current techniques in producing longer, consistent, high-quality videos while offering editing capabilities.", + "arxiv_url": "http://arxiv.org/abs/2407.15153v1", + "pdf_url": "http://arxiv.org/pdf/2407.15153v1", + "published_date": "2024-07-21", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control", + "authors": [ + "Sherwin Bahmani", + "Ivan Skorokhodov", + "Aliaksandr Siarohin", + "Willi Menapace", + "Guocheng Qian", + "Michael Vasilkovsky", + "Hsin-Ying Lee", + "Chaoyang Wang", + "Jiaxu Zou", + "Andrea Tagliasacchi", + "David B. Lindell", + "Sergey Tulyakov" + ], + "abstract": "Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of complex videos from a text description. However, most existing models lack fine-grained control over camera movement, which is critical for downstream applications related to content creation, visual effects, and 3D vision. Recently, new methods demonstrate the ability to generate videos with controllable camera poses; these techniques leverage pre-trained U-Net-based diffusion models that explicitly disentangle spatial and temporal generation. Still, no existing approach enables camera control for new, transformer-based video diffusion models that process spatial and temporal information jointly. Here, we propose to tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism that incorporates spatiotemporal camera embeddings based on Plucker coordinates. The approach demonstrates state-of-the-art performance for controllable video generation after fine-tuning on the RealEstate10K dataset. To the best of our knowledge, our work is the first to enable camera control for transformer-based video diffusion models.", + "arxiv_url": "http://arxiv.org/abs/2407.12781v2", + "pdf_url": "http://arxiv.org/pdf/2407.12781v2", + "published_date": "2024-07-17", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Controllable", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Scaling Diffusion Transformers to 16 Billion Parameters", + "authors": [ + "Zhengcong Fei", + "Mingyuan Fan", + "Changqian Yu", + "Debang Li", + "Junshi Huang" + ], + "abstract": "In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer, that is scalable and competitive with dense networks while exhibiting highly optimized inference. 
The DiT-MoE includes two simple designs: shared expert routing and expert-level balance loss, thereby capturing common knowledge and reducing redundancy among the different routed experts. When applied to conditional image generation, a deep analysis of expert specialization yields some interesting observations: (i) Expert selection shows a preference for spatial position and denoising time step, while being insensitive to different class-conditional information; (ii) As the MoE layers go deeper, the selection of experts gradually shifts from specific spatial positions to dispersion and balance; (iii) Expert specialization tends to be more concentrated at the early time step and then gradually uniform after half. We attribute it to the diffusion process that first models the low-frequency spatial information and then high-frequency complex information. Based on the above guidance, a series of DiT-MoE experimentally achieves performance on par with dense networks yet requires much less computational load during inference. More encouragingly, we demonstrate the potential of DiT-MoE with synthesized image data, scaling the diffusion model to 16.5B parameters and attaining a new SoTA FID-50K score of 1.80 in the 512$\\times$512 resolution setting. The project page: https://github.com/feizc/DiT-MoE.", + "arxiv_url": "http://arxiv.org/abs/2407.11633v3", + "pdf_url": "http://arxiv.org/pdf/2407.11633v3", + "published_date": "2024-07-16", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/feizc/DiT-MoE", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation", + "authors": [ + "Kepan Nan", + "Rui Xie", + "Penghao Zhou", + "Tiehan Fan", + "Zhenheng Yang", + "Zhijie Chen", + "Xiang Li", + "Jian Yang", + "Ying Tai" + ], + "abstract": "Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) Lacking a precise open sourced high-quality dataset. The previous popular video datasets, e.g. WebVid-10M and Panda-70M, are either of low quality or too large for most research institutions. Therefore, it is challenging but crucial to collect precise, high-quality text-video pairs for T2V generation. 2) Failure to fully utilize textual information. Recent T2V methods have focused on vision transformers, using a simple cross attention module for video generation, which falls short of thoroughly extracting semantic information from the text prompt. To address these issues, we introduce OpenVid-1M, a precise high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structure information from visual tokens and semantic information from text tokens. 
Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.", + "arxiv_url": "http://arxiv.org/abs/2407.02371v2", + "pdf_url": "http://arxiv.org/pdf/2407.02371v2", + "published_date": "2024-07-02", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers", + "authors": [ + "Lei Chen", + "Yuan Meng", + "Chen Tang", + "Xinzhu Ma", + "Jingyan Jiang", + "Xin Wang", + "Zhi Wang", + "Wenwu Zhu" + ], + "abstract": "Recent advancements in diffusion models, particularly the architectural transformation from UNet-based models to Diffusion Transformers (DiTs), significantly improve the quality and scalability of image and video generation. However, despite their impressive capabilities, the substantial computational costs of these large-scale models pose significant challenges for real-world deployment. Post-Training Quantization (PTQ) emerges as a promising solution, enabling model compression and accelerated inference for pretrained models, without the costly retraining. However, research on DiT quantization remains sparse, and existing PTQ frameworks, primarily designed for traditional diffusion models, tend to suffer from biased quantization, leading to notable performance degradation. In this work, we identify that DiTs typically exhibit significant spatial variance in both weights and activations, along with temporal variance in activations. To address these issues, we propose Q-DiT, a novel approach that seamlessly integrates two key techniques: automatic quantization granularity allocation to handle the significant variance of weights and activations across input channels, and sample-wise dynamic activation quantization to adaptively capture activation changes across both timesteps and samples. Extensive experiments conducted on ImageNet and VBench demonstrate the effectiveness of the proposed Q-DiT. Specifically, when quantizing DiT-XL/2 to W6A8 on ImageNet ($256 \\times 256$), Q-DiT achieves a remarkable reduction in FID by 1.09 compared to the baseline. Under the more challenging W4A8 setting, it maintains high fidelity in image and video generation, establishing a new benchmark for efficient, high-quality quantization in DiTs. Code is available at \\href{https://github.com/Juanerx/Q-DiT}{https://github.com/Juanerx/Q-DiT}.", + "arxiv_url": "http://arxiv.org/abs/2406.17343v2", + "pdf_url": "http://arxiv.org/pdf/2406.17343v2", + "published_date": "2024-06-25", + "categories": [ + "cs.CV", + "cs.AI" + ], + "github_url": "https://github.com/Juanerx/Q-DiT", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models", + "authors": [ + "Bingqi Ma", + "Zhuofan Zong", + "Guanglu Song", + "Hongsheng Li", + "Yu Liu" + ], + "abstract": "Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains to be explored. We observed an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades the prompt-following ability in image generation. 
We identified two main obstacles behind this issue. One is the misalignment between the next token prediction training in LLMs and the requirement for discriminative prompt features in diffusion models. The other is the intrinsic positional bias introduced by the decoder-only architecture. To deal with this issue, we propose a novel framework to fully harness the capabilities of LLMs. Through carefully designed usage guidance, we effectively enhance the text representation capability for prompt encoding and eliminate its inherent positional bias. This allows us to integrate state-of-the-art LLMs into the text-to-image generation model flexibly. Furthermore, we also provide an effective way to fuse multiple LLMs into our framework. Considering the excellent performance and scaling capabilities demonstrated by the transformer architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT) based on the framework. We conduct extensive experiments to validate LI-DiT across model size and data size. Benefiting from the inherent ability of the LLMs and our innovative designs, the prompt understanding performance of LI-DiT easily surpasses state-of-the-art open-source models as well as mainstream closed-source commercial models including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The LLM-Infused Diffuser framework is also one of the core technologies powering SenseMirage, a highly advanced text-to-image model.", + "arxiv_url": "http://arxiv.org/abs/2406.11831v3", + "pdf_url": "http://arxiv.org/pdf/2406.11831v3", + "published_date": "2024-06-17", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "An Analysis on Quantizing Diffusion Transformers", + "authors": [ + "Yuewei Yang", + "Jialiang Wang", + "Xiaoliang Dai", + "Peizhao Zhang", + "Hongbo Zhang" + ], + "abstract": "Diffusion Models (DMs) utilize an iterative denoising process to transform random noise into synthetic data. Initially proposed with a UNet structure, DMs excel at producing images that are virtually indistinguishable with or without conditioned text prompts. Later, a transformer-only structure was combined with DMs to achieve better performance. Though Latent Diffusion Models (LDMs) reduce the computational requirement by denoising in a latent space, it is extremely expensive to run image inference on any device due to the sheer volume of parameters and feature sizes. Post Training Quantization (PTQ) offers an immediate remedy for a smaller storage size and more memory-efficient computation during inferencing. Prior works on PTQ of DMs with UNet structures have addressed the challenges in calibrating parameters for both activations and weights via moderate optimization. In this work, we pioneer an efficient PTQ on the transformer-only structure without any optimization. By analysing challenges in quantizing activations and weights for diffusion transformers, we propose a single-step sampling calibration on activations and adapt group-wise quantization on weights for low-bit quantization. 
We demonstrate the efficiency and effectiveness of proposed methods with preliminary experiments on conditional image generation.", + "arxiv_url": "http://arxiv.org/abs/2406.11100v1", + "pdf_url": "http://arxiv.org/pdf/2406.11100v1", + "published_date": "2024-06-16", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Complex Image-Generative Diffusion Transformer for Audio Denoising", + "authors": [ + "Junhui Li", + "Pu Wang", + "Jialu Li", + "Youshan Zhang" + ], + "abstract": "The audio denoising technique has captured widespread attention in the deep neural network field. Recently, the audio denoising problem has been converted into an image generation task, and deep learning-based approaches have been applied to tackle this problem. However, its performance is still limited, leaving room for further improvement. In order to enhance audio denoising performance, this paper introduces a complex image-generative diffusion transformer that captures more information from the complex Fourier domain. We explore a novel diffusion transformer by integrating the transformer with a diffusion model. Our proposed model demonstrates the scalability of the transformer and expands the receptive field of sparse attention using attention diffusion. Our work is among the first to utilize diffusion transformers to deal with the image generation task for audio denoising. Extensive experiments on two benchmark datasets demonstrate that our proposed model outperforms state-of-the-art methods.", + "arxiv_url": "http://arxiv.org/abs/2406.09161v1", + "pdf_url": "http://arxiv.org/pdf/2406.09161v1", + "published_date": "2024-06-13", + "categories": [ + "cs.SD", + "eess.AS" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "DiTFastAttn: Attention Compression for Diffusion Transformer Models", + "authors": [ + "Zhihang Yuan", + "Hanling Zhang", + "Pu Lu", + "Xuefei Ning", + "Linfeng Zhang", + "Tianchen Zhao", + "Shengen Yan", + "Guohao Dai", + "Yu Wang" + ], + "abstract": "Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip redundant computations during conditional generation. We apply DiTFastAttn to DiT, PixArt-Sigma for image generation tasks, and OpenSora for video generation tasks. 
Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.", + "arxiv_url": "http://arxiv.org/abs/2406.08552v2", + "pdf_url": "http://arxiv.org/pdf/2406.08552v2", + "published_date": "2024-06-12", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "What If We Recaption Billions of Web Images with LLaMA-3?", + "authors": [ + "Xianhang Li", + "Haoqin Tu", + "Mude Hui", + "Zeyu Wang", + "Bingchen Zhao", + "Junfei Xiao", + "Sucheng Ren", + "Jieru Mei", + "Qing Liu", + "Huangjie Zheng", + "Yuyin Zhou", + "Cihang Xie" + ], + "abstract": "Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and \\textit{open-sourced} LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/", + "arxiv_url": "http://arxiv.org/abs/2406.08478v2", + "pdf_url": "http://arxiv.org/pdf/2406.08478v2", + "published_date": "2024-06-12", + "categories": [ + "cs.CV", + "cs.CL" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation", + "authors": [ + "Kai Wang", + "Shijian Deng", + "Jing Shi", + "Dimitrios Hatzinakos", + "Yapeng Tian" + ], + "abstract": "Recent Diffusion Transformers (DiTs) have shown impressive capabilities in generating high-quality single-modality content, including images, videos, and audio. However, it is still under-explored whether the transformer-based diffuser can efficiently denoise the Gaussian noises towards superb multimodal content creation. To bridge this gap, we introduce AV-DiT, a novel and efficient audio-visual diffusion transformer designed to generate high-quality, realistic videos with both visual and audio tracks. To minimize model complexity and computational costs, AV-DiT utilizes a shared DiT backbone pre-trained on image-only data, with only lightweight, newly inserted adapters being trainable. This shared backbone facilitates both audio and video generation. Specifically, the video branch incorporates a trainable temporal attention layer into a frozen pre-trained DiT block for temporal consistency. 
Additionally, a small number of trainable parameters adapt the image-based DiT block for audio generation. An extra shared DiT block, equipped with lightweight parameters, facilitates feature interaction between audio and visual modalities, ensuring alignment. Extensive experiments on the AIST++ and Landscape datasets demonstrate that AV-DiT achieves state-of-the-art performance in joint audio-visual generation with significantly fewer tunable parameters. Furthermore, our results highlight that a single shared image generative backbone with modality-specific adaptations is sufficient for constructing a joint audio-video generator. Our source code and pre-trained models will be released.", + "arxiv_url": "http://arxiv.org/abs/2406.07686v1", + "pdf_url": "http://arxiv.org/pdf/2406.07686v1", + "published_date": "2024-06-11", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT", + "authors": [ + "Le Zhuo", + "Ruoyi Du", + "Han Xiao", + "Yangguang Li", + "Dongyang Liu", + "Rongjie Huang", + "Wenze Liu", + "Lirui Zhao", + "Fu-Yun Wang", + "Zhanyu Ma", + "Xu Luo", + "Zehan Wang", + "Kaipeng Zhang", + "Xiangyang Zhu", + "Si Liu", + "Xiangyu Yue", + "Dingning Liu", + "Wanli Ouyang", + "Ziwei Liu", + "Yu Qiao", + "Hongsheng Li", + "Peng Gao" + ], + "abstract": "Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduced a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. 
By releasing all codes and model weights, we aim to advance the development of next-generation generative AI capable of universal modeling.", + "arxiv_url": "http://arxiv.org/abs/2406.18583v1", + "pdf_url": "http://arxiv.org/pdf/2406.18583v1", + "published_date": "2024-06-05", + "categories": [ + "cs.CV", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation", + "authors": [ + "Tianchen Zhao", + "Tongcheng Fang", + "Enshu Liu", + "Rui Wan", + "Widyadewi Soedarmadji", + "Shiyao Li", + "Zinan Lin", + "Guohao Dai", + "Shengen Yan", + "Huazhong Yang", + "Xuefei Ning", + "Yu Wang" + ], + "abstract": "Diffusion transformers (DiTs) have exhibited remarkable performance in visual generation tasks, such as generating realistic images or videos based on textual instructions. However, larger model sizes and multi-frame processing for video generation lead to increased computational and memory costs, posing challenges for practical deployment on edge devices. Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity. When quantizing diffusion transformers, we find that applying existing diffusion quantization methods designed for U-Net faces challenges in preserving quality. After analyzing the major challenges for quantizing diffusion transformers, we design an improved quantization scheme, \"ViDiT-Q\" (Video and Image Diffusion Transformer Quantization), to address these issues. Furthermore, we identify that highly sensitive layers and timesteps hinder quantization for lower bit-widths. To tackle this, we improve ViDiT-Q with a novel metric-decoupled mixed-precision quantization method (ViDiT-Q-MP). We validate the effectiveness of ViDiT-Q across a variety of text-to-image and video models. While baseline quantization methods fail at W8A8 and produce unreadable content at W4A8, ViDiT-Q achieves lossless W8A8 quantization. ViDiT-Q-MP achieves W4A8 with negligible visual quality degradation, resulting in a 2.5x memory optimization and a 1.5x latency speedup.", + "arxiv_url": "http://arxiv.org/abs/2406.02540v2", + "pdf_url": "http://arxiv.org/pdf/2406.02540v2", + "published_date": "2024-06-04", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "text-to-image", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "$Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers", + "authors": [ + "Pengtao Chen", + "Mingzhu Shen", + "Peng Ye", + "Jianjian Cao", + "Chongjun Tu", + "Christos-Savvas Bouganis", + "Yiren Zhao", + "Tao Chen" + ], + "abstract": "Diffusion models are widely recognized for generating high-quality and diverse images, but their poor real-time performance has led to numerous acceleration works, primarily focusing on UNet-based structures. With the more successful results achieved by diffusion transformers (DiT), there is still a lack of exploration regarding the impact of DiT structure on generation, as well as the absence of an acceleration framework tailored to the DiT architecture. To tackle these challenges, we conduct an investigation into the correlation between DiT blocks and image generation. 
Our findings reveal that the front blocks of DiT are associated with the outline of the generated images, while the rear blocks are linked to the details. Based on this insight, we propose an overall training-free inference acceleration framework $\\Delta$-DiT: using a designed cache mechanism to accelerate the rear DiT blocks in the early sampling stages and the front DiT blocks in the later stages. Specifically, a DiT-specific cache mechanism called $\\Delta$-Cache is proposed, which considers the inputs of the previous sampling image and reduces the bias in the inference. Extensive experiments on PIXART-$\\alpha$ and DiT-XL demonstrate that the $\\Delta$-DiT can achieve a $1.6\\times$ speedup on the 20-step generation and even improves performance in most cases. In the scenario of 4-step consistent model generation and the more challenging $1.12\\times$ acceleration, our method significantly outperforms existing methods. Our code will be publicly available.", + "arxiv_url": "http://arxiv.org/abs/2406.01125v1", + "pdf_url": "http://arxiv.org/pdf/2406.01125v1", + "published_date": "2024-06-03", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers", + "authors": [ + "Jun Zheng", + "Fuwei Zhao", + "Youjiang Xu", + "Xin Dong", + "Xiaodan Liang" + ], + "abstract": "Video try-on stands as a promising area for its tremendous real-world potential. Prior works are limited to transferring product clothing images onto person videos with simple poses and backgrounds, while underperforming on casually captured videos. Recently, Sora revealed the scalability of Diffusion Transformer (DiT) in generating lifelike videos featuring real-world scenarios. Inspired by this, we explore and propose the first DiT-based video try-on framework for practical in-the-wild applications, named VITON-DiT. Specifically, VITON-DiT consists of a garment extractor, a Spatial-Temporal denoising DiT, and an identity preservation ControlNet. To faithfully recover the clothing details, the extracted garment features are fused with the self-attention outputs of the denoising DiT and the ControlNet. We also introduce novel random selection strategies during training and an Interpolated Auto-Regressive (IAR) technique at inference to facilitate long video generation. Unlike existing attempts that require the laborious and restrictive construction of a paired training dataset, severely limiting their scalability, VITON-DiT alleviates this by relying solely on unpaired human dance videos and a carefully designed multi-stage training strategy. Furthermore, we curate a challenging benchmark dataset to evaluate the performance of casual video try-on. 
Extensive experiments demonstrate the superiority of VITON-DiT in generating spatio-temporal consistent try-on results for in-the-wild videos with complicated human poses.", + "arxiv_url": "http://arxiv.org/abs/2405.18326v2", + "pdf_url": "http://arxiv.org/pdf/2405.18326v2", + "published_date": "2024-05-28", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer", + "authors": [ + "Ruizhi Shao", + "Youxin Pang", + "Zerong Zheng", + "Jingxiang Sun", + "Yebin Liu" + ], + "abstract": "We present a novel approach for generating 360-degree high-quality, spatio-temporally coherent human videos from a single image. Our framework combines the strengths of diffusion transformers for capturing global correlations across viewpoints and time, and CNNs for accurate condition injection. The core is a hierarchical 4D transformer architecture that factorizes self-attention across views, time steps, and spatial dimensions, enabling efficient modeling of the 4D space. Precise conditioning is achieved by injecting human identity, camera parameters, and temporal signals into the respective transformers. To train this model, we collect a multi-dimensional dataset spanning images, videos, multi-view data, and limited 4D footage, along with a tailored multi-dimensional training strategy. Our approach overcomes the limitations of previous methods based on generative adversarial networks or vanilla diffusion models, which struggle with complex motions, viewpoint changes, and generalization. Through extensive experiments, we demonstrate our method's ability to synthesize 360-degree realistic, coherent human motion videos, paving the way for advanced multimedia applications in areas such as virtual reality and animation.", + "arxiv_url": "http://arxiv.org/abs/2405.17405v2", + "pdf_url": "http://arxiv.org/pdf/2405.17405v2", + "published_date": "2024-05-27", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "PTQ4DiT: Post-training Quantization for Diffusion Transformers", + "authors": [ + "Junyi Wu", + "Haoxuan Wang", + "Yuzhang Shang", + "Mubarak Shah", + "Yan Yan" + ], + "abstract": "The recent introduction of Diffusion Transformers (DiTs) has demonstrated exceptional capabilities in image generation by using a different backbone architecture, departing from traditional U-Nets and embracing the scalable nature of transformers. Despite their advanced capabilities, the wide deployment of DiTs, particularly for real-time applications, is currently hampered by considerable computational demands at the inference stage. Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint by using low-bit weights and activations. However, its applicability to DiTs has not yet been explored and faces non-trivial difficulties due to the unique design of DiTs. In this paper, we propose PTQ4DiT, a specifically designed PTQ method for DiTs. We discover two primary quantization challenges inherent in DiTs, notably the presence of salient channels with extreme magnitudes and the temporal variability in distributions of salient activation over multiple timesteps. 
To tackle these challenges, we propose Channel-wise Salience Balancing (CSB) and Spearman's $\\rho$-guided Salience Calibration (SSC). CSB leverages the complementarity property of channel magnitudes to redistribute the extremes, alleviating quantization errors for both activations and weights. SSC extends this approach by dynamically adjusting the balanced salience to capture the temporal variations in activation. Additionally, to eliminate extra computational costs caused by PTQ4DiT during inference, we design an offline re-parameterization strategy for DiTs. Experiments demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.", + "arxiv_url": "http://arxiv.org/abs/2405.16005v3", + "pdf_url": "http://arxiv.org/pdf/2405.16005v3", + "published_date": "2024-05-25", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation", + "authors": [ + "Shentong Mo", + "Yapeng Tian" + ], + "abstract": "In recent developments, the Mamba architecture, known for its selective state space approach, has shown potential in the efficient modeling of long sequences. However, its application in image generation remains underexplored. Traditional diffusion transformers (DiT), which utilize self-attention blocks, are effective but their computational complexity scales quadratically with the input length, limiting their use for high-resolution images. To address this challenge, we introduce a novel diffusion architecture, Diffusion Mamba (DiM), which foregoes traditional attention mechanisms in favor of a scalable alternative. By harnessing the inherent efficiency of the Mamba architecture, DiM achieves rapid inference times and reduced computational load, maintaining linear complexity with respect to sequence length. Our architecture not only scales effectively but also outperforms existing diffusion transformers in both image and video generation tasks. The results affirm the scalability and efficiency of DiM, establishing a new benchmark for image and video generation techniques. This work advances the field of generative models and paves the way for further applications of scalable architectures.", + "arxiv_url": "http://arxiv.org/abs/2405.15881v1", + "pdf_url": "http://arxiv.org/pdf/2405.15881v1", + "published_date": "2024-05-24", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "TerDiT: Ternary Diffusion Models with Transformers", + "authors": [ + "Xudong Lu", + "Aojun Zhou", + "Ziyi Lin", + "Qi Liu", + "Yuhui Xu", + "Renrui Zhang", + "Yafei Wen", + "Shuai Ren", + "Peng Gao", + "Junchi Yan", + "Hongsheng Li" + ], + "abstract": "Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion models based on transformer architecture (DiTs). Among these diffusion models, diffusion transformers have demonstrated superior image generation capabilities, boasting lower FID scores and higher scalability. 
However, deploying large-scale DiT models can be expensive due to their extensive parameter numbers. Although existing research has explored efficient deployment techniques for diffusion models such as model quantization, there is still little work concerning DiT-based models. To tackle this research gap, in this paper, we propose TerDiT, a quantization-aware training (QAT) and efficient deployment scheme for ternary diffusion models with transformers. We focus on the ternarization of DiT networks and scale model sizes from 600M to 4.2B. Our work contributes to the exploration of efficient deployment strategies for large-scale DiT models, demonstrating the feasibility of training extremely low-bit diffusion transformer models from scratch while maintaining competitive image generation capacities compared to full-precision models. Code will be available at https://github.com/Lucky-Lance/TerDiT.", + "arxiv_url": "http://arxiv.org/abs/2405.14854v1", + "pdf_url": "http://arxiv.org/pdf/2405.14854v1", + "published_date": "2024-05-23", + "categories": [ + "cs.CV", + "cs.LG" + ], + "github_url": "https://github.com/Lucky-Lance/TerDiT", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding", + "authors": [ + "Zhimin Li", + "Jianwei Zhang", + "Qin Lin", + "Jiangfeng Xiong", + "Yanxin Long", + "Xinchi Deng", + "Yingfang Zhang", + "Xingchao Liu", + "Minbin Huang", + "Zedong Xiao", + "Dayou Chen", + "Jiajun He", + "Jiahao Li", + "Wenyue Li", + "Chen Zhang", + "Rongwei Quan", + "Jianxiang Lu", + "Jiabin Huang", + "Xiaoyan Yuan", + "Xiaoxiao Zheng", + "Yixuan Li", + "Jihong Zhang", + "Chao Zhang", + "Meng Chen", + "Jie Liu", + "Zheng Fang", + "Weiyan Wang", + "Jinbao Xue", + "Yangyu Tao", + "Jianchen Zhu", + "Kai Liu", + "Sihuan Lin", + "Yifu Sun", + "Yun Li", + "Dongdong Wang", + "Mingtao Chen", + "Zhichao Hu", + "Xiao Xiao", + "Yan Chen", + "Yuhong Liu", + "Wei Liu", + "Di Wang", + "Yong Yang", + "Jie Jiang", + "Qinglin Lu" + ], + "abstract": "We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. 
Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT", + "arxiv_url": "http://arxiv.org/abs/2405.08748v1", + "pdf_url": "http://arxiv.org/pdf/2405.08748v1", + "published_date": "2024-05-14", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/Tencent/HunyuanDiT", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer", + "authors": [ + "Zhuoyi Yang", + "Heyang Jiang", + "Wenyi Hong", + "Jiayan Teng", + "Wendi Zheng", + "Yuxiao Dong", + "Ming Ding", + "Jie Tang" + ], + "abstract": "Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory during generating ultra-high-resolution images (e.g. 4096*4096), the resolution of generated images is often limited to 1024*1024. In this work, we propose a unidirectional block attention mechanism that can adaptively adjust the memory overhead during the inference process and handle global dependencies. Building on this module, we adopt the DiT structure for upsampling and develop an infinite super-resolution model capable of upsampling images of various shapes and resolutions. Comprehensive experiments show that our model achieves SOTA performance in generating ultra-high-resolution images in both machine and human evaluation. Compared to commonly used UNet structures, our model can save more than 5x memory when generating 4096*4096 images. The project URL is https://github.com/THUDM/Inf-DiT.", + "arxiv_url": "http://arxiv.org/abs/2405.04312v2", + "pdf_url": "http://arxiv.org/pdf/2405.04312v2", + "published_date": "2024-05-07", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/THUDM/Inf-DiT", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers", + "authors": [ + "Yuchuan Tian", + "Zhijun Tu", + "Hanting Chen", + "Jie Hu", + "Chao Xu", + "Yunhe Wang" + ], + "abstract": "Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance and good scalability; but meanwhile, the abandonment of U-Net by DiTs and their following improvements is worth rethinking. To this end, we conduct a simple toy experiment by comparing a U-Net architectured DiT with an isotropic one. It turns out that the U-Net architecture only gains a slight advantage from the U-Net inductive bias, indicating potential redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net backbone features are low-frequency-dominated, we perform token downsampling on the query-key-value tuple for self-attention, which brings further improvements despite a considerable reduction in computation. Based on self-attention with downsampled tokens, we propose a series of U-shaped DiTs (U-DiTs) in the paper and conduct extensive experiments to demonstrate the extraordinary performance of U-DiT models. The proposed U-DiT could outperform DiT-XL/2 with only 1/6 of its computation cost. 
Codes are available at https://github.com/YuchuanTian/U-DiT.", + "arxiv_url": "http://arxiv.org/abs/2405.02730v3", + "pdf_url": "http://arxiv.org/pdf/2405.02730v3", + "published_date": "2024-05-04", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/YuchuanTian/U-DiT", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Lazy Diffusion Transformer for Interactive Image Editing", + "authors": [ + "Yotam Nitzan", + "Zongze Wu", + "Richard Zhang", + "Eli Shechtman", + "Daniel Cohen-Or", + "Taesung Park", + "Michaël Gharbi" + ], + "abstract": "We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a \"lazy\" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.", + "arxiv_url": "http://arxiv.org/abs/2404.12382v1", + "pdf_url": "http://arxiv.org/pdf/2404.12382v1", + "published_date": "2024-04-18", + "categories": [ + "cs.CV", + "cs.AI", + "cs.GR" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image editing" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers", + "authors": [ + "Nithin Gopalakrishnan Nair", + "Jeya Maria Jose Valanarasu", + "Vishal M. Patel" + ], + "abstract": "Recently, diffusion transformers have gained wide attention with their excellent performance in text-to-image and text-to-video models, emphasizing the need for transformers as the backbone for diffusion models. Transformer-based models have shown better generalization capability compared to CNN-based models for general vision tasks. However, much less has been explored in the existing literature regarding the capabilities of transformer-based diffusion backbones and expanding their generative prowess to other datasets. This paper focuses on enabling a single pre-trained diffusion transformer model to scale across multiple datasets swiftly, allowing for the completion of diverse generative tasks using just one model. To this end, we propose DiffScaler, an efficient scaling strategy for diffusion models where we train a minimal amount of parameters to adapt to different tasks. 
In particular, we learn task-specific transformations at each layer by incorporating the ability to utilize the learned subspaces of the pre-trained model, as well as the ability to learn additional task-specific subspaces, which may be absent in the pre-training dataset. As these parameters are independent, a single diffusion model with these task-specific parameters can be used to perform multiple tasks simultaneously. Moreover, we find that transformer-based diffusion models significantly outperform CNN-based diffusion methods when fine-tuning over smaller datasets. We perform experiments on four unconditional image generation datasets. We show that using our proposed method, a single pre-trained model can scale up to perform these conditional and unconditional tasks, respectively, with minimal parameter tuning while performing nearly as well as fine-tuning an entire diffusion model for that particular task.", + "arxiv_url": "http://arxiv.org/abs/2404.09976v1", + "pdf_url": "http://arxiv.org/pdf/2404.09976v1", + "published_date": "2024-04-15", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction", + "authors": [ + "Keyu Tian", + "Yi Jiang", + "Zehuan Yuan", + "Bingyue Peng", + "Liwei Wang" + ], + "abstract": "We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine \"next-scale prediction\" or \"next-resolution prediction\", diverging from the standard raster-scan \"next-token prediction\". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes GPT-like AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves the AR baseline, improving the Frechet inception distance (FID) from 18.65 to 1.73 and the inception score (IS) from 80.4 to 350.2, with around 20x faster inference speed. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated the two important properties of LLMs: Scaling Laws and zero-shot task generalization. 
We have released all models and codes to promote the exploration of AR/VAR models for visual generation and unified learning.", + "arxiv_url": "http://arxiv.org/abs/2404.02905v2", + "pdf_url": "http://arxiv.org/pdf/2404.02905v2", + "published_date": "2024-04-03", + "categories": [ + "cs.CV", + "cs.AI" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Condition-Aware Neural Network for Controlled Image Generation", + "authors": [ + "Han Cai", + "Muyang Li", + "Zhuoyang Zhang", + "Qinsheng Zhang", + "Ming-Yu Liu", + "Song Han" + ], + "abstract": "We present Condition-Aware Neural Network (CAN), a new method for adding control to image generative models. In parallel to prior conditional control methods, CAN controls the image generation process by dynamically manipulating the weight of the neural network. This is achieved by introducing a condition-aware weight generation module that generates conditional weight for convolution/linear layers based on the input condition. We test CAN on class-conditional image generation on ImageNet and text-to-image generation on COCO. CAN consistently delivers significant improvements for diffusion transformer models, including DiT and UViT. In particular, CAN combined with EfficientViT (CaT) achieves 2.78 FID on ImageNet 512x512, surpassing DiT-XL/2 while requiring 52x fewer MACs per sampling step.", + "arxiv_url": "http://arxiv.org/abs/2404.01143v1", + "pdf_url": "http://arxiv.org/pdf/2404.01143v1", + "published_date": "2024-04-01", + "categories": [ + "cs.CV", + "cs.AI" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer", + "authors": [ + "Rui Zhu", + "Yingwei Pan", + "Yehao Li", + "Ting Yao", + "Zhenglong Sun", + "Tao Mei", + "Chang Wen Chen" + ], + "abstract": "Diffusion Transformer (DiT) has emerged as the new trend of generative diffusion models on image generation. In view of extremely slow convergence in typical DiT, recent breakthroughs have been driven by mask strategy that significantly improves the training efficiency of DiT with additional intra-image contextual learning. Despite this progress, mask strategy still suffers from two inherent limitations: (a) training-inference discrepancy and (b) fuzzy relations between mask reconstruction & generative diffusion process, resulting in sub-optimal training of DiT. In this work, we address these limitations by novelly unleashing the self-supervised discrimination knowledge to boost DiT training. Technically, we frame our DiT in a teacher-student manner. The teacher-student discriminative pairs are built on the diffusion noises along the same Probability Flow Ordinary Differential Equation (PF-ODE). Instead of applying mask reconstruction loss over both DiT encoder and decoder, we decouple DiT encoder and decoder to separately tackle discriminative and generative objectives. In particular, by encoding discriminative pairs with student and teacher DiT encoders, a new discriminative loss is designed to encourage the inter-image alignment in the self-supervised embedding space. After that, student samples are fed into student DiT decoder to perform the typical generative diffusion task. 
Extensive experiments are conducted on ImageNet dataset, and our method achieves a competitive balance between training cost and generative capacity.", + "arxiv_url": "http://arxiv.org/abs/2403.17004v1", + "pdf_url": "http://arxiv.org/pdf/2403.17004v1", + "published_date": "2024-03-25", + "categories": [ + "cs.CV", + "cs.MM" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation", + "authors": [ + "Junsong Chen", + "Chongjian Ge", + "Enze Xie", + "Yue Wu", + "Lewei Yao", + "Xiaozhe Ren", + "Zhongdao Wang", + "Ping Luo", + "Huchuan Lu", + "Zhenguo Li" + ], + "abstract": "In this paper, we introduce PixArt-\\Sigma, a Diffusion Transformer model~(DiT) capable of directly generating images at 4K resolution. PixArt-\\Sigma represents a significant advancement over its predecessor, PixArt-\\alpha, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-\\Sigma is its training efficiency. Leveraging the foundational pre-training of PixArt-\\alpha, it evolves from the `weaker' baseline to a `stronger' model via incorporating higher quality data, a process we term \"weak-to-strong training\". The advancements in PixArt-\\Sigma are twofold: (1) High-Quality Training Data: PixArt-\\Sigma incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-\\Sigma achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-\\Sigma's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.", + "arxiv_url": "http://arxiv.org/abs/2403.04692v2", + "pdf_url": "http://arxiv.org/pdf/2403.04692v2", + "published_date": "2024-03-07", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Structure-Guided Adversarial Training of Diffusion Models", + "authors": [ + "Ling Yang", + "Haotian Qian", + "Zhilong Zhang", + "Jingwei Liu", + "Bin Cui" + ], + "abstract": "Diffusion models have demonstrated exceptional efficacy in various generative applications. While existing models focus on minimizing a weighted sum of denoising score matching losses for data distribution modeling, their training primarily emphasizes instance-level optimization, overlooking valuable structural information within each mini-batch, indicative of pair-wise relationships among samples. To address this limitation, we introduce Structure-guided Adversarial training of Diffusion Models (SADM). In this pioneering approach, we compel the model to learn manifold structures between samples in each training batch. 
To ensure the model captures authentic manifold structures in the data distribution, we advocate adversarial training of the diffusion generator against a novel structure discriminator in a minimax game, distinguishing real manifold structures from the generated ones. SADM substantially improves existing diffusion transformers (DiT) and outperforms existing methods in image generation and cross-domain fine-tuning tasks across 12 datasets, establishing a new state-of-the-art FID of 1.58 and 2.11 on ImageNet for class-conditional image generation at resolutions of 256x256 and 512x512, respectively.", + "arxiv_url": "http://arxiv.org/abs/2402.17563v2", + "pdf_url": "http://arxiv.org/pdf/2402.17563v2", + "published_date": "2024-02-27", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Cross-view Masked Diffusion Transformers for Person Image Synthesis", + "authors": [ + "Trung X. Pham", + "Zhang Kang", + "Chang D. Yoo" + ], + "abstract": "We present X-MDPT ($\\underline{Cross}$-view $\\underline{M}$asked $\\underline{D}$iffusion $\\underline{P}$rediction $\\underline{T}$ransformers), a novel diffusion model designed for pose-guided human image generation. X-MDPT distinguishes itself by employing masked diffusion transformers that operate on latent patches, a departure from the commonly-used Unet structures in existing works. The model comprises three key modules: 1) a denoising diffusion Transformer, 2) an aggregation network that consolidates conditions into a single vector for the diffusion process, and 3) a mask cross-prediction module that enhances representation learning with semantic information from the reference image. X-MDPT demonstrates scalability, improving FID, SSIM, and LPIPS with larger models. Despite its simple design, our model outperforms state-of-the-art approaches on the DeepFashion dataset while exhibiting efficiency in terms of training parameters, training time, and inference speed. Our compact 33MB model achieves an FID of 7.42, surpassing a prior Unet latent diffusion approach (FID 8.07) using only $11\\times$ fewer parameters. Our best model surpasses the pixel-based diffusion with $\\frac{2}{3}$ of the parameters and achieves $5.43 \\times$ faster inference. The code is available at https://github.com/trungpx/xmdpt.", + "arxiv_url": "http://arxiv.org/abs/2402.01516v2", + "pdf_url": "http://arxiv.org/pdf/2402.01516v2", + "published_date": "2024-02-02", + "categories": [ + "cs.CV" + ], + "github_url": "https://github.com/trungpx/xmdpt", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers", + "authors": [ + "Katherine Crowson", + "Stefan Andreas Baumann", + "Alex Birch", + "Tanishq Mathew Abraham", + "Daniel Z. Kaplan", + "Enrico Shippole" + ], + "abstract": "We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high-resolution (e.g. $1024 \\times 1024$) directly in pixel-space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. 
HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet $256^2$, and sets a new state-of-the-art for diffusion models on FFHQ-$1024^2$.", + "arxiv_url": "http://arxiv.org/abs/2401.11605v1", + "pdf_url": "http://arxiv.org/pdf/2401.11605v1", + "published_date": "2024-01-21", + "categories": [ + "cs.CV", + "cs.AI", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "diffusion transformer" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Latte: Latent Diffusion Transformer for Video Generation", + "authors": [ + "Xin Ma", + "Yaohui Wang", + "Gengyun Jia", + "Xinyuan Chen", + "Ziwei Liu", + "Yuan-Fang Li", + "Cunjian Chen", + "Yu Qiao" + ], + "abstract": "We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to the text-to-video generation (T2V) task, where Latte achieves results comparable to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.", + "arxiv_url": "http://arxiv.org/abs/2401.03048v1", + "pdf_url": "http://arxiv.org/pdf/2401.03048v1", + "published_date": "2024-01-05", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Lightning-Fast Image Inversion and Editing for Text-to-Image Diffusion Models", + "authors": [ + "Dvir Samuel", + "Barak Meiri", + "Haggai Maron", + "Yoad Tewel", + "Nir Darshan", + "Shai Avidan", + "Gal Chechik", + "Rami Ben-Ari" + ], + "abstract": "Diffusion inversion is the problem of taking an image and a text prompt that describes it and finding a noise latent that would generate the exact same image. Most current deterministic inversion techniques operate by approximately solving an implicit equation and may converge slowly or yield poorly reconstructed images. We formulate the problem as finding the roots of an implicit equation and develop a method to solve it efficiently. Our solution is based on Newton-Raphson (NR), a well-known technique in numerical analysis. We show that a vanilla application of NR is computationally infeasible while naively transforming it to a computationally tractable alternative tends to converge to out-of-distribution solutions, resulting in poor reconstruction and editing. We therefore derive an efficient guided formulation that converges quickly and provides high-quality reconstructions and editing. 
We showcase our method on real image editing with three popular open-sourced diffusion models: Stable Diffusion, SDXL-Turbo, and Flux with different deterministic schedulers. Our solution, Guided Newton-Raphson Inversion, inverts an image within 0.4 sec (on an A100 GPU) for few-step models (SDXL-Turbo and Flux.1), opening the door for interactive image editing. We further show improved results in image interpolation and generation of rare objects.", + "arxiv_url": "http://arxiv.org/abs/2312.12540v4", + "pdf_url": "http://arxiv.org/pdf/2312.12540v4", + "published_date": "2023-12-19", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "text-to-image", + "image editing", + "FLUX", + "inversion" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "GenTron: Diffusion Transformers for Image and Video Generation", + "authors": [ + "Shoufa Chen", + "Mengmeng Xu", + "Jiawei Ren", + "Yuren Cong", + "Sen He", + "Yanping Xie", + "Animesh Sinha", + "Ping Luo", + "Tao Xiang", + "Juan-Manuel Perez-Rua" + ], + "abstract": "In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain primarily utilizes CNN-based U-Net architectures, particularly in diffusion-based models. We introduce GenTron, a family of Generative models employing Transformer-based diffusion, to address this gap. Our initial step was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a process involving thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters, observing significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.", + "arxiv_url": "http://arxiv.org/abs/2312.04557v2", + "pdf_url": "http://arxiv.org/pdf/2312.04557v2", + "published_date": "2023-12-07", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "PixArt-$α$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis", + "authors": [ + "Junsong Chen", + "Jincheng Yu", + "Chongjian Ge", + "Lewei Yao", + "Enze Xie", + "Yue Wu", + "Zhongdao Wang", + "James Kwok", + "Ping Luo", + "Huchuan Lu", + "Zhenguo Li" + ], + "abstract": "The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. 
To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-$\\alpha$'s training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-$\\alpha$ only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \\$300,000 (\\$26,000 vs. \\$320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-$\\alpha$ excels in image quality, artistry, and semantic control. We hope PIXART-$\\alpha$ will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.", + "arxiv_url": "http://arxiv.org/abs/2310.00426v3", + "pdf_url": "http://arxiv.org/pdf/2310.00426v3", + "published_date": "2023-09-30", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "Control", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models", + "authors": [ + "Feihong He", + "Gang Li", + "Lingyu Si", + "Leilei Yan", + "Shimeng Hou", + "Hongwei Dong", + "Fanzhang Li" + ], + "abstract": "Image cartoonization has attracted significant interest in the field of image generation. However, most of the existing image cartoonization techniques require re-training models using images of cartoon style. In this paper, we present CartoonDiff, a novel training-free sampling approach which generates image cartoonization using diffusion transformer models. Specifically, we decompose the reverse process of diffusion models into the semantic generation phase and the detail generation phase. Furthermore, we implement the image cartoonization process by normalizing high-frequency signal of the noisy image in specific denoising steps. CartoonDiff doesn't require any additional reference images, complex model designs, or the tedious adjustment of multiple parameters. Extensive experimental results show the powerful ability of our CartoonDiff. The project page is available at: https://cartoondiff.github.io/", + "arxiv_url": "http://arxiv.org/abs/2309.08251v1", + "pdf_url": "http://arxiv.org/pdf/2309.08251v1", + "published_date": "2023-09-15", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "image generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "VDT: General-purpose Video Diffusion Transformers via Mask Modeling", + "authors": [ + "Haoyu Lu", + "Guoxing Yang", + "Nanyi Fei", + "Yuqi Huo", + "Zhiwu Lu", + "Ping Luo", + "Mingyu Ding" + ], + "abstract": "This work introduces Video Diffusion Transformer (VDT), which pioneers the use of transformers in diffusion-based video generation. 
It features transformer blocks with modularized temporal and spatial attention modules to leverage the rich spatial-temporal representation inherent in transformers. We also propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios. VDT offers several appealing benefits. 1) It excels at capturing temporal dependencies to produce temporally consistent video frames and even simulate the physics and dynamics of 3D objects over time. 2) It facilitates flexible conditioning information, e.g., simple concatenation in the token space, effectively unifying different token lengths and modalities. 3) Pairing with our proposed spatial-temporal mask modeling mechanism, it becomes a general-purpose video diffuser for harnessing a range of tasks, including unconditional generation, video prediction, interpolation, animation, and completion, etc. Extensive experiments on these tasks spanning various scenarios, including autonomous driving, natural weather, human action, and physics-based simulation, demonstrate the effectiveness of VDT. Additionally, we present comprehensive studies on how VDT handles conditioning information with the mask modeling mechanism, which we believe will benefit future research and advance the field. Project page: https://VDT-2023.github.io", + "arxiv_url": "http://arxiv.org/abs/2305.13311v2", + "pdf_url": "http://arxiv.org/pdf/2305.13311v2", + "published_date": "2023-05-22", + "categories": [ + "cs.CV" + ], + "github_url": "", + "keywords": [ + "diffusion transformer", + "video generation" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion", + "authors": [ + "Seongmin Lee", + "Benjamin Hoover", + "Hendrik Strobelt", + "Zijie J. Wang", + "ShengYun Peng", + "Austin Wright", + "Kevin Li", + "Haekyu Park", + "Haoyang Yang", + "Duen Horng Chau" + ], + "abstract": "Diffusion-based generative models' impressive ability to create convincing images has garnered global attention. However, their complex structures and operations often pose challenges for non-experts to grasp. We present Diffusion Explainer, the first interactive visualization tool that explains how Stable Diffusion transforms text prompts into images. Diffusion Explainer tightly integrates a visual overview of Stable Diffusion's complex structure with explanations of the underlying operations. By comparing image generation of prompt variants, users can discover the impact of keyword changes on image generation. A 56-participant user study demonstrates that Diffusion Explainer offers substantial learning benefits to non-experts. Our tool has been used by over 10,300 users from 124 countries at https://poloclub.github.io/diffusion-explainer/.", + "arxiv_url": "http://arxiv.org/abs/2305.03509v3", + "pdf_url": "http://arxiv.org/pdf/2305.03509v3", + "published_date": "2023-05-04", + "categories": [ + "cs.CL", + "cs.AI", + "cs.HC", + "cs.LG" + ], + "github_url": "", + "keywords": [ + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Semantic-Conditional Diffusion Networks for Image Captioning", + "authors": [ + "Jianjie Luo", + "Yehao Li", + "Yingwei Pan", + "Ting Yao", + "Jianlin Feng", + "Hongyang Chao", + "Tao Mei" + ], + "abstract": "Recent advances in text-to-image generation have witnessed the rise of diffusion models which act as powerful generative models. 
Nevertheless, it is not trivial to exploit such latent variable models to capture the dependency among discrete words and meanwhile pursue complex visual-language alignment in image captioning. In this paper, we break the deeply rooted conventions in learning a Transformer-based encoder-decoder, and propose a new diffusion model based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net). Technically, for each input image, we first search the semantically relevant sentences via a cross-modal retrieval model to convey the comprehensive semantic information. The rich semantics are further regarded as a semantic prior to trigger the learning of Diffusion Transformer, which produces the output sentence in a diffusion process. In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visual-language alignment and linguistic coherence in a cascaded manner. Furthermore, to stabilize the diffusion process, a new self-critical sequence training strategy is designed to guide the learning of SCD-Net with the knowledge of a standard autoregressive Transformer model. Extensive experiments on the COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task. Source code is available at https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet.", + "arxiv_url": "http://arxiv.org/abs/2212.03099v1", + "pdf_url": "http://arxiv.org/pdf/2212.03099v1", + "published_date": "2022-12-06", + "categories": [ + "cs.CV", + "cs.CL", + "cs.MM" + ], + "github_url": "https://github.com/YehLi/xmodaler", + "keywords": [ + "diffusion transformer", + "image generation", + "text-to-image" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Spectral and Imaging properties of Sgr A* from High-Resolution 3D GRMHD Simulations with Radiative Cooling", + "authors": [ + "Doosoo Yoon", + "Koushik Chatterjee", + "Sera Markoff", + "David van Eijnatten", + "Ziri Younsi", + "Matthew Liska", + "Alexander Tchekhovskoy" + ], + "abstract": "The candidate supermassive black hole in the Galactic Centre, Sagittarius A* (Sgr A*), is known to be fed by a radiatively inefficient accretion flow (RIAF), inferred from its low accretion rate. Consequently, radiative cooling has in general been overlooked in the study of Sgr A*. However, the radiative properties of the plasma in RIAFs are poorly understood. In this work, using full 3D general-relativistic magneto-hydrodynamical simulations, we study the impact of radiative cooling on the dynamical evolution of the accreting plasma, presenting spectral energy distributions and synthetic sub-millimeter images generated from the accretion flow around Sgr A*. These simulations solve the approximated equations for radiative cooling processes self-consistently, including synchrotron, bremsstrahlung, and inverse Compton processes. We find that radiative cooling plays an increasingly important role in the dynamics of the accretion flow as the accretion rate increases: the mid-plane density grows and the infalling gas is less turbulent as cooling becomes stronger. The changes in the dynamical evolution become important when the accretion rate is larger than $10^{-8}\,M_{\odot}~{\rm yr}^{-1}$ ($\gtrsim 10^{-7} \dot{M}_{\rm Edd}$, where $\dot{M}_{\rm Edd}$ is the Eddington accretion rate). 
The resulting spectra in the cooled models also differ from those in the non-cooled models: the overall flux, including the peak values at the sub-mm and the far-UV, is slightly lower as a consequence of a decrease in the electron temperature. Our results suggest that radiative cooling should be carefully taken into account in modelling Sgr A* and other low-luminosity active galactic nuclei that have a mass accretion rate of $\\dot{M} > 10^{-7}\\,\\dot{M}_{\\rm Edd}$.", + "arxiv_url": "http://arxiv.org/abs/2009.14227v1", + "pdf_url": "http://arxiv.org/pdf/2009.14227v1", + "published_date": "2020-09-29", + "categories": [ + "astro-ph.HE" + ], + "github_url": "", + "keywords": [ + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Galaxy cluster mass estimation with deep learning and hydrodynamical simulations", + "authors": [ + "Z. Yan", + "A. J. Mead", + "L. Van Waerbeke", + "G. Hinshaw", + "I. G. McCarthy" + ], + "abstract": "We evaluate the ability of Convolutional Neural Networks (CNNs) to predict galaxy cluster masses in the BAHAMAS hydrodynamical simulations. We train four separate single-channel networks using: stellar mass, soft X-ray flux, bolometric X-ray flux, and the Compton $y$ parameter as observational tracers, respectively. Our training set consists of $\\sim$4800 synthetic cluster images generated from the simulation, while an additional $\\sim$3200 images form a validation set and a test set, each with 1600 images. In order to mimic real observation, these images also contain uncorrelated structures located within 50 Mpc in front and behind clusters and seen in projection, as well as instrumental systematics including noise and smoothing. In addition to CNNs for all the four observables, we also train a `multi-channel' CNN by combining the four observational tracers. The learning curves of all the five CNNs converge within 1000 epochs. The resulting predictions are especially precise for halo masses in the range $10^{13.25}M_{\\odot}50% of the total flux at arcsecond scales comes from near the horizon, and that the emission is dramatically suppressed interior to this region by a factor >10, providing direct evidence of the predicted shadow of a black hole. Across all methods, we measure a crescent diameter of 42+/-3 micro-as and constrain its fractional width to be <0.5. Associating the crescent feature with the emission surrounding the black hole shadow, we infer an angular gravitational radius of GM/Dc2 = 3.8+/- 0.4 micro-as. Folding in a distance measurement of 16.8(+0.8,-0.7) Mpc gives a black hole mass of M = 6.5 +/- 0.2(stat) +/-0.7(sys) 10^9 Msun. This measurement from lensed emission near the event horizon is consistent with the presence of a central Kerr black hole, as predicted by the general theory of relativity.", + "arxiv_url": "http://arxiv.org/abs/1906.11243v1", + "pdf_url": "http://arxiv.org/pdf/1906.11243v1", + "published_date": "2019-06-26", + "categories": [ + "astro-ph.GA", + "astro-ph.HE", + "gr-qc" + ], + "github_url": "", + "keywords": [ + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "AARTFAAC Flux Density Calibration and Northern Hemisphere Catalogue at 60 MHz", + "authors": [ + "Mark Kuiack", + "Folkert Huizinga", + "Gijs Molenaar", + "Peeyush Prasad", + "Antonia Rowlinson", + "Ralph A. M. J. Wijers" + ], + "abstract": "We present a method for calibrating the flux density scale for images generated by the Amsterdam ASTRON Radio Transient Facility And Analysis Centre (AARTFAAC). 
AARTFAAC produces a stream of all-sky images at a rate of one second in order to survey the Northern Hemisphere for short duration, low frequency transients, such as the prompt EM counterpart to gravitational wave events, magnetar flares, blazars, and other as of yet unobserved phenomena. Therefore, an independent flux density scaling solution per image is calculated via bootstrapping, comparing the measured apparent brightness of sources in the field to a reference catalogue. However, the lack of accurate flux density measurements of bright sources below 74 MHz necessitated the creation of the AARTFAAC source catalogue, at 60 MHz, which contains 167 sources across the Northern Hemisphere. Using this as a reference results in a sufficiently high number of detected sources in each image to calculate a stable and accurate flux scale per one second snapshot, in real-time.", + "arxiv_url": "http://arxiv.org/abs/1810.06430v1", + "pdf_url": "http://arxiv.org/pdf/1810.06430v1", + "published_date": "2018-10-15", + "categories": [ + "astro-ph.IM" + ], + "github_url": "", + "keywords": [ + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "ALMACAL IV: A catalogue of ALMA calibrator continuum observations", + "authors": [ + "M. Bonato", + "E. Liuzzo", + "A. Giannetti", + "M. Massardi", + "G. De Zotti", + "S. Burkutean", + "V. Galluzzi", + "M. Negrello", + "I. Baronchelli", + "J. Brand", + "M. A. Zwaan", + "K. L. J. Rygl", + "N. Marchili", + "A. Klitsch", + "I. Oteo" + ], + "abstract": "We present a catalogue of ALMA flux density measurements of 754 calibrators observed between August 2012 and September 2017, for a total of 16,263 observations in different bands and epochs. The flux densities were measured reprocessing the ALMA images generated in the framework of the ALMACAL project, with a new code developed by the Italian node of the European ALMA Regional Centre. A search in the online databases yielded redshift measurements for 589 sources ($\\sim$78 per cent of the total). Almost all sources are flat-spectrum, based on their low-frequency spectral index, and have properties consistent with being blazars of different types. To illustrate the properties of the sample we show the redshift and flux density distributions as well as the distributions of the number of observations of individual sources and of time spans in the source frame for sources observed in bands 3 (84$-$116 GHz) and 6 (211$-$275 GHz). As examples of the scientific investigations allowed by the catalogue we briefly discuss the variability properties of our sources in ALMA bands 3 and 6 and the frequency spectra between the effective frequencies of these bands. We find that the median variability index steadily increases with the source-frame time lag increasing from 100 to 800 days, and that the frequency spectra of BL Lacs are significantly flatter than those of flat-spectrum radio quasars. We also show the global spectral energy distributions of our sources over 17 orders of magnitude in frequency.", + "arxiv_url": "http://arxiv.org/abs/1805.00024v1", + "pdf_url": "http://arxiv.org/pdf/1805.00024v1", + "published_date": "2018-04-30", + "categories": [ + "astro-ph.GA" + ], + "github_url": "", + "keywords": [ + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "ProFound: Source Extraction and Application to Modern Survey Data", + "authors": [ + "A. S. G. Robotham", + "L. J. M. Davies", + "S. P. Driver", + "S. Koushan", + "D. S. Taranu", + "S. Casura", + "J. 
Liske" + ], + "abstract": "We introduce ProFound, a source finding and image analysis package. ProFound provides methods to detect sources in noisy images, generate segmentation maps identifying the pixels belonging to each source, and measure statistics like flux, size and ellipticity. These inputs are key requirements of ProFit, our recently released galaxy profiling package, where the design aim is that these two software packages will be used in unison to semi-automatically profile large samples of galaxies. The key novel feature introduced in ProFound is that all photometry is executed on dilated segmentation maps that fully contain the identifiable flux, rather than using more traditional circular or ellipse based photometry. Also, to be less sensitive to pathological segmentation issues, the de-blending is made across saddle points in flux. We apply ProFound in a number of simulated and real world cases, and demonstrate that it behaves reasonably given its stated design goals. In particular, it offers good initial parameter estimation for ProFit, and also segmentation maps that follow the sometimes complex geometry of resolved sources, whilst capturing nearly all of the flux. A number of bulge-disc decomposition projects are already making use of the ProFound and ProFit pipeline, and adoption is being encouraged by publicly releasing the software for the open source R data analysis platform under an LGPL-3 license on GitHub (github.com/asgr/ProFound).", + "arxiv_url": "http://arxiv.org/abs/1802.00937v1", + "pdf_url": "http://arxiv.org/pdf/1802.00937v1", + "published_date": "2018-02-03", + "categories": [ + "astro-ph.IM" + ], + "github_url": "https://github.com/asgr/ProFound", + "keywords": [ + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "Observational signatures of a kink-unstable coronal flux rope using Hinode/EIS", + "authors": [ + "Ben Snow", + "Gert J. J. Botha", + "Stephane Regnier", + "Richard J. Morton", + "Erwin Verwichte", + "Peter R Young" + ], + "abstract": "The signatures of energy release and energy transport for a kink-unstable coronal flux rope are investigated via forward modelling. Synthetic intensity and Doppler maps are generated from a 3D numerical simulation. The CHIANTI database is used to compute intensities for three Hinode/EIS emission lines that cover the thermal range of the loop. The intensities and Doppler velocities at simulation resolution are spatially degraded to the Hinode/EIS pixel size (1\\arcsec), convolved using a Gaussian point-spread function (3\\arcsec), and exposed for a characteristic time of 50 seconds. The synthetic images generated for rasters (moving slit) and sit-and-stare (stationary slit) are analysed to find the signatures of the twisted flux and the associated instability. We find that there are several qualities of a kink-unstable coronal flux rope that can be detected observationally using Hinode/EIS, namely the growth of the loop radius, the increase in intensity towards the radial edge of the loop, and the Doppler velocity following an internal twisted magnetic field line. However, EIS cannot resolve the small, transient features present in the simulation, such as sites of small-scale reconnection (e.g. 
nanoflares)", + "arxiv_url": "http://arxiv.org/abs/1705.05114v1", + "pdf_url": "http://arxiv.org/pdf/1705.05114v1", + "published_date": "2017-05-15", + "categories": [ + "astro-ph.SR" + ], + "github_url": "", + "keywords": [ + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "The SOFIA Massive (SOMA) Star Formation Survey. I. Overview and First Results", + "authors": [ + "James M. De Buizer", + "Mengyao Liu", + "Jonathan C. Tan", + "Yichen Zhang", + "Maria T. Beltran", + "Ralph Shuping", + "Jan E. Staff", + "Kei E. I. Tanaka", + "Barbara Whitney" + ], + "abstract": "We present an overview and first results of the Stratospheric Observatory For Infrared Astronomy Massive (SOMA) Star Formation Survey, which is using the FORCAST instrument to image massive protostars from $\\sim10$--$40\\:\\rm{\\mu}\\rm{m}$. These wavelengths trace thermal emission from warm dust, which in Core Accretion models mainly emerges from the inner regions of protostellar outflow cavities. Dust in dense core envelopes also imprints characteristic extinction patterns at these wavelengths, causing intensity peaks to shift along the outflow axis and profiles to become more symmetric at longer wavelengths. We present observational results for the first eight protostars in the survey, i.e., multiwavelength images, including some ancillary ground-based MIR observations and archival {\\it{Spitzer}} and {\\it{Herschel}} data. These images generally show extended MIR/FIR emission along directions consistent with those of known outflows and with shorter wavelength peak flux positions displaced from the protostar along the blueshifted, near-facing sides, thus confirming qualitative predictions of Core Accretion models. We then compile spectral energy distributions and use these to derive protostellar properties by fitting theoretical radiative transfer models. Zhang and Tan models, based on the Turbulent Core Model of McKee and Tan, imply the sources have protostellar masses $m_*\\sim10$--50$\\:M_\\odot$ accreting at $\\sim10^{-4}$--$10^{-3}\\:M_\\odot\\:{\\rm{yr}}^{-1}$ inside cores of initial masses $M_c\\sim30$--500$\\:M_\\odot$ embedded in clumps with mass surface densities $\\Sigma_{\\rm{cl}}\\sim0.1$--3$\\:{\\rm{g\\:cm}^{-2}}$. Fitting Robitaille et al. models typically leads to slightly higher protostellar masses, but with disk accretion rates $\\sim100\\times$ smaller. We discuss reasons for these differences and overall implications of these first survey results for massive star formation theories.", + "arxiv_url": "http://arxiv.org/abs/1610.05373v5", + "pdf_url": "http://arxiv.org/pdf/1610.05373v5", + "published_date": "2016-10-17", + "categories": [ + "astro-ph.GA", + "astro-ph.SR" + ], + "github_url": "", + "keywords": [ + "FLUX" + ], + "citations": 0, + "semantic_url": "" + }, + { + "title": "The effect of low mass substructures on the Cusp lensing relation", + "authors": [ + "Andrea V. Maccio'", + "Marco Miranda" + ], + "abstract": "It has been argued that the flux anomalies detected in gravitationally lensed QSOs are evidence for substructures in the foreground lensing haloes. In this paper we investigate this issue in greater detail focusing on the Cusp relation which corresponds to images of a source located to the cusp of the inner caustic curve. We use numerical simulations combined with a Monte Carlo approach to study the effects of the expected power law distribution of substructures within LCDM haloes on the multiple images. 
Generally, the high number of anomalous flux ratios in the cusp configurations is unlikely to be explained by 'simple' perturbers (subhaloes) inside the lensing galaxy, whether modeled as point masses or as extended NFW subhaloes. In our analysis, we considered a mass range of 10^5-10^7 Msun for the subhaloes. We also demonstrate that including the effects of the surrounding mass distribution, such as other galaxies close to the primary lens, does not change the results. We conclude that triple images of lensed QSOs do not show any direct evidence for dark dwarf galaxies such as cold dark matter substructure.", + "arxiv_url": "http://arxiv.org/abs/astro-ph/0509598v2", + "pdf_url": "http://arxiv.org/pdf/astro-ph/0509598v2", + "published_date": "2005-09-20", + "categories": [ + "astro-ph" + ], + "github_url": "", + "keywords": [ + "FLUX" + ], + "citations": 0, + "semantic_url": "" + } +] \ No newline at end of file
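
The records above follow a flat per-paper schema (title, authors, abstract, arxiv_url, pdf_url, published_date, categories, github_url, keywords, citations, semantic_url). As a minimal sketch only — assuming the array is saved to a file such as `papers.json`, a hypothetical path not specified in this diff — it could be loaded and filtered by keyword like this:

```python
import json

# Minimal sketch: load the paper records and list recent DiT image-generation
# entries. The filename "papers.json" is an assumption; the field names match
# the schema used in the array above.
with open("papers.json", encoding="utf-8") as f:
    papers = json.load(f)

selected = [
    p for p in papers
    if "diffusion transformer" in p["keywords"] and "image generation" in p["keywords"]
]

# ISO dates (YYYY-MM-DD) sort correctly as strings, newest first.
selected.sort(key=lambda p: p["published_date"], reverse=True)

for p in selected[:5]:
    print(f'{p["published_date"]}  {p["title"]}  ({p["arxiv_url"]})')
```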