A Survey on Video Diffusion Models

Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, Yu-Gang Jiang

(Source: Make-A-Video, SimDA, PYoCo, SVD, Video LDM and Tune-A-Video)

[News] The updated version is available on arXiv.
[News] Our survey is accepted by ACM Computing Surveys (CSUR).
[News] The Chinese translation is available on Zhihu. Special thanks to Dai-Wenxun for this.

Contact

If you have any suggestions or find our work helpful, feel free to contact us.

If you find our survey useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

@article{xing2023survey,
  title={A survey on video diffusion models},
  author={Xing, Zhen and Feng, Qijun and Chen, Haoran and Dai, Qi and Hu, Han and Xu, Hang and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={ACM Computing Surveys},
  year={2023},
  publisher={ACM New York, NY}
}

Open-source Toolboxes and Foundation Models

Methods	Task	Github
Movie Gen	T2V Generation	-
CogVideoX	T2V Generation
Open-Sora-Plan	T2V Generation
Open-Sora	T2V Generation
Morph Studio	T2V Generation	-
Genie	T2V Generation	-
Sora	T2V Generation & Editing	-
VideoPoet	T2V Generation & Editing	-
Stable Video Diffusion	T2V Generation
NeverEnds	T2V Generation	-
Pika	T2V Generation	-
EMU-Video	T2V Generation	-
GEN-2	T2V Generation & Editing	-
ModelScope	T2V Generation
ZeroScope	T2V Generation	-
T2V Synthesis Colab	T2V Generation
VideoCraft	T2V Generation & Editing
Diffusers (T2V synthesis)	T2V Generation	-
AnimateDiff	Personalized T2V Generation
Text2Video-Zero	T2V Generation
HotShot-XL	T2V Generation
Genmo	T2V Generation	-
Fliki	T2V Generation	-

Video Generation
- Data
- - Caption-level
- - Category-level
- T2V Generation
- - Training-based
- - Training-free
- Video Generation with other Conditions
- - Pose-guided
- - Instruct-guided
- - Sound-guided
- - Brain-guided
- - Multi-Modal guided
- Unconditional Video Generation
- - U-Net based
- - Transformer-based
- Video Completion
- - Video Enhance and Restoration
- - Video Prediction
Video Editing
- Text guided Video Editing
- - Training-based Editing
- - One-shot Editing
- - Training-free
- Modality-guided Video Editing
- - Motion-guided
- - Instruct-guided
- - Sound-guided
- - Multi-Modal Control
- Domain-specific editing
- Non-diffusion editing
Video Understanding
Contact

Video Generation

Data

Caption-level

Title	Github	WebSite	Pub. & Date
Identity-Preserving Text-to-Video Generation by Frequency Decomposition			Nov., 2024
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation			NeurIPS, 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers			CVPR, 2024
CelebV-Text: A Large-Scale Facial Text-Video Dataset		-	CVPR, 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation		-	May, 2023
VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation	-	-	May, 2023
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions	-	-	Nov, 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	-	-	ICCV, 2021
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language	-	-	CVPR, 2016

Category-level

Title	Github	WebSite	Pub. & Date
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild	-	-	Dec., 2012
First Order Motion Model for Image Animation	-	-	NeurIPS, 2019
Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks	-	-	CVPR, 2018

Metric and Benchmark

Title	Github	WebSite	Pub. & Date
Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos		-	Jul., 2024
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation			NeurIPS, 2024
STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models		-	ICLR, 2024
Subjective-Aligned Dateset and Metric for Text-to-Video Quality Assessment	-	-	Mar, 2024
Towards A Better Metric for Text-to-Video Generation	-		Jan, 2024
AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI	-	-	Jan, 2024
VBench: Comprehensive Benchmark Suite for Video Generative Models			Nov, 2023
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation	-	-	NeurIPS, 2023
CVPR 2023 Text Guided Video Editing Competition	-	-	Oct., 2023
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models			Oct., 2023
Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset	-	-	Sep., 2023

Text-to-Video Generation

Training-based

Title	arXiv	Github	WebSite	Pub. & Date
Identity-Preserving Text-to-Video Generation by Frequency Decomposition				Nov., 2024
Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning				NeurIPS 2024
Movie Gen		-		Oct, 2024
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer			-	Oct, 2024
Grid Diffusion Models for Text-to-Video Generation				CVPR, 2024
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators				Apr., 2024
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework		-	-	Mar., 2024
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis		-	-	Mar., 2024
Genie: Generative Interactive Environments		-		Feb., 2024
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis		-		Feb., 2024
Lumiere: A Space-Time Diffusion Model for Video Generation		-		Jan, 2024
UniVG: Towards Unified-Modal Video Generation		-		Jan, 2024
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models				Jan, 2024
360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model		-		Jan, 2024
MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation		-		Jan, 2024
VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM		-		Jan, 2024
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos				Dec, 2023
InstructVideo: Instructing Video Diffusion Models with Human Feedback				Dec, 2023
VideoLCM: Video Latent Consistency Model		-	-	Dec, 2023
Photorealistic Video Generation with Diffusion Models		-		Dec, 2023
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation				Dec, 2023
Delving Deep into Diffusion Transformers for Image and Video Generation		-		Dec, 2023
StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter				Nov, 2023
MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation		-		Nov, 2023
ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models				Nov, 2023
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets				Nov, 2023
FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline				Nov, 2023
MoVideo: Motion-Aware Video Generation with Diffusion Models		-		Nov, 2023
Make Pixels Dance: High-Dynamic Video Generation		-		Nov, 2023
Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning		-		Nov, 2023
Optimal Noise pursuit for Augmenting Text-to-Video Generation		-	-	Nov, 2023
VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning		-		Nov, 2023
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation				Oct, 2023
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction				Oct, 2023
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors				Oct., 2023
LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation				Oct., 2023
DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model				Oct, 2023
MotionDirector: Motion Customization of Text-to-Video Diffusion Models				Oct, 2023
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning				Sep., 2023
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation				Sep., 2023
LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models				Sep., 2023
Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation				Sep., 2023
VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation		-		Sep., 2023
MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text		-	-	Jul., 2023
Text2Performer: Text-Driven Human Video Generation				Apr., 2023
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning				Jul., 2023
Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models		-		Aug., 2023
SimDA: Simple Diffusion Adapter for Efficient Video Generation				CVPR, 2024
Dual-Stream Diffusion Net for Text-to-Video Generation		-	-	Aug., 2023
ModelScope Text-to-Video Technical Report				Aug., 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation			-	Jul., 2023
VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation		-	-	May, 2023
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models		-		May, 2023
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models		-		-
Latent-Shift: Latent Diffusion with Temporal Shift		-		-
Probabilistic Adaptation of Text-to-Video Models		-		Jun., 2023
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation		-		Mar., 2023
ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation	-	-	-	IJCNN, 2023
MagicVideo: Efficient Video Generation With Latent Diffusion Models		-		-
Phenaki: Variable Length Video Generation From Open Domain Textual Description		-		-
Imagen Video: High Definition Video Generation With Diffusion Models		-		-
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation				-
MAGVIT: Masked Generative Video Transformer		-		Dec., 2022
Make-A-Video: Text-to-Video Generation without Text-Video Data		-		-
Latent Video Diffusion Models for High-Fidelity Video Generation With Arbitrary Lengths				Nov., 2022
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers			-	May, 2022
Video Diffusion Models		-		-

Training-free

Title	Github	WebSite	Pub. & Date
VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models			Mar, 2024
TrailBlazer: Trajectory Control for Diffusion-Based Video Generation			Jan, 2024
FreeInit: Bridging Initialization Gap in Video Diffusion Models			Dec, 2023
MTVG: Multi-text Video Generation with Text-to-Video Models	-		Dec, 2023
F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis	-	-	Nov, 2023
AdaDiff: Adaptive Step Selection for Fast Diffusion	-	-	Nov, 2023
FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax			Nov, 2023
🏀GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning			Nov, 2023
FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling			Oct, 2023
ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation			Oct, 2023
LLM-grounded Video Diffusion Models			Oct, 2023
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator		-	NeurIPS, 2023
DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis			Aug, 2023
Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation		-	May, 2023
Text2video-Zero: Text-to-Image Diffusion Models Are Zero-Shot Video Generators			Mar., 2023
PEEKABOO: Interactive Video Generation via Masked-Diffusion 🫣			CVPR, 2024

Video Generation with other conditions

Pose-guided Video Generation

Title	Github	WebSite	Pub. & Date
🔥🔥StableAnimator: High-Quality Identity-Preserving Human Image Animation🔥🔥			Nov., 2024
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model			ECCV 2024
MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance			Jul., 2024
Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance			Mar., 2024
Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions	-	-	Mar., 2024
Do You Guys Want to Dance: Zero-Shot Compositional Human Dance Generation with Multiple Persons	-	-	Jan., 2024
DreaMoving: A Human Dance Video Generation Framework based on Diffusion Models	-		Dec., 2023
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model			Nov., 2023
Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation			Nov., 2023
MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer			Nov., 2023
DisCo: Disentangled Control for Referring Human Dance Generation in Real World			Jul., 2023
Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model	-	-	Aug., 2023
DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion			Apr., 2023
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos			Apr., 2023

Motion-guided Video Generation

Title	Github	WebSite	Pub. & Date
MotionClone: Training-Free Motion Cloning for Controllable Video Generation			Jun., 2024
Tora: Trajectory-oriented Diffusion Transformer for Video Generation			Jul., 2024
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model			ECCV 2024
Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance			Mar., 2024
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling	-	-	Jan., 2024
Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation	-	-	Jan., 2024
Customizing Motion in Text-to-Video Diffusion Models	-		Dec., 2023
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models			CVPR 2024
AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance			Nov., 2023
Motion-Conditioned Diffusion Model for Controllable Video Synthesis	-		Apr., 2023
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory	-	-	Aug., 2023

Sound-guided Video Generation

Title	Github	WebSite	Pub. & Date
Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation			Jun., 2024
Context-aware Talking Face Video Generation	-	-	Feb., 2024
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions			Feb., 2024
The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion	-	-	ICCV, 2023
Generative Disco: Text-to-Video Generation for Music Visualization	-	-	Apr., 2023
AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion	-	-	CVPRW, 2023

Image-guided Video Generation

Title	Github	WebSite	Pub. & Date
Identity-Preserving Text-to-Video Generation by Frequency Decomposition			Nov., 2024
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation			ECCV 2024
TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models			CVPR 2024
Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model			NeurIPS 2024
Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation	-		Mar., 2024
AtomoVideo: High Fidelity Image-to-Video Generation	-		Mar., 2024
Animated Stickers: Bringing Stickers to Life with Video Diffusion	-	-	Feb., 2024
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation	-		Feb., 2024
I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models	-	-	Dec., 2023
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models	-		Dec., 2023
DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance	-		Nov., 2023
LivePhoto: Real Image Animation with Text-guided Motion Control			Nov., 2023
VideoBooth: Diffusion-based Video Generation with Image Prompts			Nov., 2023
Decouple Content and Motion for Conditional Image-to-Video Generation	-	-	Nov, 2023
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models	-	-	Nov, 2023
Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image	-	-	MM, 2023
Generative Image Dynamics	-		Sep., 2023
LaMD: Latent Motion Diffusion for Video Generation	-	-	Apr., 2023
Conditional Image-to-Video Generation with Latent Flow Diffusion Models		-	CVPR 2023
NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis			CVPR 2022

Brain-guided Video Generation

Title	arXiv	Github	WebSite	Pub. & Date
NeuroCine: Decoding Vivid Video Sequences from Human Brain Activties		-	-	Feb., 2024
Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity				NeurIPS, 2023

Depth-guided Video Generation

Title	arXiv	Github	WebSite	Pub. & Date
StableV2V: Stablizing Shape Consistency in Video-to-Video Editing				Nov., 2024
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation				Jul., 2023
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance				Jun., 2023

Multi-modal guided Video Generation

Title	Github	WebSite	Pub. & Date
UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control	-	-	Mar., 2024
Magic-Me: Identity-Specific Video Customized Diffusion	-		Feb., 2024
InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions	-		Feb., 2024
Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion	-		Feb., 2024
Boximator: Generating Rich and Controllable Motions for Video Synthesis	-		Feb., 2024
AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning	-	-	Jan., 2024
ActAnywhere: Subject-Aware Video Background Generation	-		Jan., 2024
CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects	-	-	Jan., 2024
MoonShot: Towards Controllable Video Generation and Editing with Multimodal Conditions			Jan., 2024
PEEKABOO: Interactive Video Generation via Masked-Diffusion	-		Dec., 2023
CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling	-	-	Dec., 2023
Fine-grained Controllable Video Generation via Object Appearance and Context	-		Nov., 2023
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation	-		Nov., 2023
Panacea: Panoramic and Controllable Video Generation for Autonomous Driving	-		Nov., 2023
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models	-		Nov., 2023
VideoComposer: Compositional Video Synthesis with Motion Controllability			Jun., 2023
NExT-GPT: Any-to-Any Multimodal LLM	-	-	Sep, 2023
MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images	-		Jun, 2023
Any-to-Any Generation via Composable Diffusion			May, 2023
Mm-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation		-	CVPR 2023

Unconditional Video Generation

U-Net based

Title	WebSite	Pub. & Date
Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation		Feb. 2024
Video Probabilistic Diffusion Models in Projected Latent Space		CVPR 2023
VIDM: Video Implicit Diffusion Models		AAAI 2023
GD-VDM: Generated Depth for better Diffusion-based Video Generation	-	Jun., 2023
LEO: Generative Latent Image Animator for Human Video Synthesis		May, 2023

Transformer based

Title	WebSite	Pub. & Date
Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach	-	Oct., 2024
Latte: Latent Diffusion Transformer for Video Generation		Jan., 2024
VDT: An Empirical Study on Video Diffusion with Transformers	-	May, 2023
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer		May, 2023

Video Completion

Video Enhancement and Restoration

Title	arXiv	Github	WebSite	Pub. & Date
Towards Language-Driven Video Inpainting via Multimodal Large Language Models				Jan., 2024
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution	-	-	-	WACW, 2023
Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution				Dec., 2023
AVID: Any-Length Video Inpainting with Diffusion Model				Dec., 2023
Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution			-	CVPR 2023
LDMVFI: Video Frame Interpolation with Latent Diffusion Models		-	-	Mar., 2023
CaDM: Codec-aware Diffusion Modeling for Neural-enhanced Video Streaming		-	-	Nov., 2022
Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos		-	-	May, 2023

Video Prediction

Title	Github	Website	Pub. & Date
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction			Jun, 2024
STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction		-	Dec, 2023
Video Diffusion Models with Local-Global Context Guidance		-	IJCAI, 2023
Seer: Language Instructed Video Prediction with Latent Diffusion Models	-		Mar., 2023
MaskViT: Masked Visual Pre-Training for Video Prediction			Jun, 2022
Diffusion Models for Video Prediction and Infilling			TMLR 2022
McVd: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation			NeurIPS 2022
Diffusion Probabilistic Modeling for Video Generation		-	Mar., 2022
Flexible Diffusion Modeling of Long Videos			May, 2022
Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models			May, 2023

Video Editing

General Editing Model

Title	Github	Website	Pub. Date
VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing			Jun, 2024
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation	-	-	Mar., 2024
FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing	-	-	Mar., 2024
DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing	-		Mar, 2024
Video Editing via Factorized Diffusion Distillation	-	-	Mar, 2024
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis			Dec, 2023
MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers	-		Dec, 2023
Neutral Editing Framework for Diffusion-based Video Editing	-		Dec, 2023
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence	-		Nov, 2023
VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models			Nov, 2023
Motion-Conditioned Image Animation for Video Editing	-		Nov, 2023
MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation	-	-	Sep, 2023
MagicEdit: High-Fidelity and Temporally Coherent Video Editing	-	-	Aug, 2023
Edit Temporal-Consistent Videos with Image Diffusion Model	-	-	Aug, 2023
Structure and Content-Guided Video Synthesis With Diffusion Models	-		ICCV, 2023
Dreamix: Video Diffusion Models Are General Video Editors	-		Feb, 2023

Training-free Editing Model

Title	Github	Website	Pub. Date
MVOC: a training-free multiple video object composition method with diffusion models			Jun, 2024
VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing			Jun, 2024
EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing			March, 2024
UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing	-		Feb, 2024
Object-Centric Diffusion for Efficient Video Editing	-	-	Jan, 2024
RealCraft: Attention Control as A Solution for Zero-shot Long Video Editing	-	-	Dec, 2023
VidToMe: Video Token Merging for Zero-Shot Video Editing			Dec, 2023
A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing			Dec, 2023
AnimateZero: Video Diffusion Models are Zero-Shot Image Animators		-	Dec, 2023
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models			Dec, 2023
BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models	-		Nov., 2023
Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion	-	-	Nov., 2023
FastBlend: a Powerful Model-Free Toolkit Making Video Stylization Easier		-	Oct., 2023
LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation	-	-	Nov., 2023
Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models	-	-	Oct., 2023
LOVECon: Text-driven Training-Free Long Video Editing with ControlNet		-	Oct., 2023
FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing	-		Oct., 2023
Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models			ICLR, 2024
MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance	-	-	Aug., 2023
EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints	-	-	Aug., 2023
ControlVideo: Training-free Controllable Text-to-Video Generation		-	May, 2023
TokenFlow: Consistent Diffusion Features for Consistent Video Editing			Jul., 2023
VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing	-		Jun., 2023
Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation	-		Jun., 2023
Zero-Shot Video Editing Using Off-the-Shelf Image Diffusion Models			Mar., 2023
FateZero: Fusing Attentions for Zero-shot Text-based Video Editing			Mar., 2023
Pix2video: Video Editing Using Image Diffusion	-		Mar., 2023
InFusion: Inject and Attention Fusion for Multi Concept Zero Shot Text based Video Editing	-		Aug., 2023
Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising			May, 2023

One-shot Editing Model

Title	Github	Website	Pub. & Date
Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models	-		Feb., 2024
MotionCrafter: One-Shot Motion Customization of Diffusion Models		-	Dec., 2023
DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing	-		Dec., 2023
MotionEditor: Editing Video Motion via Content-Aware Diffusion			CVPR, 2024
Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning	-		Nov., 2023
Cut-and-Paste: Subject-Driven Video Editing with Attention Control	-	-	Nov, 2023
StableVideo: Text-driven Consistency-aware Diffusion Video Editing			ICCV, 2023
Shape-aware Text-driven Layered Video Editing	-	-	CVPR, 2023
SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-guided Video Editing		-	May, 2023
Towards Consistent Video Editing with Text-to-Image Diffusion Models	-	-	Mar., 2023
Edit-A-Video: Single Video Editing with Object-Aware Consistency	-		Mar., 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation			ICCV, 2023
ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing			May, 2023
Video-P2P: Video Editing with Cross-attention Control			Mar., 2023
SinFusion: Training Diffusion Models on a Single Image or Video			Nov., 2022

Instruct-guided Video Editing

Title	Github	Website	Pub. Date
VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing			Jun, 2024
EffiVED: Efficient Video Editing via Text-instruction Diffusion Models	-	-	Mar, 2024
Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis	-		Dec, 2023
Neural Video Fields Editing			Dec, 2023
VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models			Nov, 2023
Consistent Video-to-Video Transfer Using Synthetic Dataset	-	-	Nov., 2023
InstructVid2Vid: Controllable Video Editing with Natural Language Instructions	-	-	May, 2023
Collaborative Score Distillation for Consistent Visual Synthesis	-	-	July, 2023

Motion-guided Video Editing

Title	Github	Website	Pub. Date
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation			Nov, 2023
Drag-A-Video: Non-rigid Video Editing with Point-based Interaction	-		Nov, 2023
DragVideo: Interactive Drag-style Video Editing		-	Nov, 2023
VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet	-		July, 2023

Sound-guided Video Editing

Title	arXiv	Github	Website	Pub. Date
Speech Driven Video Editing via an Audio-Conditioned Diffusion Model		-	-	May, 2023
Soundini: Sound-Guided Diffusion for Natural Video Editing				Apr., 2023

Multi-modal Control Editing Model

Title	arXiv	Github	Website	Pub. Date
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks	-			Dec, 2023
Motionshop: An application of replacing the characters in video with 3D avatars	-			Dec, 2023
Anything in Any Scene: Photorealistic Video Object Insertion				Jan, 2024
DreamVideo: Composing Your Dream Videos with Customized Subject and Motion				Dec, 2023
MagicStick: Controllable Video Editing via Control Handle Transformations				Nov, 2023
SAVE: Protagonist Diversification with Structure Agnostic Video Editing		-		Nov, 2023
MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation		-	-	May, 2023
CCEdit: Creative and Controllable Video Editing via Diffusion Models		-	-	Sep, 2023
Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts				May, 2023

Domain-specific Editing Model

Title	Github	Website	Pub. Date
Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation	-		Jan. 2024
Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models	-		Jan. 2024
Training-Free Semantic Video Composition via Pre-trained Diffusion Model	-	-	Jan, 2024
Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models	-		CVPR 2023
Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator	-	-	May, 2023
DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis	-	-	Aug, 2023
Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer		-	May, 2023
Instruct-Video2Avatar: Video-to-Avatar Generation with Instructions		-	Jun, 2023
Video Colorization with Pre-trained Text-to-Image Diffusion Models			Jun, 2023
Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding			CVPR 2023

Non-diffusion Editing model

Title	Github	Pub. Date
DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing	-	Oct., 2023
INVE: Interactive Neural Video Editing	-	Jul., 2023
Shape-Aware Text-Driven Layered Video Editing	-	Jan., 2023

Video Understanding

Title	Github	Website	Pub. Date
EchoReel: Enhancing Action Generation of Existing Video Diffusion Models	-	-	Mar., 2024
VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model	-	-	Mar., 2024
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion	-	-	Mar., 2024
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models	-	-	Mar., 2024
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation	-	-	Mar., 2024
DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction	-	-	Mar., 2024
Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval	-	-	Jan., 2024
Diffusion Reward: Learning Rewards via Conditional Video Diffusion			Dec., 2023
ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models	-		Nov., 2023
Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models		-	Nov., 2023
Flow-Guided Diffusion for Video Inpainting		-	Nov., 2023
Breathing Life Into Sketches Using Text-to-Video Priors	-	-	Nov., 2023
Infusion: Internal Diffusion for Video Inpainting	-	-	Nov., 2023
DiffusionVMR: Diffusion Model for Video Moment Retrieval	-	-	Aug., 2023
DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation	-	-	Aug., 2023
CoTracker: It is Better to Track Together			Aug., 2023
Unsupervised Video Anomaly Detection with Diffusion Models Conditioned on Compact Motion Representations	-	-	ICIAP, 2023
Exploring Diffusion Models for Unsupervised Video Anomaly Detection	-	-	Apr., 2023
Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection	-	-	ICCV, 2023
Diffusion Action Segmentation	-	-	Mar., 2023
DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion			Mar., 2023
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model		-	ICCV, 2023
MomentDiff: Generative Video Moment Retrieval from Random to Real			Jul., 2023
Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition			Feb., 2023
Refined Semantic Enhancement Towards Frequency Diffusion for Video Captioning	-	-	Nov., 2022
A Generalist Framework for Panoptic Segmentation of Images and Videos			Oct., 2022
DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models	-	-	Jul., 2023
CaDM: Codec-aware Diffusion Modeling for Neural-enhanced Video Streaming	-	-	Mar., 2023
Spatial-temporal Transformer-guided Diffusion based Data Augmentation for Efficient Skeleton-based Action Recognition	-	-	Jul., 2023
PDPP: Projected Diffusion for Procedure Planning in Instructional Videos		-	CVPR 2023
