Aided by text-to-image and text-to-video diffusion models, existing 4D content creation pipelines utilize score distillation sampling to optimize the entire dynamic 3D scene. However, because these pipelines generate 4D content from text or image inputs, they demand substantial time and effort in trial-and-error prompt engineering. This work introduces 4DGen, a novel, holistic framework for grounded 4D content creation that decomposes the 4D generation task into multiple stages. We identify static 3D assets and monocular video sequences as key components in constructing the 4D content. Our pipeline facilitates conditional 4D generation, enabling users to specify geometry (3D assets) and motion (monocular videos), thus offering superior control over content creation. Furthermore, we construct our 4D representation with dynamic 3D Gaussians, which permits efficient, high-resolution supervision through rendering during training and thereby facilitates high-quality 4D generation. Additionally, we employ spatial-temporal pseudo labels on anchor frames, along with seamless consistency priors implemented through 3D-aware score distillation sampling and smoothness regularizations. Compared with existing baselines, our approach yields competitive results in faithfully reconstructing input signals and in realistically inferring renderings at novel viewpoints and timesteps. Most importantly, our method supports grounded generation, offering users enhanced control, a capability that is difficult to achieve with previous methods.
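To make the supervision scheme described above concrete, a minimal sketch of a combined training objective is given below: renderings of the dynamic 3D Gaussians are compared against the input video frames ($\mathcal{L}_{\mathrm{recon}}$) and against spatial-temporal pseudo labels on anchor frames ($\mathcal{L}_{\mathrm{pseudo}}$), while a 3D-aware score distillation term and a smoothness term act as consistency priors. The particular term names and the weights $\lambda$ are illustrative assumptions, not the paper's exact formulation.

\[
\mathcal{L}_{\mathrm{total}}
\;=\;
\mathcal{L}_{\mathrm{recon}}
\;+\;
\lambda_{\mathrm{pseudo}}\,\mathcal{L}_{\mathrm{pseudo}}
\;+\;
\lambda_{\mathrm{sds}}\,\mathcal{L}_{\mathrm{3D\text{-}SDS}}
\;+\;
\lambda_{\mathrm{smooth}}\,\mathcal{L}_{\mathrm{smooth}}
\]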