We introduce RealmDreamer, a technique for generating general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize the splats by using state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model conditioned on samples from the inpainting model, which provides rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, each consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.
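To illustrate what lifting a generated image into 3D can look like, here is a minimal sketch that unprojects a per-pixel depth map into camera-space points, each of which could seed the center of one Gaussian. It assumes a pinhole camera and an already available depth estimate. The abstract does not specify the lifting procedure, so the function name `unproject_to_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical, and this is not presented as the authors' implementation.

```python
def unproject_to_points(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to camera-space (x, y, z) points.

    Assumption: pinhole camera with focal lengths fx, fy and principal
    point (cx, cy); depth[v][u] is the z-distance of pixel (u, v).
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            # Invert the pinhole projection u = fx * x / z + cx (same for v).
            x = (u - cx) / fx * z
            y = (v - cy) / fy * z
            points.append((x, y, z))
    return points


# Toy input: a flat 2x2 depth map at distance 2 (illustrative values only).
depth = [[2.0, 2.0], [2.0, 2.0]]
points = unproject_to_points(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

With real data, the depth map would come from a depth estimator applied to the generated image, and regions hidden behind the unprojected points from the reference view would still need to be filled in from other viewpoints.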