We introduce a diffusion model for Gaussian Splats, SplatDiffusion, to enable generation of three-dimensional structures from single images, addressing the ill-posed nature of lifting 2D inputs to 3D. Existing methods rely on deterministic, feed-forward predictions, which limit their ability to handle the inherent ambiguity of 3D inference from 2D data. Diffusion models have recently shown promise as powerful generative models for 3D data, including Gaussian splats; however, standard diffusion frameworks typically require the target signal and denoised signal to be in the same modality, which is challenging given the scarcity of 3D data. To overcome this, we propose a novel training strategy that decouples the denoised modality from the supervision modality. By using a deterministic model as a noisy teacher to create the noised signal and transitioning from single-step to multi-step denoising supervised by an image rendering loss, our approach significantly enhances performance compared to the deterministic teacher. Additionally, our method is flexible, as it can learn from various 3D Gaussian Splat (3DGS) teachers with minimal adaptation; we demonstrate this by surpassing the performance of two different deterministic models as teachers, highlighting the potential generalizability of our framework. Our approach further incorporates a guidance mechanism to aggregate information from multiple views, enhancing reconstruction quality when more than one view is available. Experimental results on object-level and scene-level datasets demonstrate the effectiveness of our framework.
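For illustration only, the sketch below outlines the decoupled training step described above, in the single-step denoising phase: a frozen deterministic teacher provides the splat signal that gets noised, while supervision comes from an image rendering loss rather than from ground-truth splats. This is a minimal stand-in, not the paper's implementation; the `teacher`, `student`, and `renderer` modules, the flattened splat tensor layout, the toy sizes, and the simple linear noising schedule are all assumptions made for the sake of a runnable example.

```python
# Minimal sketch of the decoupled training strategy.
# Assumptions: splats are flattened into one parameter tensor per sample;
# teacher / student / renderer are toy stand-ins, not the paper's networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_SPLATS, SPLAT_DIM, IMG_DIM = 256, 14, 3 * 32 * 32   # toy sizes (assumed)

teacher = nn.Linear(IMG_DIM, N_SPLATS * SPLAT_DIM)               # frozen deterministic 3DGS predictor (stand-in)
student = nn.Linear(N_SPLATS * SPLAT_DIM + 1, N_SPLATS * SPLAT_DIM)  # denoiser conditioned on the timestep
renderer = nn.Linear(N_SPLATS * SPLAT_DIM, IMG_DIM)              # differentiable rasterizer (stand-in)

opt = torch.optim.Adam(student.parameters(), lr=1e-4)
T = 1000  # number of diffusion steps

def training_step(input_img, target_img):
    # "Noisy teacher": its prediction is only used to build the noised input x_t.
    with torch.no_grad():
        splats_teacher = teacher(input_img)
    t = torch.randint(1, T, (input_img.shape[0], 1)).float() / T
    noise = torch.randn_like(splats_teacher)
    # Simple linear interpolation noising schedule (a stand-in for the actual schedule).
    x_t = (1 - t) * splats_teacher + t * noise
    # Single denoising step: predict clean splats from the noised signal and timestep.
    splats_pred = student(torch.cat([x_t, t], dim=-1))
    # Supervise in image space via a rendering loss, not against teacher splats.
    rendered = renderer(splats_pred)
    loss = F.mse_loss(rendered, target_img)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage with random tensors in place of real images.
imgs = torch.randn(4, IMG_DIM)
print(training_step(imgs, imgs))
```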