Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers
Recent advancements in 3D reconstruction from single images have been driven by the evolution of generative models. Prominent among these are methods based on Score Distillation Sampling (SDS) and the adaptation of diffusion models to the 3D domain. Despite their progress, these techniques often suffer from slow optimization or rendering processes, leading to extensive training and optimization times. In this paper, we introduce a novel approach for single-view reconstruction that efficiently generates a 3D model from a single image via feed-forward inference. Our method utilizes two transformer-based networks, namely a point decoder and a triplane decoder, to reconstruct 3D objects using a hybrid Triplane-Gaussian intermediate representation. This hybrid representation strikes a balance: it renders faster than implicit representations while delivering higher rendering quality than explicit ones. The point decoder generates a point cloud from the single input image, providing an explicit representation that the triplane decoder then uses to query Gaussian features for each point. This design sidesteps the challenge of directly regressing explicit 3D Gaussian attributes, which are unstructured in nature. Subsequently, the queried features are decoded by an MLP into 3D Gaussians, enabling rapid rendering through splatting. Both decoders are built upon a scalable, transformer-based architecture and are trained efficiently on large-scale 3D datasets. Evaluations on both synthetic datasets and real-world images demonstrate that our method not only achieves higher quality but also runs faster than previous state-of-the-art techniques.
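To make the hybrid representation concrete, the following is a minimal PyTorch sketch of the per-point decoding step the abstract describes: project each point of the point cloud onto three axis-aligned feature planes, sample and concatenate the triplane features, and decode them with an MLP into 3D Gaussian attributes. All module names, feature sizes, the attribute layout, and the activation choices are illustrative assumptions, not the authors' exact design.

```python
# A sketch of triplane feature querying plus an MLP Gaussian head,
# under assumed shapes; not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_triplane(triplane, points):
    """Bilinearly sample per-point features from three axis-aligned planes.

    triplane: (3, C, H, W) feature planes for the XY, XZ, and YZ planes.
    points:   (N, 3) point positions, assumed normalized to [-1, 1].
    Returns:  (N, 3*C) concatenated per-point features.
    """
    # 2D projections of each point onto the three planes.
    coords = torch.stack([points[:, [0, 1]],   # XY plane
                          points[:, [0, 2]],   # XZ plane
                          points[:, [1, 2]]])  # YZ plane
    # grid_sample expects a (B, H_out, W_out, 2) grid; use a 1 x N "image".
    grid = coords.unsqueeze(1)                                  # (3, 1, N, 2)
    feats = F.grid_sample(triplane, grid, align_corners=True)   # (3, C, 1, N)
    return feats.squeeze(2).permute(2, 0, 1).reshape(points.shape[0], -1)

class GaussianHead(nn.Module):
    """MLP decoding triplane features into per-point 3D Gaussian attributes."""
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        # Assumed output layout: 3 position offset, 1 opacity,
        # 3 scale, 4 rotation quaternion, 3 RGB color = 14 values.
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 14),
        )

    def forward(self, points, feats):
        out = self.mlp(feats)
        return {
            "position": points + 0.01 * torch.tanh(out[:, 0:3]),  # small refinement
            "opacity": torch.sigmoid(out[:, 3:4]),
            "scale": torch.exp(out[:, 4:7].clamp(max=4)),
            "rotation": F.normalize(out[:, 7:11], dim=-1),        # unit quaternion
            "color": torch.sigmoid(out[:, 11:14]),
        }

# Toy usage with stand-ins for the two decoders' outputs.
points = torch.rand(1024, 3) * 2 - 1    # stand-in for the point decoder output
triplane = torch.randn(3, 32, 64, 64)   # stand-in for the triplane decoder output
gaussians = GaussianHead(in_dim=3 * 32)(points, sample_triplane(triplane, points))
print({k: v.shape for k, v in gaussians.items()})
```

The design choice the sketch illustrates: the point cloud gives each Gaussian an explicit anchor, so the MLP only has to regress local attributes (and a small positional refinement) from structured triplane features, rather than predicting unstructured Gaussian parameters from scratch.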