With the growing demand for automated 3D content creation pipelines, we tackle the challenge of efficiently reconstructing a 3D asset from a single image. Previous methods primarily rely on Score Distillation Sampling (SDS) and Neural Radiance Fields (NeRF). Despite their significant success, these approaches face practical limitations due to lengthy optimization and considerable memory usage. In this report, we introduce Gamba, an end-to-end amortized model for 3D reconstruction from single-view images, built on two main insights: (1) 3D representation: leveraging a large number of 3D Gaussians for an efficient 3D Gaussian splatting process; (2) Backbone design: introducing a Mamba-based sequential network that enables context-dependent reasoning and scales linearly with sequence (token) length, accommodating a substantial number of Gaussians. Gamba also incorporates significant advances in data preprocessing, regularization design, and training methodology. We evaluated Gamba against existing optimization-based and feed-forward 3D generation approaches on the real-world scanned OmniObject3D dataset, where it demonstrates competitive generation quality, both qualitatively and quantitatively, while achieving remarkable speed: approximately 0.6 seconds on a single NVIDIA A100 GPU.
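Mamba's selective scan is considerably more elaborate, but the linear scaling with token length that motivates the backbone choice can be illustrated with a toy diagonal state-space recurrence. This is a sketch for intuition only, not Gamba's actual implementation; the function name `ssm_scan` and the scalar-input setup are illustrative assumptions.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Linear-time scan of a toy diagonal state-space model.

    Recurrence: h_t = a * h_{t-1} + b * x_t ;  y_t = <c, h_t>.
    One pass over the sequence gives O(T) cost in sequence length T,
    in contrast to the O(T^2) pairwise interactions of self-attention.
    """
    n = a.shape[0]
    h = np.zeros(n)                       # hidden state, updated in place
    y = np.empty(x.shape[0], dtype=float)
    for t, xt in enumerate(x):
        h = a * h + b * xt                # elementwise state update
        y[t] = c @ h                      # readout at step t
    return y
```

Because each token touches the state exactly once, memory and compute grow linearly with the number of tokens, which is what makes emitting a very long sequence of Gaussians tractable.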