We introduce GRM, a large-scale reconstructor capable of recovering a 3D asset from sparse-view images in around 0.1s. GRM is a feed-forward transformer-based model that efficiently incorporates multi-view information to translate the input pixels into pixel-aligned Gaussians, which are unprojected to create a set of densely distributed 3D Gaussians representing a scene. Together, our transformer architecture and the use of 3D Gaussians unlock a scalable and efficient reconstruction framework. Extensive experimental results demonstrate the superiority of our method over alternatives in terms of both reconstruction quality and efficiency. We also showcase the potential of GRM in generative tasks, i.e., text-to-3D and image-to-3D, by integrating it with existing multi-view diffusion models.
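To make the pixel-aligned unprojection step concrete, the following is a minimal PyTorch sketch, not GRM's actual code: the function name `unproject_pixel_gaussians`, the tensor layout, and the choice of a per-pixel depth parameterization are all illustrative assumptions. The idea is that each pixel's predicted depth lifts that pixel along its camera ray to place a Gaussian center in world space, while the remaining per-pixel attributes (rotation, scale, opacity, color) stay attached to the same pixel.

```python
import torch

def unproject_pixel_gaussians(depth, intrinsics, cam_to_world):
    """Lift per-pixel depth predictions to 3D Gaussian centers.

    A sketch of pixel-aligned unprojection (hypothetical helper, not GRM's API).
    depth:        (H, W) per-pixel depth predicted by the transformer
    intrinsics:   (3, 3) camera intrinsic matrix K
    cam_to_world: (4, 4) camera-to-world extrinsic matrix
    Returns:      (H*W, 3) Gaussian centers in world coordinates
    """
    H, W = depth.shape
    # Pixel grid in homogeneous image coordinates (u, v, 1).
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)

    # Back-project through the inverse intrinsics, then scale by depth:
    # x_cam = depth * K^{-1} [u, v, 1]^T
    rays = pix @ torch.linalg.inv(intrinsics).T
    pts_cam = rays * depth.reshape(-1, 1)

    # Transform camera-space points to world space with the camera pose.
    pts_hom = torch.cat([pts_cam, torch.ones_like(pts_cam[:, :1])], dim=-1)
    pts_world = (pts_hom @ cam_to_world.T)[:, :3]
    return pts_world
```

Because every pixel yields exactly one Gaussian, the number of primitives scales with input resolution rather than with a fixed volumetric grid, which is one way to read the abstract's claim that the representation supports a scalable, efficient feed-forward reconstructor.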