Precisely perceiving the geometric and semantic properties of real-world 3D objects is crucial for the continued evolution of augmented reality and robotic applications. To this end, we present Foundation Model Embedded Gaussian Splatting (FMGS), which incorporates vision-language embeddings of foundation models into 3D Gaussian Splatting (GS). The key contribution of this work is an efficient method to reconstruct and represent 3D vision-language models. This is achieved by distilling feature maps generated from image-based foundation models into those rendered from our 3D model. To ensure high-quality rendering and fast training, we introduce a novel scene representation that integrates the strengths of both GS and multi-resolution hash encodings (MHE). Our effective training procedure also introduces a pixel alignment loss that pulls the rendered features of the same semantic entities closer together, following pixel-level semantic boundaries. Our results demonstrate remarkable multi-view semantic consistency, facilitating diverse downstream tasks: we outperform state-of-the-art methods by 10.2 percent on open-vocabulary language-based object detection while being 851× faster at inference. This research explores the intersection of vision, language, and 3D scene representation, paving the way for enhanced scene understanding in uncontrolled real-world environments.
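To make the two training objectives above concrete, below is a minimal, hypothetical PyTorch sketch of (a) a feature distillation loss between rendered feature maps and foundation-model targets, and (b) a neighborhood-based pixel alignment loss. The function names, the L1 and cosine-similarity choices, and the kernel size are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Hypothetical sketch of the two losses described in the abstract.
# Shapes, kernel size, and weighting are assumptions, not FMGS's exact recipe.
import torch
import torch.nn.functional as F

def feature_distillation_loss(rendered: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance between a feature map rendered from the 3D model (H, W, C)
    and the feature map produced by an image-based foundation model."""
    return (rendered - target).abs().mean()

def pixel_alignment_loss(rendered: torch.Tensor, guide: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Pull the rendered features (H, W, C) of pixels that the guidance
    features (H, W, D) deem semantically similar toward each other, so that
    rendered features follow pixel-level semantic boundaries."""
    H, W, C = rendered.shape
    pad = k // 2  # keep spatial size with an odd k x k kernel
    # Unfold each map into k*k neighborhoods around every pixel.
    r = F.unfold(rendered.permute(2, 0, 1).unsqueeze(0), k, padding=pad)  # (1, C*k*k, H*W)
    g = F.unfold(guide.permute(2, 0, 1).unsqueeze(0), k, padding=pad)     # (1, D*k*k, H*W)
    r = r.view(C, k * k, H * W)
    g = g.view(guide.shape[-1], k * k, H * W)
    center = (k * k) // 2
    # Cosine similarity of each neighbor to the center pixel in guidance space.
    sim = F.cosine_similarity(g, g[:, center:center + 1, :], dim=0)       # (k*k, H*W)
    w = sim.clamp(min=0.0)  # only pull together pixels the guide deems similar
    # Weighted distance of each neighbor's rendered feature to the center's.
    dist = (r - r[:, center:center + 1, :]).norm(dim=0)                   # (k*k, H*W)
    return (w * dist).mean()
```

In training, the distillation loss would be applied between the rendered feature maps and the foundation-model targets at each training view, while the alignment loss regularizes rendered features within local pixel neighborhoods so that feature boundaries track semantic boundaries.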