Open-vocabulary 3D scene understanding presents a significant challenge in computer vision, with wide-ranging applications in embodied agents and augmented reality systems. Previous approaches have adopted Neural Radiance Fields (NeRFs) to analyze 3D scenes. In this paper, we introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting. Our key idea is distilling pre-trained 2D semantics into 3D Gaussians. We design a versatile projection approach that maps various 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians, without the additional training required by NeRFs. We further build a 3D semantic network that directly predicts the semantic component from raw 3D Gaussians for fast inference. We explore several applications of Semantic Gaussians: semantic segmentation on ScanNet-20, where our approach attains a 4.2% mIoU and 4.0% mAcc improvement over prior open-vocabulary scene understanding counterparts; and object part segmentation, scene editing, and spatial-temporal segmentation, where it achieves better qualitative results than 2D and 3D baselines, highlighting its versatility and effectiveness in supporting diverse downstream tasks.
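As a rough illustration of the projection idea described above (a minimal single-view sketch, not the paper's exact implementation), the snippet below back-projects per-pixel 2D semantic features onto 3D Gaussian centers and averages them into a per-Gaussian semantic component; all function and variable names, and the simple visibility-based weighting, are assumptions made for illustration.

```python
import numpy as np

def project_2d_features_to_gaussians(
    gaussian_xyz,   # (N, 3) 3D Gaussian centers in world coordinates
    feature_map,    # (H, W, C) per-pixel 2D semantic features from a pre-trained encoder
    intrinsics,     # (3, 3) camera intrinsic matrix
    world_to_cam,   # (4, 4) extrinsic matrix (world -> camera)
):
    """Accumulate 2D semantic features onto the 3D Gaussians they project to."""
    H, W, C = feature_map.shape
    N = gaussian_xyz.shape[0]
    feat_sum = np.zeros((N, C), dtype=np.float32)
    weight = np.zeros((N,), dtype=np.float32)

    # Transform Gaussian centers into camera coordinates.
    xyz_h = np.concatenate([gaussian_xyz, np.ones((N, 1))], axis=1)   # (N, 4)
    cam = (world_to_cam @ xyz_h.T).T[:, :3]                           # (N, 3)
    in_front = cam[:, 2] > 1e-6

    # Perspective projection to pixel coordinates.
    uv = (intrinsics @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Splat the corresponding pixel features onto visible Gaussians.
    feat_sum[visible] += feature_map[v[visible], u[visible]]
    weight[visible] += 1.0

    # Per-Gaussian semantic component = averaged 2D features.
    return feat_sum / np.clip(weight[:, None], 1.0, None)
```

In a multi-view setting one would presumably loop this accumulation over all training frames (possibly weighting by rendering opacity or visibility rather than a simple count), and then query the resulting per-Gaussian features against text embeddings from the same pre-trained encoder for open-vocabulary tasks.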