Recently, several studies have combined Gaussian Splatting with language embeddings to obtain scene representations for open-vocabulary 3D scene understanding. While these methods perform well, they typically require very dense multi-view inputs, limiting their applicability in real-world scenarios. In this work, we propose SparseLGS to address the challenge of 3D scene understanding from pose-free, sparse-view input images. Our method leverages a learning-based dense stereo model to handle pose-free and sparse inputs, and a three-step region matching approach to address the multi-view semantic inconsistency problem, which is especially pronounced with sparse inputs. Instead of directly learning high-dimensional CLIP features, we extract low-dimensional information and build bijections back to the full features, avoiding excessive learning and storage costs. We further introduce a reconstruction loss during semantic training to improve Gaussian positions and shapes. To the best of our knowledge, we are the first to address the 3D semantic field problem with sparse, pose-free inputs. Experimental results show that SparseLGS achieves comparable quality when reconstructing semantic fields from fewer inputs (3–4 views) compared to previous SOTA methods with dense input. Moreover, given the same sparse input, SparseLGS leads significantly in quality and substantially improves computation speed (a 5× speedup).
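To illustrate the low-dimensional bijection idea mentioned above, the sketch below shows one way such a scheme could work: each segmented region is assigned a compact code that the Gaussians learn to render, and a lookup table maps rendered codes back to full CLIP features. All names, dimensions, and the nearest-neighbor snapping step are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Hypothetical sketch: rather than optimizing a 512-D CLIP feature per
# Gaussian, each region gets a low-dimensional code, plus a bijection
# (lookup table) from codes back to the original high-dim CLIP features.
rng = np.random.default_rng(0)
num_regions, clip_dim, code_dim = 4, 512, 3

clip_feats = rng.normal(size=(num_regions, clip_dim))
clip_feats /= np.linalg.norm(clip_feats, axis=1, keepdims=True)

codes = rng.uniform(size=(num_regions, code_dim))        # low-dim learning targets
code_to_clip = {tuple(c): f for c, f in zip(codes, clip_feats)}  # the bijection

def lift(rendered_code):
    """Snap a rendered low-dim code to the nearest stored code,
    then recover the corresponding high-dim CLIP feature."""
    keys = np.array(list(code_to_clip.keys()))
    nearest = keys[np.argmin(np.linalg.norm(keys - rendered_code, axis=1))]
    return code_to_clip[tuple(nearest)]

# Simulate a slightly noisy rendered code and recover the CLIP feature.
noisy = codes[1] + 0.01 * rng.normal(size=code_dim)
recovered = lift(noisy)
assert np.allclose(recovered, clip_feats[1])
```

Because only the compact codes are stored per Gaussian, memory and optimization cost scale with `code_dim` rather than `clip_dim`, while open-vocabulary queries can still be answered in the full CLIP space after lifting.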