In this paper, we present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input. As a key idea, we represent both the world and multiple humans via the recently emerging 3D Gaussian Splatting (3D-GS) representation, which enables us to conveniently and efficiently compose and render them together. In particular, we address scenarios with severely limited and sparse observations in 3D human reconstruction, a common challenge encountered in the real world. To tackle this challenge, we introduce a novel approach that optimizes the 3D-GS representation in a canonical space by fusing sparse cues in a common space, where we leverage a pre-trained 2D diffusion model to synthesize unseen views while keeping them consistent with the observed 2D appearances. We demonstrate that our method can reconstruct high-quality, animatable 3D humans in various challenging examples involving occlusions, image crops, few-shot data, and extremely sparse observations. After reconstruction, our method can not only render the scene from any novel view at arbitrary time instances, but also edit the 3D scene by removing individual humans or applying different motions to each human. Through various experiments, we demonstrate the quality and efficiency of our method over existing alternative approaches.
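The key idea stated above, composing scene and per-human 3D Gaussian sets and rendering them jointly, can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's implementation: the `GaussianSet` container, the single rigid transform standing in for the paper's per-Gaussian articulation, and the `compose_scene` helper are hypothetical names introduced only for clarity. The point it shows is that composition reduces to concatenating Gaussian attributes, so the combined set can be passed to any standard 3D-GS rasterizer, and editing (e.g. removing a human) reduces to dropping that human's set before composition.

```python
# Minimal sketch (not the paper's code): composing a world Gaussian set with
# per-human Gaussian sets for joint rendering or per-human editing.
# All names (GaussianSet, pose_human, compose_scene) are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSet:
    means: np.ndarray      # (N, 3) Gaussian centers
    scales: np.ndarray     # (N, 3) per-axis extents
    rotations: np.ndarray  # (N, 4) quaternions
    opacities: np.ndarray  # (N, 1)
    colors: np.ndarray     # (N, 3) RGB (spherical harmonics in practice)

def pose_human(canonical: GaussianSet, R: np.ndarray, t: np.ndarray) -> GaussianSet:
    """Map a human's canonical-space Gaussians into the world at one time instance.
    A single rigid transform stands in for articulated deformation (e.g. LBS)."""
    return GaussianSet(
        means=canonical.means @ R.T + t,
        scales=canonical.scales,
        rotations=canonical.rotations,  # real articulation would also rotate these
        opacities=canonical.opacities,
        colors=canonical.colors,
    )

def compose_scene(world: GaussianSet, humans: list[GaussianSet]) -> GaussianSet:
    """Concatenate all Gaussian sets into one set for joint rendering.
    Removing a human from the scene is just omitting its set from `humans`."""
    parts = [world] + humans
    return GaussianSet(
        means=np.concatenate([p.means for p in parts]),
        scales=np.concatenate([p.scales for p in parts]),
        rotations=np.concatenate([p.rotations for p in parts]),
        opacities=np.concatenate([p.opacities for p in parts]),
        colors=np.concatenate([p.colors for p in parts]),
    )

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    make = lambda n: GaussianSet(rng.normal(size=(n, 3)), np.full((n, 3), 0.01),
                                 np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),
                                 np.ones((n, 1)), rng.uniform(size=(n, 3)))
    world, human = make(1000), make(500)
    posed = pose_human(human, np.eye(3), np.array([0.0, 0.0, 2.0]))
    scene = compose_scene(world, [posed])        # full scene at this time step
    edited = compose_scene(world, [])            # same scene with the human removed
    print(scene.means.shape, edited.means.shape)  # (1500, 3) (1000, 3)
```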