Reconstructing realistic 3D human models from monocular images has significant applications in the creative industries, human-computer interfaces, and healthcare. We base our work on 3D Gaussian Splatting (3DGS), a scene representation composed of a mixture of Gaussians. Predicting such mixtures for a human from a single input image is challenging, as the density of Gaussians is non-uniform (with a many-to-one relationship to input pixels) and subject to strict physical constraints. At the same time, the representation needs to be flexible enough to accommodate a variety of clothes and poses. Our key observation is that the vertices of standardized human meshes (such as SMPL) can provide an adequate density and approximate initial positions for the Gaussians. We can then train a transformer model to jointly predict comparatively small adjustments to these positions, as well as the remaining Gaussian attributes and the SMPL parameters. We show empirically that this combination (using only multi-view supervision) can achieve fast inference of 3D human models from a single image without test-time optimization, expensive diffusion models, or 3D point supervision. We also show that it can improve 3D pose estimation by better fitting human models that account for clothes and other variations.
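To make the core idea concrete, here is a minimal PyTorch sketch of a prediction head in the spirit described above: one Gaussian is anchored at each SMPL template vertex, and a transformer jointly predicts small position offsets, the remaining per-Gaussian attributes, and global SMPL parameters. The class name `GaussianHead`, all layer sizes, the offset scaling, and the exact attribute parameterization are illustrative assumptions, not the paper's actual architecture; only the SMPL vertex count (6890) and pose/shape dimensions (72/10) are standard.

```python
import torch
import torch.nn as nn

NUM_VERTICES = 6890   # SMPL template vertex count
FEATURE_DIM = 256

class GaussianHead(nn.Module):
    """Sketch: per-SMPL-vertex Gaussians refined by a transformer.

    One learnable token per SMPL vertex is processed jointly with image
    tokens; each output token is decoded into a position offset and the
    remaining 3DGS attributes. Hypothetical sizes, not the paper's model.
    """
    def __init__(self, dim=FEATURE_DIM, depth=2, heads=8):
        super().__init__()
        self.vertex_tokens = nn.Parameter(torch.randn(NUM_VERTICES, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)  # shallow for the sketch
        # Per-Gaussian attributes: 3 position offset, 3 log-scale,
        # 4 rotation quaternion, 1 opacity logit, 3 RGB.
        self.gaussian_head = nn.Linear(dim, 3 + 3 + 4 + 1 + 3)
        # Global SMPL parameters: 72 pose (axis-angle) + 10 shape.
        self.smpl_head = nn.Linear(dim, 72 + 10)

    def forward(self, image_tokens, template_vertices):
        B = image_tokens.shape[0]
        verts = self.vertex_tokens.unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([image_tokens, verts], dim=1)
        out = self.encoder(tokens)[:, -NUM_VERTICES:]  # keep only vertex tokens
        attrs = self.gaussian_head(out)
        offset, log_scale, quat, opacity, rgb = attrs.split([3, 3, 4, 1, 3], -1)
        # Comparatively small adjustments around the SMPL vertices
        # (the 0.05 bound is an assumed hyperparameter).
        positions = template_vertices + 0.05 * torch.tanh(offset)
        smpl_params = self.smpl_head(out.mean(dim=1))
        return {
            "xyz": positions,
            "scale": log_scale.exp(),
            "rotation": nn.functional.normalize(quat, dim=-1),
            "opacity": opacity.sigmoid(),
            "rgb": rgb.sigmoid(),
            "smpl": smpl_params,
        }

# Usage with placeholder inputs; a real pipeline would feed ViT image tokens
# and the actual SMPL template vertices instead of random tensors.
model = GaussianHead()
image_tokens = torch.randn(1, 196, FEATURE_DIM)   # e.g. 14x14 ViT patches
template = torch.randn(1, NUM_VERTICES, 3)        # stand-in for SMPL vertices
gaussians = model(image_tokens, template)
print(gaussians["xyz"].shape)  # torch.Size([1, 6890, 3])
```

Under this setup, the predicted Gaussians would be rendered with a differentiable 3DGS rasterizer and trained against multi-view images only, consistent with the supervision described in the abstract.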