Single-image 3D reconstruction remains a fundamental challenge in computer vision due to inherent geometric ambiguities and limited viewpoint information. Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D priors learned from large-scale video data. However, leveraging these priors effectively faces three key challenges: (1) quality degradation under large camera motions, (2) difficulty in achieving precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency. We address these challenges by proposing LiftImage3D, a framework that unlocks LVDMs' generative priors while ensuring 3D consistency. Specifically, we design an articulated trajectory strategy for generating video frames, which decomposes video sequences with large camera motions into sequences with controllable small motions. We then use a robust neural matching model, i.e., MASt3R, to calibrate the camera poses of the generated frames and produce the corresponding point clouds. Finally, we propose a distortion-aware 3D Gaussian splatting representation, which learns independent distortions between frames and outputs undistorted canonical Gaussians. Extensive experiments demonstrate that LiftImage3D achieves state-of-the-art performance on three challenging datasets, i.e., LLFF, DL3DV, and Tanks and Temples, and generalizes well to diverse in-the-wild images, from cartoon illustrations to complex real-world scenes.
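The idea of decomposing one large camera motion into several small, controllable ones can be illustrated with a minimal sketch. This is not the LiftImage3D implementation: it assumes, purely for illustration, that the camera motion is a single yaw sweep, and the function name `articulate_trajectory` and its parameters are hypothetical. Each segment starts at the pose where the previous one ended, so a video model would only ever be asked for a small motion per segment.

```python
import math


def articulate_trajectory(total_yaw_deg, max_segment_deg, frames_per_segment):
    """Split one large yaw sweep into short segments of bounded motion.

    Hypothetical helper for illustration only. Returns a list of segments;
    each segment is a list of absolute yaw angles in degrees, and each
    segment begins at the angle where the previous segment ended.
    """
    if max_segment_deg <= 0:
        raise ValueError("max_segment_deg must be positive")
    if frames_per_segment < 2:
        raise ValueError("frames_per_segment must be at least 2")

    # Number of segments needed so that no segment exceeds the motion bound.
    n_segments = max(1, math.ceil(abs(total_yaw_deg) / max_segment_deg))
    step = total_yaw_deg / n_segments  # signed yaw covered by one segment

    segments = []
    for s in range(n_segments):
        start = s * step
        # Evenly spaced frames from the segment start to the segment end.
        frames = [
            start + step * i / (frames_per_segment - 1)
            for i in range(frames_per_segment)
        ]
        segments.append(frames)
    return segments


if __name__ == "__main__":
    for segment in articulate_trajectory(60.0, 15.0, 5):
        print(segment)
```

In this sketch the last frame of one segment coincides with the first frame of the next, which is one simple way to chain short clips into a long trajectory; how the actual method selects and links frames is not specified in the abstract.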