Videos of robots interacting with objects encode rich information about the objects' dynamics. However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects' 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework's ability to model complex shapes and dynamics.
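As a concrete illustration of the interpolation step described above, the sketch below propagates predicted control-particle motions to dense Gaussian centers with a k-nearest-neighbor, inverse-distance blend. This is a minimal, translation-only simplification (the full method also transforms Gaussian rotations and other attributes, which the sketch omits); the function and variable names (`interpolate_gaussian_motion`, `ctrl_prev`, `ctrl_next`, the choice of `k`) are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def interpolate_gaussian_motion(gaussian_xyz, ctrl_prev, ctrl_next, k=4, eps=1e-8):
    """Propagate the motion of sparse control particles to dense Gaussian centers.

    gaussian_xyz : (G, 3) Gaussian centers at the previous time step
    ctrl_prev    : (P, 3) control-particle positions at the previous time step
    ctrl_next    : (P, 3) control-particle positions predicted by the dynamics model
    Returns the (G, 3) Gaussian centers advected to the next time step.
    """
    # Pairwise distances between Gaussian centers and control particles.
    d = np.linalg.norm(gaussian_xyz[:, None, :] - ctrl_prev[None, :, :], axis=-1)  # (G, P)

    # Keep only the k nearest control particles per Gaussian.
    knn_idx = np.argsort(d, axis=1)[:, :k]                    # (G, k)
    knn_d = np.take_along_axis(d, knn_idx, axis=1)            # (G, k)

    # Inverse-distance weights, normalized per Gaussian.
    w = 1.0 / (knn_d + eps)
    w /= w.sum(axis=1, keepdims=True)                         # (G, k)

    # Displacements of the selected control particles.
    delta = (ctrl_next - ctrl_prev)[knn_idx]                   # (G, k, 3)

    # Blend control-particle displacements onto each Gaussian center.
    return gaussian_xyz + (w[..., None] * delta).sum(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gaussians = rng.uniform(-1, 1, size=(5000, 3))    # dense Gaussian centers
    ctrl_prev = rng.uniform(-1, 1, size=(100, 3))     # sparse control particles
    ctrl_next = ctrl_prev + 0.01 * rng.normal(size=ctrl_prev.shape)  # one predicted step
    moved = interpolate_gaussian_motion(gaussians, ctrl_prev, ctrl_next)
    print(moved.shape)  # (5000, 3)
```

Applied frame by frame to the control-particle trajectories predicted by the GNN dynamics model, such a blending step yields per-Gaussian transformations that can be rendered with 3DGS to produce action-conditioned video predictions.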