I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions
Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, Lan Xu*
- Nov, 2024: 🔈🔈 Download or view our videos and segmentations online!
- Sep, 2024: 🔈🔈 The instance-level segmentations have been released!
- July, 2024: 🔈🔈 The raw videos have been released!
- May, 2024: 🔈🔈 The 32-view 2D & 3D human keypoints have been released!
- March, 2024: 🎉🎉 I'M HOI is accepted to CVPR 2024!
- Jan. 04, 2024: 🔈🔈 Fill out the form to have access to IMHD$^2$!
IMHD
- Human motion annotation in SMPL-H format, built on EasyMocap
- Object motion annotation, built on PHOSA
- Well-scanned object geometry, using Polycam
- Object-mounted IMU sensor measurement, using Movella DOT
- 32-view RGB videos & instance-level segmentations, built on SAM, Track-Anything and XMem
- 32-view 2D&3D human keypoints detection, using ViTPose and MediaPipe
data/
|--calibrations/ # camera intrinsics and world-to-cam extrinsics
|--object_templates/ # raw and downsampled geometry
|--imu_preprocessed/ # pre-processed IMU signal
|--keypoints2d/ # body keypoints in OP25 format and hand keypoints in MediaPipe format
|--keypoints3d/ # body keypoints in OP25 format and hand keypoints in MediaPipe format
|--video_release/ # raw videos from 32 views
|--mask_release/ # separate human and object segmentations from 32 views
|--ground_truth/ # human motion in SMPL-H format and rigid object motion
|----<date>/
|------<segment_name>/
|--------<sequence_name>/
|----------gt_<part_id>_<start>_<end>.pkl
All sub-folders share the same structure as the ground_truth/ folder shown above. Because the motion annotations for some parts of some sequences are not ideal, a sequence folder may contain several .pkl files. The name of each leaf .pkl file encodes the part index and frame range. For example, gt_0_10_100.pkl is the first motion part, which starts at frame_10 and ends at frame_100.
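The naming convention above can be parsed programmatically. The following is a minimal sketch, not part of the released code: the helper name `parse_gt_name` and the `GtSegment` type are hypothetical, and it assumes all three fields are non-negative integers as in the example gt_0_10_100.pkl.

```python
import re
from typing import NamedTuple


class GtSegment(NamedTuple):
    """Fields encoded in a gt_<part_id>_<start>_<end>.pkl file name."""
    part_id: int  # 0 denotes the first motion part
    start: int    # first frame index of the part
    end: int      # last frame index named in the file


# Hypothetical pattern derived from the documented naming scheme.
_GT_NAME = re.compile(r"^gt_(\d+)_(\d+)_(\d+)\.pkl$")


def parse_gt_name(file_name: str) -> GtSegment:
    """Split a ground-truth file name into part index and frame range."""
    match = _GT_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unexpected ground-truth file name: {file_name!r}")
    part_id, start, end = (int(group) for group in match.groups())
    return GtSegment(part_id, start, end)


segment = parse_gt_name("gt_0_10_100.pkl")
print(segment.part_id, segment.start, segment.end)
```

Sorting the parsed segments of one sequence folder by `part_id` gives the motion parts in order.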
We tested our code on Windows 10, Windows 11, Ubuntu 18.04 LTS and Ubuntu 20.04 LTS.
All dependencies:
python>=3.8
CUDA=11.7
torch=1.13.0
pytorch3d
opencv-python
matplotlib
smplx
conda create -n imhd2 python=3.8 -y
conda activate imhd2
conda install pytorch=1.13.0 torchvision pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
git clone https://github.com/facebookresearch/pytorch3d.git
cd pytorch3d && pip install -e . --ignore-installed PyYAML
conda create -n imhd2 python=3.8 -y
conda activate imhd2
conda install --file conda_install_cuda117_pakage.txt -c nvidia
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"
- Prepare data. Download IMHD$^2$ from here and place it under the root directory in the pre-defined structure.
- Prepare body model. Download SMPL-H (the extended SMPL+H model) and put the model files under the body_model/ folder. Overall, the structure of the body_model/ folder should be:
body_model/
|--README.md
|--__init__.py
|--body_model.py
|--utils.py
|--smplh/
|----info.txt
|----LICENSE.txt
|----female/
|------model.npz
|----male/
|------model.npz
|----neutral/
|------model.npz
- Run python visualization.py to check how to load and visualize IMHD$^2$. Results will be saved in visualizations/.
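Before running the visualization, it can help to confirm that the SMPL-H model files are where the layout above expects them. This is a small sketch under that assumption only; the function name `missing_body_model_files` is hypothetical and is not part of the repository.

```python
from pathlib import Path

# Model files listed in the body_model/ layout above.
EXPECTED_MODEL_FILES = (
    "smplh/female/model.npz",
    "smplh/male/model.npz",
    "smplh/neutral/model.npz",
)


def missing_body_model_files(body_model_dir="body_model"):
    """Return the expected SMPL-H model files that are not present on disk."""
    root = Path(body_model_dir)
    return [rel for rel in EXPECTED_MODEL_FILES if not (root / rel).is_file()]


missing = missing_body_model_files("body_model")
if missing:
    print("Missing SMPL-H files:", ", ".join(missing))
else:
    print("All SMPL-H model files found.")
```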
Q1: Which coordinate system are the ground-truth motions in? How can motions from different dates be aligned?
A1: The ground-truth motions are in the world coordinate system, which was calibrated using the multi-camera system and may differ across dates. To align them, use the camera parameters provided in calibrations/ to transform all motion data into the camera coordinate system.
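As an illustration of this alignment, the sketch below applies a world-to-camera rigid transform with NumPy. It assumes the extrinsics are available as a 3x3 rotation `R` and a translation `t` such that x_cam = R x_world + t; how these are stored inside calibrations/ is not described here, so loading them is left out, and both function names are hypothetical.

```python
import numpy as np


def points_world_to_camera(points_world, R, t):
    """Map (N, 3) world-space points into the camera frame: x_cam = R x_world + t."""
    points_world = np.asarray(points_world, dtype=np.float64)
    R = np.asarray(R, dtype=np.float64)
    t = np.asarray(t, dtype=np.float64)
    return points_world @ R.T + t


def rotation_world_to_camera(rot_world, R):
    """Map a 3x3 world-space orientation (e.g. a root or object rotation) into the camera frame."""
    return np.asarray(R, dtype=np.float64) @ np.asarray(rot_world, dtype=np.float64)


# Toy extrinsics (assumed values): 90-degree rotation about z plus a shift along x.
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([1.0, 0.0, 0.0])
print(points_world_to_camera([[1.0, 0.0, 0.0]], R, t))
```

Translations (joint positions, object centers) go through the first function, while global orientations go through the second, so that positions and rotations stay consistent in the same camera frame.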
Q2: Which object category do the motions named 'bat' in 20230825/ and 20230827/ interact with?
A2: The motions in 20230825/ and 20230827/ interact with a baseball bat, corresponding to 'baseball' in the object_templates/ folder.
Q3: Which camera serves as the main view?
A3: The main view is from the camera labeled '1' (starting from 0).
Q4: How to decode the raw videos to images?
A4: Please use the command: ffmpeg -i <input_path> -qscale:v 2 -f image2 -v error -start_number 0 -threads 64 output/%06d.jpg
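To decode many videos in a loop, the ffmpeg command above can be wrapped in Python. This is a hedged sketch rather than a script shipped with the dataset: the function names and the `threads` parameter are hypothetical, the flags are copied from the command above, and `ffmpeg` is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path


def build_decode_command(input_path, output_dir, threads=64):
    """Assemble the ffmpeg argument list that writes frames as 000000.jpg, 000001.jpg, ..."""
    pattern = str(Path(output_dir) / "%06d.jpg")
    return [
        "ffmpeg", "-i", str(input_path),
        "-qscale:v", "2",        # JPEG quality setting from the documented command
        "-f", "image2",
        "-v", "error",           # only print errors
        "-start_number", "0",    # first frame is numbered 0
        "-threads", str(threads),
        pattern,
    ]


def decode_video(input_path, output_dir, threads=64):
    """Create the output folder and run ffmpeg; raises CalledProcessError on failure."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_decode_command(input_path, output_dir, threads), check=True)


print(" ".join(build_decode_command("<input_path>", "output")))
```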
If you find our data or paper helpful, please consider citing:
@InProceedings{zhao2024imhoi,
author = {Zhao, Chengfeng and Zhang, Juze and Du, Jiashen and Shan, Ziwei and Wang, Junye and Yu, Jingyi and Wang, Jingya and Xu, Lan},
title = {I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {729-741}
}
This work was supported by the National Key R&D Program of China (2022YFF0902301) and the Shanghai Local College Capacity Building Program (22010502800). We also acknowledge support from the Shanghai Frontiers Science Center of Human-centered Artificial Intelligence (ShangHAI).
We thank Jingyan Zhang and Hongdi Yang for setting up the capture system. We thank Jingyan Zhang, Zining Song, Jierui Xu, Weizhi Wang, Gubin Hu, Yelin Wang, Zhiming Yu, Xuanchen Liang, af and zr for data collection. We thank Xiao Yu, Yuntong Liu and Xiaofan Gu for data checking and annotations.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.