Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data
Rui Huang, Henry Zheng, Yan Wang, Zhuofan Xia, Marco Pavone, Gao Huang
[Project Page] [arXiv] [BibTeX]
We propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det, which trains detectors using only RGB images, making it both cost-effective and scalable to publicly available data.
We follow the main dependencies of Cube R-CNN and have added dependencies for UniDepth and Grounded-SAM.
# set up a new environment
conda create -n ovm3d python=3.10
conda activate ovm3d
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c fvcore -c iopath -c conda-forge -c pytorch3d -c pytorch fvcore iopath pytorch3d
# OpenCV, COCO, detectron2
pip install cython opencv-python
pip install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
cd third_party
git clone git@github.com:facebookresearch/detectron2.git
python -m pip install -e detectron2
# other dependencies
conda install -c conda-forge scipy seaborn
# install dependencies for UniDepth
cd UniDepth
pip install -e .
# install dependencies for Grounded-Segment-Anything
cd ../Grounded-Segment-Anything
python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
pip install scikit-learn
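As a quick sanity check (this snippet is only an illustrative verification, not part of the official setup), you can confirm that the core dependencies import correctly and that CUDA is visible:

python -c "import torch, detectron2, pytorch3d; print(torch.__version__, torch.cuda.is_available())"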
We utilize four datasets from Omni3D: KITTI, nuScenes, SUN RGB-D, and ARKitScenes. For detailed instructions on downloading and setting up the images and annotations, please refer to the Omni3D data guide.
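Once prepared, the datasets folder should contain the Omni3D annotation files alongside the per-dataset image folders. The layout below is only an illustrative sketch; use the exact folder names given in the Omni3D data guide:

datasets/
├── Omni3D/          # Omni3D annotation JSON files
├── KITTI_object/    # KITTI images
├── nuScenes/        # nuScenes images
├── SUNRGBD/         # SUN RGB-D images
└── ARKitScenes/     # ARKitScenes images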
We provide pre-generated pseudo labels here for the training and validation sets. Please place them in the datasets folder. To generate the pseudo labels yourself, run:
bash scripts/generate_pseudo_label.sh DATASET
Specifically:
DATASET=$1
# Step 1: Predict depth using UniDepth
CUDA_VISIBLE_DEVICES=0 python third_party/UniDepth/run_unidepth.py --dataset $DATASET
# Step 2: Segment novel objects and the ground using Grounded-SAM
CUDA_VISIBLE_DEVICES=0 python third_party/Grounded-Segment-Anything/grounded_sam_detect.py --dataset $DATASET
CUDA_VISIBLE_DEVICES=0 python third_party/Grounded-Segment-Anything/grounded_sam_detect_ground.py --dataset $DATASET
# Step 3: Generate pseudo 3D bounding boxes
python tools/generate_pseudo_bbox.py \
  --config-file configs/Base_Omni3D_${DATASET}.yaml \
  OUTPUT_DIR output/generate_pseudo_label
# Step 4: Convert to COCO dataset format
python tools/transform_to_coco.py --dataset_name $DATASET
Replace DATASET with the name of the dataset you are working with.
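For example, to generate pseudo labels for KITTI (the dataset identifier here is assumed to match the suffix of the corresponding config file in configs/):

bash scripts/generate_pseudo_label.sh KITTI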
To evaluate the trained models, download the pre-trained models and place them in the checkpoints folder.
| Dataset | Link |
|---|---|
| KITTI | Google Drive |
| nuScenes | Google Drive |
| SUNRGBD | Google Drive |
| ARKitScenes | Google Drive |
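For instance (the checkpoint file name below is a placeholder; keep whatever names the downloaded files use):

mkdir -p checkpoints
# e.g. place the downloaded weights as checkpoints/ovm3d_det_kitti.pth (hypothetical name)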
Then run:
bash scripts/test.sh DATASET
To train the model from scratch, run:
bash scripts/train.sh DATASET
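For example, to train on nuScenes (again assuming the dataset identifier expected by the script):

bash scripts/train.sh nuScenes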
If you find this repo helpful, please consider citing us.
@inproceedings{huang2024training,
title={Training an Open-Vocabulary Monocular 3D Detection Model without 3D Data},
author={Rui Huang and Henry Zheng and Yan Wang and Zhuofan Xia and Marco Pavone and Gao Huang},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
}
We build upon the source code of Cube R-CNN, UniDepth, Grounded-SAM, WeakM3D, and OV-3DET. We sincerely thank the authors for their efforts.