Depth estimation, visual odometry (VO), and bird's-eye-view (BEV) scene layout estimation are three critical tasks for driving scene perception, which is fundamental to motion planning and navigation in autonomous driving. Although they are complementary, prior works usually focus on one task at a time and rarely address all three together. A naive approach is to solve them independently in a sequential or parallel manner, but this has three drawbacks: 1) the depth and VO results suffer from the inherent scale ambiguity issue; 2) the BEV layout is usually estimated separately for roads and vehicles, while the explicit overlay-underlay relations between them are ignored; and 3) the BEV layout is predicted directly from the front-view image without using any depth-related information, although the depth map contains useful geometry clues for inferring scene layouts. In this paper, we address these issues by proposing a novel joint perception framework named JPerceiver, which estimates scale-aware depth and VO as well as BEV layout simultaneously from a monocular video sequence. It exploits the cross-view geometric transformation (CGT) to propagate the absolute scale from the road layout to depth and VO based on a carefully designed scale loss. Meanwhile, a cross-view and cross-modal transfer (CCT) module is devised to leverage the depth clues for reasoning about road and vehicle layout through an attention mechanism. JPerceiver can be trained end to end in a multi-task learning manner, where the CGT scale loss and the CCT module promote inter-task knowledge transfer to benefit the feature learning of each task. Experiments on Argoverse, nuScenes, and KITTI show the superiority of JPerceiver over existing methods on all three tasks in terms of accuracy, model size, and inference speed.
- we propose the first joint perception framework JPerceiver for depth, VO and BEV layout estimation simultaneously;
- we design a CGT scale loss to leverage the absolute scale information from the BEV layout to achieve scale-aware depth and VO;
- we devise a CCT module that leverages the depth clues to help reason the spatial relationships between roads and vehicles implicitly, and facilitates the feature learning for BEV layout estimation;
- we conduct extensive experiments on public benchmarks and show that JPerceiver outperforms the state-of-the-art methods on the above three tasks by a large margin.
More details can be found in the paper: JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes (ECCV 2022) by Haimei Zhao, Jing Zhang, Sen Zhang and Dacheng Tao.
We recommend setting up a virtual environment with Python 3.5+ and PyTorch 1.1+ and installing all the dependencies listed in the requirements file.
git clone https://github.com/sunnyHelen/JPerceiver.git
cd JPerceiver
pip install -r requirements.txt
In the paper, we present results on the KITTI 3D Object, KITTI Odometry, KITTI RAW, and Argoverse 3D Tracking v1.0 datasets. For comparison with Schulter et al., we use the same training and test split sequences from the KITTI RAW dataset. For more details about the training/testing splits, see the splits directory. The ground truth can be downloaded from Monolayout.
If the road label link in Monolayout is invalid, please try these: KITTI RAW and KITTI Odometry.
# Download KITTI RAW
./data/download_datasets.sh raw
# Download KITTI 3D Object
./data/download_datasets.sh object
# Download KITTI Odometry
./data/download_datasets.sh odometry
# Download Argoverse Tracking v1.0
./data/download_datasets.sh argoverse
The above scripts will download, unzip, and store the respective datasets in the datasets directory.
datasets/
├── argoverse                                       # argoverse dataset
│   └── argoverse-tracking
│       └── train1
│           └── 1d676737-4110-3f7e-bec0-0c90f74c248f
│               ├── car_bev_gt                      # Vehicle GT
│               ├── road_gt                         # Road GT
│               └── stereo_front_left               # RGB image
└── kitti                                           # kitti dataset
    ├── object                                      # kitti 3D Object dataset
    │   └── training
    │       ├── image_2                             # RGB image
    │       └── vehicle_256                         # Vehicle GT
    ├── odometry                                    # kitti odometry dataset
    │   └── 00
    │       ├── image_2                             # RGB image
    │       └── road_dense128                       # Road GT
    └── raw                                         # kitti raw dataset
        └── 2011_09_26
            └── 2011_09_26_drive_0001_sync
                ├── image_2                         # RGB image
                └── road_dense128                   # Road GT
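Before training, it can help to confirm that the downloaded data matches the layout above. The following is a minimal sketch, not part of the repository: the `LAYOUT` table and the `find_incomplete` helper are hypothetical names, and the glob patterns simply assume that every sequence or drive folder follows the example entries shown in the tree.

```python
from pathlib import Path

# Hypothetical table: glob pattern for the folders that hold the data,
# plus the sub-directories each of those folders is expected to contain.
# Both are read off the directory tree above.
LAYOUT = {
    "argoverse": ("argoverse/argoverse-tracking/train1/*",
                  ["car_bev_gt", "road_gt", "stereo_front_left"]),
    "object": ("kitti/object/training", ["image_2", "vehicle_256"]),
    "odometry": ("kitti/odometry/*", ["image_2", "road_dense128"]),
    "raw": ("kitti/raw/*/*", ["image_2", "road_dense128"]),
}


def find_incomplete(root, dataset):
    """Return a list of problems (missing folders) for one dataset."""
    root = Path(root)
    pattern, required = LAYOUT[dataset]
    parents = sorted(p for p in root.glob(pattern) if p.is_dir())
    if not parents:
        return [f"no directory matches {pattern}"]
    problems = []
    for parent in parents:
        for name in required:
            candidate = parent / name
            if not candidate.is_dir():
                # Report paths relative to the datasets root for readability.
                problems.append(candidate.relative_to(root).as_posix())
    return problems
```

Calling `find_incomplete("datasets", "odometry")` returns an empty list when every sequence folder contains both `image_2` and `road_dense128`.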
- Prepare the corresponding dataset
- Run training
# Training
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port 25629 train.py --config config/cfg_kitti_baseline_odometry_boundary_ce_iou_1024_20.py --work_dir log/odometry/
- Choose a different config file and log directory for different datasets and training settings.
- The BEV layout is evaluated during training; the results can be found in the respective "xxx.log.json" files.
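When switching between GPU counts, configs, and log directories, the training command above can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_train_command` is a hypothetical helper, and it only uses the flags that appear in the command above (`--nproc_per_node`, `--master_port`, `--config`, `--work_dir`).

```python
def build_train_command(gpu_ids, config, work_dir, master_port=25629):
    """Build the environment and argument list for a distributed training run.

    gpu_ids: GPU indices to expose, e.g. [0, 1, 2, 3]
    config: path to a config file under config/
    work_dir: directory where logs and checkpoints are written
    """
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids)}
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={len(gpu_ids)}",  # one process per visible GPU
        "--master_port", str(master_port),
        "train.py",
        "--config", config,
        "--work_dir", work_dir,
    ]
    return env, cmd


# Example usage (left commented out so importing this sketch starts nothing):
# import os, subprocess
# env, cmd = build_train_command(
#     [0, 1, 2, 3],
#     "config/cfg_kitti_baseline_odometry_boundary_ce_iou_1024_20.py",
#     "log/odometry/",
# )
# subprocess.run(cmd, env={**os.environ, **env}, check=True)
```

Deriving `--nproc_per_node` from the GPU list keeps the process count and `CUDA_VISIBLE_DEVICES` in sync.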
- Prepare the corresponding dataset
- Download pre-trained models
- Run evaluation
# Evaluate depth results
python scripts/eval_depth_eigen.py
# Evaluate VO results
python scripts/draw_odometry.py
The following table provides links to the pre-trained models for each dataset mentioned in our paper. The table also shows the corresponding evaluation results for these models.
Dataset | Segmentation Objects | mIOU(%) | mAP(%) | Pretrained Model |
---|---|---|---|---|
KITTI 3D Object | Vehicle | 40.85 | 57.23 | link |
KITTI Odometry | Road | 78.13 | 89.57 | link |
KITTI Raw | Road | 66.39 | 86.17 | link |
Argoverse Tracking | Vehicle | 49.45 | 65.84 | link |
Argoverse Tracking | Road | 77.50 | 90.21 | link |
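For readers unfamiliar with the two columns, the sketch below shows one common way to score a binary BEV layout mask. It is an illustration only, not the repository's evaluation code: it assumes masks are 2D lists of 0/1, that mAP is taken as per-image precision averaged over images, and that empty masks score 1.0. The function names are hypothetical.

```python
def iou_and_precision(pred, gt):
    """Score one predicted binary mask against its ground truth.

    pred, gt: equally sized 2D lists of 0/1 for a single class.
    Returns (IoU, precision).
    """
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g and not p:
                fn += 1
    union = tp + fp + fn
    # Assumed convention: an empty prediction on an empty target is perfect.
    iou = tp / union if union else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return iou, precision


def mean_scores(pairs):
    """Average IoU and precision over an iterable of (pred, gt) mask pairs."""
    scores = [iou_and_precision(p, g) for p, g in pairs]
    n = len(scores)
    return sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n
```

Multiplying the two averages by 100 gives percentages comparable in form to the table columns, subject to the assumptions above.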
Results of depth estimation on the Eigen split
abs_rel | sq_rel | rmse | rmse_log | a1 | a2 | a3 | Scaling ratios | Pretrained Model |
---|---|---|---|---|---|---|---|---|
0.116 | 0.984 | 5.022 | 0.193 | 0.875 | 0.959 | 0.982 | 1.074 ± 0.077 | link |
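The column names follow the standard monocular depth error metrics. The sketch below computes them for a single image in plain Python, assuming `gt` and `pred` hold positive depths in metres at valid pixels only. `depth_metrics` is a hypothetical name, and how the repository aggregates the per-image scaling ratios into the reported value is not shown here.

```python
import math
from statistics import median


def depth_metrics(gt, pred, median_scaling=False):
    """Standard depth errors and accuracies for one image.

    gt, pred: equally long sequences of positive depth values.
    median_scaling: if True, rescale pred so its median matches gt first.
    """
    ratio = median(gt) / median(pred)  # per-image scaling ratio
    if median_scaling:
        pred = [p * ratio for p in pred]
    n = len(gt)
    pairs = list(zip(gt, pred))
    # Accuracy under threshold: share of pixels with max(gt/pred, pred/gt) < t.
    thresh = [max(g / p, p / g) for g, p in pairs]
    return {
        "abs_rel": sum(abs(g - p) / g for g, p in pairs) / n,
        "sq_rel": sum((g - p) ** 2 / g for g, p in pairs) / n,
        "rmse": math.sqrt(sum((g - p) ** 2 for g, p in pairs) / n),
        "rmse_log": math.sqrt(
            sum((math.log(g) - math.log(p)) ** 2 for g, p in pairs) / n
        ),
        "a1": sum(t < 1.25 for t in thresh) / n,
        "a2": sum(t < 1.25 ** 2 for t in thresh) / n,
        "a3": sum(t < 1.25 ** 3 for t in thresh) / n,
        "scaling_ratio": ratio,
    }
```

A scaling ratio close to 1 means the predicted depth already has roughly the correct absolute scale, so median scaling changes little.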
### Draw trajectories
python scripts/plot_kitti.py
### Prediction video generation
# kitti
python eval_kitti_video.py
# Argoverse
python eval_argo_both_video.py
If you encounter any problems, please describe them in the issues or contact:
- Haimei Zhao: [email protected]
This project is released under the MIT License (refer to the LICENSE file for details). Thanks to the authors of the related open-source works: this project partially depends on the source code of Monolayout, PYVA, Monodepth2, and FeatDepth.
If you find our work useful for your research, please consider citing the paper:
@article{zhao2022jperceiver,
title={JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes},
author={Zhao, Haimei and Zhang, Jing and Zhang, Sen and Tao, Dacheng},
journal={arXiv preprint arXiv:2207.07895},
year={2022}
}