[CVPRW 2024] Divide and Conquer Boosting for Enhanced Traffic Safety Description and Analysis with Large Vision Language Model
Khai Trinh Xuan, Khoi Nguyen Nguyen, Bach Hoang Ngo, Vu Dinh Xuan, Minh-Hung An, Quang-Vinh Dinh
⭐ The 2nd Place Solution to The 8th NVIDIA AI City Challenge (2024) Track 2 from AIO_ISC Team.
Rank | Team ID | Team name | MRR Score |
---|---|---|---|
1 | 208 | AliOpenTrek | 33.4308 |
2 | 28 | AIO_ISC (Ours) | 32.8877 |
3 | 68 | Lighthouse | 32.3006 |
4 | 87 | VAI | 32.2778 |
5 | 184 | Santa Claude | 29.7838 |
├── src
│ ├── preprocess
│ | ├── extract_frames
│ | ├── segment_extraction
│ ├── train
│ | ├── Qwen-VL
│ | ├── prepare_train_data
│ ├── inference
│ ├── postprocess
│ ├── evaluation
├── tools
├── aux_dataset
│ ├── results
│ ├── submission
│ ├── train_data
│ ├── extracted_frames
│ ├── segmentation_data
├── dataset
Please download the WTS dataset and set it up as follows:
├── dataset
│ ├── annotations
│ | ├── bbox_annotated
│ | | ├── pedestrian
│ | | | ├── train
│ | | | ├── val
│ | | | ├── test
│ | | ├── vehicle
│ | ├── bbox_generated
│ | | ├── ... (same structure)
│ | ├── caption
│ | | | ├── train
│ | | | ├── val
│ | | | ├── test
│ | videos
│ | | ├── train
│ | | ├── val
│ | | ├── test
│ | external
│ | | ├── BDD_PC_5K
│ | | | ├── annotations
│ | | | | ├── bbox_annotated
│ | | | | | ├── train
│ | | | | | ├── val
│ | | | | | ├── test
│ | | | | ├── bbox_generated
│ | | | | | ├── ... (same structure)
│ | | | | ├── caption
│ | | | | | ├── ... (same structure)
│ | | | ├── videos
│ | | | | ├── train
│ | | | | ├── val
│ | | | | ├── test
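Before running the pipeline, it can help to confirm that the folders exist where the scripts expect them. The following is a minimal sketch, not part of the repo: `missing_dirs` and `EXPECTED_DIRS` are hypothetical names, and the list only covers the annotation folders shown above, so extend it with the `videos` and `external/BDD_PC_5K` folders as they are laid out on your machine.

```python
from pathlib import Path

# Hypothetical subset of the layout above (assumption: paths are relative
# to the dataset root). Add the videos and external folders as needed.
EXPECTED_DIRS = [
    "annotations/bbox_annotated/pedestrian/train",
    "annotations/bbox_annotated/pedestrian/val",
    "annotations/bbox_annotated/pedestrian/test",
    "annotations/bbox_annotated/vehicle",
    "annotations/bbox_generated",
    "annotations/caption",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("dataset")
    if missing:
        print("Missing directories:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All checked directories are present.")
```

An empty result only means the listed folders are present; it does not validate the files inside them.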
pip install -r requirements.txt
Follow the steps below to reproduce our final submission, or download our aux_dataset folder (which does not include extracted_frames) here. Running this repo requires an NVIDIA GPU with the Ampere architecture (RTX 30 series, A5000, A6000, ...) or newer (Hopper, Blackwell).
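A quick way to check the GPU requirement is to look at the CUDA compute capability: Ampere cards report 8.x, and newer architectures report a higher major version. The sketch below is an illustration rather than part of the repo; `is_ampere_or_newer` is a hypothetical helper, and it assumes PyTorch is installed and that device 0 is the GPU you intend to use.

```python
def is_ampere_or_newer(capability):
    """Check a (major, minor) CUDA compute capability tuple.

    Ampere corresponds to major version 8; later architectures are higher.
    """
    major, _minor = capability
    return major >= 8


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; cannot query the GPU.")
    else:
        if not torch.cuda.is_available():
            print("No CUDA device is visible.")
        else:
            # Assumption: device 0 is the GPU used for training and inference.
            cap = torch.cuda.get_device_capability(0)
            name = torch.cuda.get_device_name(0)
            verdict = "OK" if is_ampere_or_newer(cap) else "older than Ampere"
            print(f"{name}: compute capability {cap[0]}.{cap[1]} ({verdict})")
```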
Extracting video frames:
sh tools/extract_frames.sh
Segment extraction: please create a new environment by following the instructions here, then run:
sh tools/segment_extraction.sh
Remember to request access to the mistralai/Mistral-7B-Instruct-v0.2 model on the Hugging Face Hub first.
Prepare train data:
sh tools/prepare_train_data.sh
Training: set the correct train and eval data paths, then run the code here.
Run inference with the trained model on the test set:
sh tools/inference.sh
The pretrained checkpoints uploaded to the Hugging Face Hub are listed here.
Postprocessing:
sh tools/postprocess.sh
After running postprocessing, you can submit the file aux_dataset/submission.json to the official evaluation server.
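Before uploading, a small sanity check can catch a truncated or empty file. This is a minimal sketch under stated assumptions: `load_submission` is a hypothetical helper, and because the submission schema is not described here, it only verifies that the file parses as JSON and is not empty.

```python
import json
from pathlib import Path


def load_submission(path="aux_dataset/submission.json"):
    """Parse the submission file and fail early if it is malformed or empty."""
    text = Path(path).read_text(encoding="utf-8")
    data = json.loads(text)  # raises json.JSONDecodeError on malformed input
    if not data:
        raise ValueError(f"{path} parsed correctly but contains no entries")
    return data


if __name__ == "__main__":
    default = Path("aux_dataset/submission.json")
    if default.exists():
        print(f"{len(load_submission(default))} top-level entries found")
    else:
        print(f"{default} not found; run postprocessing first")
```

The check says nothing about whether the entries match what the evaluation server expects; it only rules out the most basic failures.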
For evaluation, we follow the wts-dataset repo and reimplement a faster version here.
We provide a Dockerfile to build the segment extraction environment:
sudo docker build -t segment_extraction .
sudo docker run -it --gpus all -v "$(pwd)":/home/code -w /home/code segment_extraction
If you have any questions, please open an issue on this repository or contact the authors.
@InProceedings{Xuan_2024_CVPR,
author = {Xuan, Khai Trinh and Nguyen, Khoi Nguyen and Ngo, Bach Hoang and Xuan, Vu Dinh and An, Minh-Hung and Dinh, Quang-Vinh},
title = {Divide and Conquer Boosting for Enhanced Traffic Safety Description and Analysis with Large Vision Language Model},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2024},
pages = {7046-7055}
}
Our VLM training code relies on the Qwen-VL repo. Thanks to its authors for their great work!