DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
Xianda Guo*, Ruijun Zhang*, Yiqun Duan*, Yuhang He, Chenming Zhang, Long Chen.
- [2024/11] Paper released on arXiv.
We are using the Hugging Face dataset DriveMLLM for evaluation. The images are sourced from the validation set of nuScenes. We provide a `metadata.jsonl` file for all images, allowing users to easily access properties such as `bboxes2D`.
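For example, the metadata can be read line by line with standard JSON tooling. This is a minimal sketch; it assumes `metadata.jsonl` sits in your working directory after downloading the dataset, and only the `bboxes2D` field name is taken from this README — other keys are dataset-specific:

```python
import json

# Each line of metadata.jsonl is assumed to be one JSON record describing an image.
with open("metadata.jsonl", "r") as f:
    records = [json.loads(line) for line in f if line.strip()]

# Inspect the first record and its 2D bounding boxes.
first = records[0]
print(first.keys())
print(first.get("bboxes2D"))
```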
Run the following command to download the dataset, generate the VQAs, and save them in the `eval_vqas` folder:

```bash
python hfdata_to_VQA.py
```
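Roughly, the script pulls the samples from the Hugging Face Hub and writes one VQA file per sample into `eval_vqas`. The sketch below only illustrates that flow; the dataset repository id, split name, and output record format are placeholders, not the script's actual values:

```python
import json
from pathlib import Path

from datasets import load_dataset

# Placeholder repository id and split; substitute the actual DriveMLLM dataset path on the Hub.
ds = load_dataset("your-namespace/DriveMLLM", split="validation")

out_dir = Path("eval_vqas")
out_dir.mkdir(exist_ok=True)

# Emit one toy VQA record per sample; the real script builds the full
# question/answer pairs and image references used for evaluation.
for i, _sample in enumerate(ds):
    with open(out_dir / f"{i:06d}.json", "w") as f:
        json.dump({"index": i, "question": "placeholder"}, f)
```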
Run inference according to your requirements:
- For GPT API calls:

  ```bash
  export OPENAI_API_KEY=your_api_key
  python inference/get_mllm_output.py \
      --model_type gpt \
      --model gpt-4o \
      --vqas_dir eval_vqas \
      --save_dir inference/mllm_outputs
  ```
- For Gemini API calls:

  ```bash
  export GOOGLE_API_KEY=your_api_key
  python inference/get_mllm_output.py \
      --model_type gemini \
      --model models/gemini-2.0-flash \
      --vqas_dir eval_vqas \
      --save_dir inference/mllm_outputs
  ```
- For local LLaVA-NeXT inference:

  ```bash
  python inference/get_mllm_output.py \
      --model_type llava \
      --model lmms-lab/llava-onevision-qwen2-7b-si \
      --vqas_dir eval_vqas \
      --save_dir inference/mllm_outputs
  ```
- For local Qwen2-VL inference:

  ```bash
  python inference/get_mllm_output.py \
      --model_type qwen \
      --model Qwen/Qwen2.5-VL-7B-Instruct \
      --vqas_dir eval_vqas \
      --save_dir inference/mllm_outputs
  ```
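To sweep all four configurations in one go, a small driver like the following can be used. This is a convenience sketch, not part of the repository; it assumes you run it from the repository root with the relevant API keys exported and local model weights available, and the model/type pairs are taken directly from the commands above:

```python
import subprocess

# Model configurations mirroring the single-model commands listed above.
CONFIGS = [
    ("gpt", "gpt-4o"),
    ("gemini", "models/gemini-2.0-flash"),
    ("llava", "lmms-lab/llava-onevision-qwen2-7b-si"),
    ("qwen", "Qwen/Qwen2.5-VL-7B-Instruct"),
]

for model_type, model in CONFIGS:
    # Each call is equivalent to running the corresponding command above by hand.
    subprocess.run(
        [
            "python", "inference/get_mllm_output.py",
            "--model_type", model_type,
            "--model", model,
            "--vqas_dir", "eval_vqas",
            "--save_dir", "inference/mllm_outputs",
        ],
        check=True,
    )
```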
After executing the script, the results will be saved in the directory `{save_dir}/{model_type}/{model}`.
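The per-model outputs can then be gathered for scoring. Below is a minimal sketch that assumes the outputs are JSON files directly under that directory; the actual file layout and format are determined by the inference script:

```python
from pathlib import Path

# Example values; adjust to whichever model you ran above.
save_dir, model_type, model = "inference/mllm_outputs", "gpt", "gpt-4o"
out_dir = Path(save_dir) / model_type / model

# Collect whatever the inference script wrote for this model; the *.json
# extension is an assumption here, not a documented guarantee.
outputs = sorted(out_dir.glob("*.json"))
print(f"Found {len(outputs)} output files under {out_dir}")
```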
```bibtex
@article{DriveMLLM,
  title={DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving},
  author={Guo, Xianda and Zhang, Ruijun and Duan, Yiqun and He, Yuhang and Zhang, Chenming and Chen, Long},
  journal={arXiv preprint arXiv:2411.13112},
  year={2024}
}
```