A Self-Adaptive Planning Agent For Multimodal RAG


Repo for the paper Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent.

🌏 The Chinese web demo is now available on ModelScope!

  • We propose OmniSearch, a self-adaptive retrieval agent that plans each retrieval action in real time according to the question-solving stage and the content retrieved so far (see the conceptual sketch after this list). To the best of our knowledge, OmniSearch is the first planning agent for multimodal RAG.
  • We reveal that existing VQA-based mRAG benchmarks fail to reflect that real-world questions require dynamic knowledge retrieval, and propose the novel Dyn-VQA dataset, which contains three types of dynamic questions.
  • We benchmark various mRAG methods with leading MLLMs on Dyn-VQA, demonstrating their shortcomings in providing sufficient and relevant knowledge for dynamic questions.
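
Below is a conceptual sketch, in Python, of the plan-retrieve-answer loop described above. All names, types, and the control flow are hypothetical illustrations of the idea, not the repository's implementation.

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Plan:
    action: str                   # "search" or "answer" (hypothetical schema)
    sub_query: str = ""           # next retrieval query, if action == "search"
    answer: Optional[str] = None  # final answer, if action == "answer"

def omnisearch_loop(question: str,
                    plan_step: Callable[[str, List[str]], Plan],
                    search: Callable[[str], str],
                    max_steps: int = 5) -> Optional[str]:
    """Re-plan the next retrieval action at each step until the planner decides to answer."""
    context: List[str] = []  # evidence retrieved so far
    for _ in range(max_steps):
        plan = plan_step(question, context)      # plan from the current solution stage
        if plan.action == "answer":
            return plan.answer
        context.append(search(plan.sub_query))   # execute the planned retrieval
    return None  # give up after max_steps retrieval rounds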

💡 Performance

The performance of various MLLMs with different mRAG strategies is shown below:

More analysis and experiments can be found in the paper.

📚 Dyn-VQA Dataset

Each JSON item in the Dyn-VQA dataset is organized in the following format:

{
    "image_url": "https://www.pcarmarket.com/static/media/uploads/galleries/photos/uploads/galleries/22387-pasewark-1986-porsche-944/.thumbnails/IMG_7102.JPG.jpg/IMG_7102.JPG-tiny-2048x0-0.5x0.jpg",
    "question": "What is the model of car from this brand?",
    "question_id": 'qid',
    "answer": ["保时捷 944", "Porsche 944."]
}
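
For reference, here is a minimal sketch of loading such items from a JSONL file (one JSON object per line, matching the --test_dataset argument used below; the helper name is ours, not part of the repo):

import json

def load_dyn_vqa(jsonl_path):
    """Yield Dyn-VQA items (image_url, question, question_id, answer) from a JSONL file."""
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example usage (hypothetical path):
# for item in load_dyn_vqa("path/to/dataset.jsonl"):
#     print(item["question"], item["answer"])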

🔥 Dyn-VQA will be updated regularly. Latest version: 202412.

🛠 Dependencies

pip install -r requirement.txt

Details

  • Python = 3.11.9
  • PyTorch (>= 2.0.0)
  • pillow = 10.4.0
  • requests = 2.32.3
  • google-search-results = 2.4.2
  • serpapi = 0.1.5

💻 Running OmniSearch

We have released the code of the GPT-4V-based OmniSearch for English questions.

Before running, please fill in your own OpenAI API key and Google Search key. The OpenAI key is on line 11 of main.py:

GPT_API_KEY = "your_actual_key_here"
headers = {
    "Authorization": f"Bearer {GPT_API_KEY}"
}
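
If you prefer not to hard-code the key, one common alternative (a sketch, not part of the repo) is to read it from an environment variable:

import os

# Hypothetical variant: read the key from the OPENAI_API_KEY environment
# variable instead of hard-coding it in main.py.
GPT_API_KEY = os.environ.get("OPENAI_API_KEY", "your_actual_key_here")
headers = {
    "Authorization": f"Bearer {GPT_API_KEY}"
}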

The Google Search key is on line 10 of search_api.py:

API_KEY = "your api-key"
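
The google-search-results dependency exposes a GoogleSearch client; below is a minimal sketch of the kind of call search_api.py presumably issues (the function name and parameters are illustrative, not the repo's exact code):

from serpapi import GoogleSearch  # provided by the google-search-results package

API_KEY = "your api-key"

def web_search(query):
    """Run a Google search via SerpAPI and return the parsed JSON results."""
    search = GoogleSearch({"q": query, "api_key": API_KEY, "num": 10})
    return search.get_dict()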

The results are saved to the following path:

output_path = os.path.join(meta_save_path, dataset_name, "output_from_gpt4v.jsonl")

Run the main.py file:

python main.py --test_dataset 'path/to/dataset.jsonl' --dataset_name NAME --meta_save_path 'path/to/results'

🔍 Evaluation

The evaluation script, which computes the token F1-Recall of the output answers, can be used as follows:

python evaluate.py --evaluate_file_path [path to output jsonl file] --lang [language of the QA dataset: en/zh]
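
For reference, here is a minimal sketch of how a token-level F1 / recall score against multiple gold answers is typically computed (evaluate.py may tokenize and normalize differently, especially for Chinese):

from collections import Counter

def token_f1_recall(prediction, gold_answers):
    """Return the best (F1, recall) of a predicted answer over all gold answers."""
    best_f1, best_recall = 0.0, 0.0
    pred_tokens = prediction.lower().split()
    for gold in gold_answers:
        gold_tokens = gold.lower().split()
        common = Counter(pred_tokens) & Counter(gold_tokens)
        overlap = sum(common.values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        f1 = 2 * precision * recall / (precision + recall)
        best_f1 = max(best_f1, f1)
        best_recall = max(best_recall, recall)
    return best_f1, best_recall

# Example usage: token_f1_recall(predicted_answer, item["answer"])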

🔥 TODO

  • Release the code for the Qwen-VL-Chat-based OmniSearch
  • Release the corresponding model weight
  • Create a benchmark for Dyn-VQA

📄 Acknowledgements

  • This repo is contributed by Xinyu Wang, Shuo Guo, Zhen Zhang, and Yangning Li.
  • This work was inspired by ReAct, Self-Ask, and FreshLLMs. Sincere thanks for their efforts.

📝 Citation

@article{li2024benchmarkingmultimodalretrievalaugmented,
      title={Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent}, 
      author={Yangning Li and Yinghui Li and Xinyu Wang and Yong Jiang and Zhen Zhang and Xinran Zheng and Hui Wang and Hai-Tao Zheng and Pengjun Xie and Philip S. Yu and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2411.02937},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.02937}, 
}

When citing our work, please also consider citing the original papers it builds on. The relevant citation information is listed here.