# 👀 TopViewRS: Vision-Language Models as Top-View Spatial Reasoners

A novel evaluation benchmark for spatial reasoning of vision-language models.

📄 [arXiv](https://arxiv.org/abs/2406.02537) · 🕸️ [Project Page] · 🤗 [Data](https://huggingface.co/datasets/chengzu/topviewrs)

## Key takeaways

- We define the top-view spatial reasoning task for VLMs via 4 carefully designed tasks of increasing complexity, encompassing 9 distinct fine-grained sub-tasks whose questions are structured to probe different model abilities.
- We collect the TopViewRS (Top-View Reasoning in Space) dataset, comprising 11,384 multiple-choice questions paired with either photo-realistic or semantic top-view maps of real-world scenarios.
- We evaluate 10 VLMs from different model families and of different sizes, highlighting the performance gap between them and human annotators.


## Dataset

Part of the benchmark is now available on Hugging Face: https://huggingface.co/datasets/chengzu/topviewrs.
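
For a quick look at the data, the released portion can be loaded with the Hugging Face `datasets` library. The sketch below is illustrative only: the configuration names, splits, and fields are defined on the dataset card, so check it before use.

```python
# Minimal sketch for loading the released portion of TopViewRS with the
# Hugging Face `datasets` library. The exact configuration/subset names are
# defined on the dataset card (https://huggingface.co/datasets/chengzu/topviewrs);
# passing one may be required, so treat this call as illustrative.
from datasets import load_dataset

# If the dataset defines multiple configurations, pass the one you need,
# e.g. load_dataset("chengzu/topviewrs", "<config_name>").
dataset = load_dataset("chengzu/topviewrs")

# Inspect the available splits and the fields of a single example.
print(dataset)
first_split = next(iter(dataset))
print(dataset[first_split][0])
```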

## Code

Coming soon.

## Citation

If you find TopViewRS useful in your work, please cite our paper:

    @misc{li2024topviewrs,
          title={TopViewRS: Vision-Language Models as Top-View Spatial Reasoners},
          author={Chengzu Li and Caiqi Zhang and Han Zhou and Nigel Collier and Anna Korhonen and Ivan Vulić},
          year={2024},
          eprint={2406.02537},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }