# 👀 TopViewRS: Vision-Language Models as Top-View Spatial Reasoners

A novel evaluation benchmark for spatial reasoning of vision-language models.

📄 [arXiv](https://arxiv.org/abs/2406.02537) · 🕸️ [Project Page] · 🤗 [Data](https://huggingface.co/datasets/chengzu/topviewrs)

## Key takeaways

- We define the top-view spatial reasoning task for VLMs via 4 carefully designed tasks of increasing complexity, encompassing 9 distinct fine-grained sub-tasks whose questions are structured to probe different model abilities.
- We collect the TopViewRS (Top-View Reasoning in Space) dataset, comprising 11,384 multiple-choice questions paired with either photo-realistic or semantic top-view maps of real-world scenarios.
- We evaluate 10 VLMs from different model families and of different sizes, highlighting the performance gap between them and human annotators.


## Dataset

Part of the benchmark is now available on Hugging Face: https://huggingface.co/datasets/chengzu/topviewrs.
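
For a quick look at the data, the released portion can be loaded with the Hugging Face `datasets` library. The sketch below is illustrative only: the configuration names, splits, and fields are defined on the dataset card, so check it before use.

```python
# Minimal sketch for loading the released portion of TopViewRS with the
# Hugging Face `datasets` library. The exact configuration/subset names are
# defined on the dataset card (https://huggingface.co/datasets/chengzu/topviewrs);
# passing one may be required, so treat this call as illustrative.
from datasets import load_dataset

# If the dataset defines multiple configurations, pass the one you need,
# e.g. load_dataset("chengzu/topviewrs", "<config_name>").
dataset = load_dataset("chengzu/topviewrs")

# Inspect the available splits and the fields of a single example.
print(dataset)
first_split = next(iter(dataset))
print(dataset[first_split][0])
```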

## Code

Coming soon.

## Citation

If you find TopViewRS useful in your work, please cite our paper:

    @misc{li2024topviewrs,
          title={TopViewRS: Vision-Language Models as Top-View Spatial Reasoners},
          author={Chengzu Li and Caiqi Zhang and Han Zhou and Nigel Collier and Anna Korhonen and Ivan Vulić},
          year={2024},
          eprint={2406.02537},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }