diff --git a/arxiv_trends_year.png b/arxiv_trends_year.png
index bda32a8..0986af6 100644
Binary files a/arxiv_trends_year.png and b/arxiv_trends_year.png differ
diff --git a/arxiv_visual_reasoning.jsonl b/arxiv_visual_reasoning.jsonl
index ef6c384..5c680b9 100644
--- a/arxiv_visual_reasoning.jsonl
+++ b/arxiv_visual_reasoning.jsonl
@@ -1,7 +1,15 @@
+{"entry_id": "2308.10562", "title": "Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories", "authors": ["Delfina Sol Martinez Pandiani", "Valentina Presutti"], "published": "2023-08-21 08:37:04", "updated": "2024-02-29 16:18:45", "summary": "The field of Computer Vision (CV) is increasingly shifting towards\n``high-level'' visual sensemaking tasks, yet the exact nature of these tasks\nremains unclear and tacit. This survey paper addresses this ambiguity by\nsystematically reviewing research on high-level visual understanding, focusing\nparticularly on Abstract Concepts (ACs) in automatic image classification. Our\nsurvey contributes in three main ways: Firstly, it clarifies the tacit\nunderstanding of high-level semantics in CV through a multidisciplinary\nanalysis, and categorization into distinct clusters, including commonsense,\nemotional, aesthetic, and inductive interpretative semantics. Secondly, it\nidentifies and categorizes computer vision tasks associated with high-level\nvisual sensemaking, offering insights into the diverse research areas within\nthis domain. Lastly, it examines how abstract concepts such as values and\nideologies are handled in CV, revealing challenges and opportunities in\nAC-based image classification. Notably, our survey of AC image classification\ntasks highlights persistent challenges, such as the limited efficacy of massive\ndatasets and the importance of integrating supplementary information and\nmid-level features. We emphasize the growing relevance of hybrid AI systems in\naddressing the multifaceted nature of AC image classification tasks. Overall,\nthis survey enhances our understanding of high-level visual reasoning in CV and\nlays the groundwork for future research endeavors.", "comment": "Preprint", "links": []}
+{"entry_id": "2310.04671", "title": "Visual Abductive Reasoning Meets Driving Hazard Prediction", "authors": ["Korawat Charoenpitaks", "Van-Quang Nguyen", "Masanori Suganuma", "Masahiro Takahashi", "Ryoma Niihara", "Takayuki Okatani"], "published": "2023-10-07 03:16:30", "updated": "2024-02-27 14:22:09", "summary": "This paper addresses the problem of predicting hazards that drivers may\nencounter while driving a car. We formulate it as a task of anticipating\nimpending accidents using a single input image captured by car dashcams. Unlike\nexisting approaches to driving hazard prediction that rely on computational\nsimulations or anomaly detection from videos, this study focuses on high-level\ninference from static images. The problem needs predicting and reasoning about\nfuture events based on uncertain observations, which falls under visual\nabductive reasoning. To enable research in this understudied area, a new\ndataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is\ncreated. The dataset consists of 15K dashcam images of street scenes, and each\nimage is associated with a tuple containing car speed, a hypothesized hazard\ndescription, and visual entities present in the scene. These are annotated by\nhuman annotators, who identify risky scenes and provide descriptions of\npotential accidents that could occur a few seconds later. We present several\nbaseline methods and evaluate their performance on our dataset, identifying\nremaining issues and discussing future directions. This study contributes to\nthe field by introducing a novel problem formulation and dataset, enabling\nresearchers to explore the potential of multi-modal AI for driving hazard\nprediction.", "comment": "Main Paper: 10 pages, Supplementary Materials: 28 pages", "links": []}
+{"entry_id": "2402.14818", "title": "PALO: A Polyglot Large Multimodal Model for 5B People", "authors": ["Muhammad Maaz", "Hanoona Rasheed", "Abdelrahman Shaker", "Salman Khan", "Hisham Cholakal", "Rao M. Anwer", "Tim Baldwin", "Michael Felsberg", "Fahad S. Khan"], "published": "2024-02-22 18:59:58", "updated": "2024-02-22 18:59:58", "summary": "In pursuit of more inclusive Vision-Language Models (VLMs), this study\nintroduces a Large Multilingual Multimodal Model called \\textsc{Palo}.\n\\textsc{Palo} offers visual reasoning capabilities in 10 major languages,\nincluding English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian,\nUrdu, and Japanese, that span a total of $\\sim$5B people (65\\% of the world\npopulation). Our approach involves a semi-automated translation approach to\nadapt the multimodal instruction dataset from English to the target languages\nusing a fine-tuned Large Language Model, thereby ensuring high linguistic\nfidelity while allowing scalability due to minimal manual effort. The\nincorporation of diverse instruction sets helps us boost overall performance\nacross multiple languages especially those that are underrepresented like\nHindi, Arabic, Bengali, and Urdu. The resulting models are trained across three\nscales (1.7B, 7B and 13B parameters) to show the generalization and scalability\nwhere we observe substantial improvements compared to strong baselines. We also\npropose the first multilingual multimodal benchmark for the forthcoming\napproaches to evaluate their vision-language reasoning capabilities across\nlanguages. Code: https://github.com/mbzuai-oryx/PALO.", "comment": "Technical Report of PALO", "links": []}
+{"entry_id": "2402.12675", "title": "Visual Reasoning in Object-Centric Deep Neural Networks: A Comparative Cognition Approach", "authors": ["Guillermo Puebla", "Jeffrey S. Bowers"], "published": "2024-02-20 02:48:14", "updated": "2024-02-20 02:48:14", "summary": "Achieving visual reasoning is a long-term goal of artificial intelligence. In\nthe last decade, several studies have applied deep neural networks (DNNs) to\nthe task of learning visual relations from images, with modest results in terms\nof generalization of the relations learned. However, in recent years,\nobject-centric representation learning has been put forward as a way to achieve\nvisual reasoning within the deep learning framework. Object-centric models\nattempt to model input scenes as compositions of objects and relations between\nthem. To this end, these models use several kinds of attention mechanisms to\nsegregate the individual objects in a scene from the background and from other\nobjects. In this work we tested relation learning and generalization in several\nobject-centric models, as well as a ResNet-50 baseline. In contrast to previous\nresearch, which has focused heavily in the same-different task in order to\nasses relational reasoning in DNNs, we use a set of tasks -- with varying\ndegrees of difficulty -- derived from the comparative cognition literature. Our\nresults show that object-centric models are able to segregate the different\nobjects in a scene, even in many out-of-distribution cases. In our simpler\ntasks, this improves their capacity to learn and generalize visual relations in\ncomparison to the ResNet-50 baseline. However, object-centric models still\nstruggle in our more difficult tasks and conditions. We conclude that abstract\nvisual reasoning remains an open challenge for DNNs, including object-centric\nmodels.", "comment": "16 pages, 14 figures", "links": []}
+{"entry_id": "2401.15847", "title": "Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA", "authors": ["Yue Fan", "Jing Gu", "Kaiwen Zhou", "Qianqi Yan", "Shan Jiang", "Ching-Chen Kuo", "Xinze Guan", "Xin Eric Wang"], "published": "2024-01-29 02:43:40", "updated": "2024-02-19 05:14:56", "summary": "Multipanel images, commonly seen as web screenshots, posters, etc., pervade\nour daily lives. These images, characterized by their composition of multiple\nsubfigures in distinct layouts, effectively convey information to people.\nToward building advanced multimodal AI applications, such as agents that\nunderstand complex scenes and navigate through webpages, the skill of\nmultipanel visual reasoning is essential, and a comprehensive evaluation of\nmodels in this regard is important. Therefore, we introduce Multipanel Visual\nQuestion Answering (MultipanelVQA), a novel benchmark comprising 6,600 triplets\nof questions, answers, and multipanel images that specifically challenge models\nin comprehending multipanel images. Our evaluation shows that questions in the\nMultipanelVQA benchmark pose significant challenges to the state-of-the-art\nLarge Vision Language Models (LVLMs) tested, even though humans can attain\napproximately 99\\% accuracy on these questions. Distinctively, the\nMultipanelVQA benchmark features synthetically generated multipanel images\nspecifically crafted to isolate and assess the impact of various factors, such\nas the layout, on LVLMs' multipanel image comprehension abilities. As a result,\nin addition to benchmarking the capabilities of LVLMs in understanding\nmultipanel images, we analyze the potential causes for LVLMs' performance and\noffer insights for enhancement with the synthetic data. Code and data are\nreleased at https://sites.google.com/view/multipanelvqa/home.", "comment": null, "links": []}
+{"entry_id": "2402.11574", "title": "Visual In-Context Learning for Large Vision-Language Models", "authors": ["Yucheng Zhou", "Xiang Li", "Qianning Wang", "Jianbing Shen"], "published": "2024-02-18 12:43:38", "updated": "2024-02-18 12:43:38", "summary": "In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning\n(ICL) remains limited by challenges in cross-modal interactions and\nrepresentation disparities. To overcome these challenges, we introduce a novel\nVisual In-Context Learning (VICL) method comprising Visual Demonstration\nRetrieval, Intent-Oriented Image Summarization, and Intent-Oriented\nDemonstration Composition. Our approach retrieves images via ''Retrieval &\nRerank'' paradigm, summarises images with task intent and task-specific visual\nparsing, and composes language-based demonstrations that reduce token count and\nalleviate cross-modal interaction problem. Experimental evaluations on five\nvisual reasoning datasets demonstrate the effectiveness of our method.\nMoreover, our extensive experiments leverage information flow analysis to\nelucidate the effectiveness of our method, and investigate the impact of length\nand position of demonstrations for LVLM. The use of in-context unlearning\nfurther shows promise in resetting specific model knowledge without retraining.", "comment": "13 pages, 7 figures", "links": []}
+{"entry_id": "2402.04236", "title": "CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations", "authors": ["Ji Qi", "Ming Ding", "Weihan Wang", "Yushi Bai", "Qingsong Lv", "Wenyi Hong", "Bin Xu", "Lei Hou", "Juanzi Li", "Yuxiao Dong", "Jie Tang"], "published": "2024-02-06 18:43:48", "updated": "2024-02-06 18:43:48", "summary": "Vision-Language Models (VLMs) have demonstrated their widespread viability\nthanks to extensive training in aligning visual instructions to answers.\nHowever, this conclusive alignment leads models to ignore critical visual\nreasoning, and further result in failures on meticulous visual problems and\nunfaithful responses. In this paper, we propose Chain of Manipulations, a\nmechanism that enables VLMs to solve problems with a series of manipulations,\nwhere each manipulation refers to an operation on the visual input, either from\nintrinsic abilities (e.g., grounding) acquired through prior training or from\nimitating human-like behaviors (e.g., zoom in). This mechanism encourages VLMs\nto generate faithful responses with evidential visual reasoning, and permits\nusers to trace error causes in the interpretable paths. We thus train CogCoM, a\ngeneral 17B VLM with a memory-based compatible architecture endowed this\nreasoning mechanism. Experiments show that our model achieves the\nstate-of-the-art performance across 8 benchmarks from 3 categories, and a\nlimited number of training steps with the data swiftly gains a competitive\nperformance. The code and data are publicly available at\nhttps://github.com/THUDM/CogCoM.", "comment": "17 pages, 7 figures", "links": []}
+{"entry_id": "2402.03507", "title": "Neural networks for abstraction and reasoning: Towards broad generalization in machines", "authors": ["Mikel Bober-Irizar", "Soumya Banerjee"], "published": "2024-02-05 20:48:57", "updated": "2024-02-05 20:48:57", "summary": "For half a century, artificial intelligence research has attempted to\nreproduce the human qualities of abstraction and reasoning - creating computer\nsystems that can learn new concepts from a minimal set of examples, in settings\nwhere humans find this easy. While specific neural networks are able to solve\nan impressive range of problems, broad generalisation to situations outside\ntheir training data has proved elusive.In this work, we look at several novel\napproaches for solving the Abstraction & Reasoning Corpus (ARC), a dataset of\nabstract visual reasoning tasks introduced to test algorithms on broad\ngeneralization. Despite three international competitions with $100,000 in\nprizes, the best algorithms still fail to solve a majority of ARC tasks and\nrely on complex hand-crafted rules, without using machine learning at all. We\nrevisit whether recent advances in neural networks allow progress on this task.\n First, we adapt the DreamCoder neurosymbolic reasoning solver to ARC.\nDreamCoder automatically writes programs in a bespoke domain-specific language\nto perform reasoning, using a neural network to mimic human intuition. We\npresent the Perceptual Abstraction and Reasoning Language (PeARL) language,\nwhich allows DreamCoder to solve ARC tasks, and propose a new recognition model\nthat allows us to significantly improve on the previous best implementation.We\nalso propose a new encoding and augmentation scheme that allows large language\nmodels (LLMs) to solve ARC tasks, and find that the largest models can solve\nsome ARC tasks. LLMs are able to solve a different group of problems to\nstate-of-the-art solvers, and provide an interesting way to complement other\napproaches. We perform an ensemble analysis, combining models to achieve better\nresults than any system alone. Finally, we publish the arckit Python library to\nmake future research on ARC easier.", "comment": "32 pages main text, 17 pages", "links": []}
+{"entry_id": "2401.04181", "title": "Language-Conditioned Robotic Manipulation with Fast and Slow Thinking", "authors": ["Minjie Zhu", "Yichen Zhu", "Jinming Li", "Junjie Wen", "Zhiyuan Xu", "Zhengping Che", "Chaomin Shen", "Yaxin Peng", "Dong Liu", "Feifei Feng", "Jian Tang"], "published": "2024-01-08 19:00:32", "updated": "2024-02-01 08:32:33", "summary": "The language-conditioned robotic manipulation aims to transfer natural\nlanguage instructions into executable actions, from simple pick-and-place to\ntasks requiring intent recognition and visual reasoning. Inspired by the dual\nprocess theory in cognitive science, which suggests two parallel systems of\nfast and slow thinking in human decision-making, we introduce Robotics with\nFast and Slow Thinking (RFST), a framework that mimics human cognitive\narchitecture to classify tasks and makes decisions on two systems based on\ninstruction types. Our RFST consists of two key components: 1) an instruction\ndiscriminator to determine which system should be activated based on the\ncurrent user instruction, and 2) a slow-thinking system that is comprised of a\nfine-tuned vision language model aligned with the policy networks, which allows\nthe robot to recognize user intention or perform reasoning tasks. To assess our\nmethodology, we built a dataset featuring real-world trajectories, capturing\nactions ranging from spontaneous impulses to tasks requiring deliberate\ncontemplation. Our results, both in simulation and real-world scenarios,\nconfirm that our approach adeptly manages intricate tasks that demand intent\nrecognition and reasoning. The project is available at\nhttps://jlm-z.github.io/RSFT/", "comment": "accepted to ICRA2024", "links": []}
{"entry_id": "2308.08334", "title": "Learning logic programs by discovering higher-order abstractions", "authors": ["Céline Hocquette", "Sebastijan Dumančić", "Andrew Cropper"], "published": "2023-08-16 12:50:10", "updated": "2024-01-29 18:34:39", "summary": "We introduce the higher-order refactoring problem, where the goal is to\ncompress a logic program by discovering higher-order abstractions, such as map,\nfilter, and fold. We implement our approach in Stevie, which formulates the\nrefactoring problem as a constraint optimisation problem. Our experiments on\nmultiple domains, including program synthesis and visual reasoning, show that\nrefactoring can improve the learning performance of an inductive logic\nprogramming system, specifically improving predictive accuracies by 27% and\nreducing learning times by 47%. We also show that Stevie can discover\nabstractions that transfer to multiple domains.", "comment": null, "links": []}
{"entry_id": "2401.16024", "title": "Probabilistic Abduction for Visual Abstract Reasoning via Learning Rules in Vector-symbolic Architectures", "authors": ["Michael Hersche", "Francesco di Stefano", "Thomas Hofmann", "Abu Sebastian", "Abbas Rahimi"], "published": "2024-01-29 10:17:18", "updated": "2024-01-29 10:17:18", "summary": "Abstract reasoning is a cornerstone of human intelligence, and replicating it\nwith artificial intelligence (AI) presents an ongoing challenge. This study\nfocuses on efficiently solving Raven's progressive matrices (RPM), a visual\ntest for assessing abstract reasoning abilities, by using distributed\ncomputation and operators provided by vector-symbolic architectures (VSA).\nInstead of hard-coding the rule formulations associated with RPMs, our approach\ncan learn the VSA rule formulations (hence the name Learn-VRF) with just one\npass through the training data. Yet, our approach, with compact parameters,\nremains transparent and interpretable. Learn-VRF yields accurate predictions on\nI-RAVEN's in-distribution data, and exhibits strong out-of-distribution\ncapabilities concerning unseen attribute-rule pairs, significantly\noutperforming pure connectionist baselines including large language models. Our\ncode is available at\nhttps://github.com/IBM/learn-vector-symbolic-architectures-rule-formulations.", "comment": "Accepted in NeurIPS 2023 Workshop on MATH-AI", "links": []}
{"entry_id": "2312.15915", "title": "ChartBench: A Benchmark for Complex Visual Reasoning in Charts", "authors": ["Zhengzhuo Xu", "Sinan Du", "Yiyan Qi", "Chengjin Xu", "Chun Yuan", "Jian Guo"], "published": "2023-12-26 07:20:55", "updated": "2024-01-29 03:04:20", "summary": "Multimodal Large Language Models (MLLMs) demonstrate impressive image\nunderstanding and generating capabilities. However, existing benchmarks employ\nlimited charts that deviate from real-world scenarios, posing challenges in\naccurately assessing the chart comprehension of MLLMs. To overcome this\nconstraint, we propose ChartBench, an exhaustive chart benchmark specifically\ndesigned to evaluate MLLMs' chart comprehension and data reliability through\ncomplex visual reasoning. ChartBench encompasses a wide spectrum, including 42\ncategories, 2.1K charts, and 16.8K question-answer pairs. Diverging from\nprevious benchmarks, ChartBench avoids employing data point annotation charts\nor metadata prompts directly. Instead, it compels MLLMs to derive values akin\nto human understanding by leveraging inherent chart elements such as color,\nlegends, or coordinate systems. Additionally, we propose an enhanced evaluation\nmetric, Acc+, which facilitates the evaluation of MLLMs without needing\nlabor-intensive manual efforts or costly evaluations based on GPT. Our\nextensive experimental evaluation involves 12 widely-used open-sourced and 2\nproprietary MLLMs, revealing the limitations of MLLMs in interpreting charts\nand providing valuable insights to encourage closer scrutiny of this aspect.", "comment": null, "links": []}
-{"entry_id": "2401.15847", "title": "Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA", "authors": ["Yue Fan", "Jing Gu", "Kaiwen Zhou", "Qianqi Yan", "Shan Jiang", "Ching-Chen Kuo", "Xinze Guan", "Xin Eric Wang"], "published": "2024-01-29 02:43:40", "updated": "2024-01-29 02:43:40", "summary": "Multipanel images, commonly seen as web screenshots, posters, etc., pervade\nour daily lives. These images, characterized by their composition of multiple\nsubfigures in distinct layouts, effectively convey information to people.\nToward building advanced multimodal AI applications, such as agents that\nunderstand complex scenes and navigate through webpages, the skill of\nmultipanel visual reasoning is essential, and a comprehensive evaluation of\nmodels in this regard is important. Therefore, our paper introduces Multipanel\nVisual Question Answering (MultipanelVQA), a novel benchmark that specifically\nchallenges models in comprehending multipanel images. The benchmark comprises\n6,600 questions and answers related to multipanel images. While these questions\nare straightforward for average humans, achieving nearly perfect correctness,\nthey pose significant challenges to the state-of-the-art Large Vision Language\nModels (LVLMs) we tested. In our study, we utilized synthetically curated\nmultipanel images specifically designed to isolate and evaluate the impact of\ndiverse factors on model performance, revealing the sensitivity of LVLMs to\nvarious interferences in multipanel images, such as adjacent subfigures and\nlayout complexity. As a result, MultipanelVQA highlights the need and direction\nfor improving LVLMs' ability to understand complex visual-language contexts.\nCode and data are released at https://sites.google.com/view/multipanelvqa/home.", "comment": null, "links": []}
{"entry_id": "2401.13311", "title": "ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models", "authors": ["Rohan Wadhawan", "Hritik Bansal", "Kai-Wei Chang", "Nanyun Peng"], "published": "2024-01-24 09:07:11", "updated": "2024-01-24 09:07:11", "summary": "Recent advancements in AI have led to the development of large multimodal\nmodels (LMMs) capable of processing complex tasks involving joint reasoning\nover text and visual content in the image (e.g., navigating maps in public\nplaces). This paper introduces ConTextual, a novel benchmark comprising\ninstructions designed explicitly to evaluate LMMs' ability to perform\ncontext-sensitive text-rich visual reasoning. ConTextual emphasizes diverse\nreal-world scenarios (e.g., time-reading, navigation, shopping and more)\ndemanding a deeper understanding of the interactions between textual and visual\nelements. Our findings reveal a significant performance gap of 30.8% between\nthe best-performing LMM, GPT-4V(ision), and human capabilities using human\nevaluation indicating substantial room for improvement in context-sensitive\ntext-rich visual reasoning. Notably, while GPT-4V excelled in abstract\ncategories like meme and quote interpretation, its overall performance still\nlagged behind humans. In addition to human evaluations, we also employed\nautomatic evaluation metrics using GPT-4, uncovering similar trends in\nperformance disparities. We also perform a fine-grained evaluation across\ndiverse visual contexts and provide qualitative analysis which provides a\nrobust framework for future advancements in the LMM design.\nhttps://con-textual.github.io/", "comment": null, "links": []}
{"entry_id": "2306.17778", "title": "Look, Remember and Reason: Grounded reasoning in videos with language models", "authors": ["Apratim Bhattacharyya", "Sunny Panchal", "Mingu Lee", "Reza Pourreza", "Pulkit Madan", "Roland Memisevic"], "published": "2023-06-30 16:31:14", "updated": "2024-01-22 00:54:30", "summary": "Multi-modal language models (LM) have recently shown promising performance in\nhigh-level reasoning tasks on videos. However, existing methods still fall\nshort in tasks like causal or compositional spatiotemporal reasoning over\nactions, in which model predictions need to be grounded in fine-grained\nlow-level details, such as object motions and object interactions. In this\nwork, we propose training an LM end-to-end on low-level surrogate tasks,\nincluding object detection, re-identification, and tracking, to endow the model\nwith the required low-level visual capabilities. We show that a two-stream\nvideo encoder with spatiotemporal attention is effective at capturing the\nrequired static and motion-based cues in the video. By leveraging the LM's\nability to perform the low-level surrogate tasks, we can cast reasoning in\nvideos as the three-step process of Look, Remember, Reason wherein visual\ninformation is extracted using low-level visual skills step-by-step and then\nintegrated to arrive at a final answer. We demonstrate the effectiveness of our\nframework on diverse visual reasoning tasks from the ACRE, CATER,\nSomething-Else and STAR datasets. Our approach is trainable end-to-end and\nsurpasses state-of-the-art task-specific methods across these tasks by a large\nmargin.", "comment": "To appear at ICLR 2024", "links": []}
{"entry_id": "2309.01409", "title": "Implicit Neural Image Stitching", "authors": ["Minsu Kim", "Jaewon Lee", "Byeonghun Lee", "Sunghoon Im", "Kyong Hwan Jin"], "published": "2023-09-04 07:40:30", "updated": "2024-01-22 00:22:14", "summary": "Existing frameworks for image stitching often provide visually reasonable\nstitchings. However, they suffer from blurry artifacts and disparities in\nillumination, depth level, etc. Although the recent learning-based stitchings\nrelax such disparities, the required methods impose sacrifice of image\nqualities failing to capture high-frequency details for stitched images. To\naddress the problem, we propose a novel approach, implicit Neural Image\nStitching (NIS) that extends arbitrary-scale super-resolution. Our method\nestimates Fourier coefficients of images for quality-enhancing warps. Then, the\nsuggested model blends color mismatches and misalignment in the latent space\nand decodes the features into RGB values of stitched images. Our experiments\nshow that our approach achieves improvement in resolving the low-definition\nimaging of the previous deep image stitching with favorable accelerated\nimage-enhancing methods. Our source code is available at\nhttps://github.com/minshu-kim/NIS.", "comment": null, "links": []}
@@ -9,7 +17,6 @@
{"entry_id": "2212.08044", "title": "Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift", "authors": ["Jielin Qiu", "Yi Zhu", "Xingjian Shi", "Florian Wenzel", "Zhiqiang Tang", "Ding Zhao", "Bo Li", "Mu Li"], "published": "2022-12-15 18:52:03", "updated": "2024-01-19 15:29:34", "summary": "Multimodal image-text models have shown remarkable performance in the past\nfew years. However, evaluating robustness against distribution shifts is\ncrucial before adopting them in real-world applications. In this work, we\ninvestigate the robustness of 12 popular open-sourced image-text models under\ncommon perturbations on five tasks (image-text retrieval, visual reasoning,\nvisual entailment, image captioning, and text-to-image generation). In\nparticular, we propose several new multimodal robustness benchmarks by applying\n17 image perturbation and 16 text perturbation techniques on top of existing\ndatasets. We observe that multimodal models are not robust to image and text\nperturbations, especially to image perturbations. Among the tested perturbation\nmethods, character-level perturbations constitute the most severe distribution\nshift for text, and zoom blur is the most severe shift for image data. We also\nintroduce two new robustness metrics (\\textbf{MMI} for MultiModal Impact score\nand \\textbf{MOR} for Missing Object Rate) for proper evaluations of multimodal\nmodels. We hope our extensive study sheds light on new directions for the\ndevelopment of robust multimodal models. More details can be found on the\nproject webpage: \\url{https://MMRobustness.github.io}.", "comment": "Accepted by Journal of Data-centric Machine Learning Research (DMLR)\n 2024", "links": []}
{"entry_id": "2401.09966", "title": "Towards Generative Abstract Reasoning: Completing Raven's Progressive Matrix via Rule Abstraction and Selection", "authors": ["Fan Shi", "Bin Li", "Xiangyang Xue"], "published": "2024-01-18 13:28:44", "updated": "2024-01-18 13:28:44", "summary": "Endowing machines with abstract reasoning ability has been a long-term\nresearch topic in artificial intelligence. Raven's Progressive Matrix (RPM) is\nwidely used to probe abstract visual reasoning in machine intelligence, where\nmodels need to understand the underlying rules and select the missing\nbottom-right images out of candidate sets to complete image matrices. The\nparticipators can display powerful reasoning ability by inferring the\nunderlying attribute-changing rules and imagining the missing images at\narbitrary positions. However, existing solvers can hardly manifest such an\nability in realistic RPM problems. In this paper, we propose a conditional\ngenerative model to solve answer generation problems through Rule AbstractIon\nand SElection (RAISE) in the latent space. RAISE encodes image attributes as\nlatent concepts and decomposes underlying rules into atomic rules by means of\nconcepts, which are abstracted as global learnable parameters. When generating\nthe answer, RAISE selects proper atomic rules out of the global knowledge set\nfor each concept and composes them into the integrated rule of an RPM. In most\nconfigurations, RAISE outperforms the compared generative solvers in tasks of\ngenerating bottom-right and arbitrary-position answers. We test RAISE in the\nodd-one-out task and two held-out configurations to demonstrate how learning\ndecoupled latent concepts and atomic rules helps find the image breaking the\nunderlying rules and handle RPMs with unseen combinations of rules and\nattributes.", "comment": null, "links": []}
{"entry_id": "2401.08695", "title": "Enabling Collaborative Clinical Diagnosis of Infectious Keratitis by Integrating Expert Knowledge and Interpretable Data-driven Intelligence", "authors": ["Zhengqing Fang", "Shuowen Zhou", "Zhouhang Yuan", "Yuxuan Si", "Mengze Li", "Jinxu Li", "Yesheng Xu", "Wenjia Xie", "Kun Kuang", "Yingming Li", "Fei Wu", "Yu-Feng Yao"], "published": "2024-01-14 02:10:54", "updated": "2024-01-14 02:10:54", "summary": "Although data-driven artificial intelligence (AI) in medical image diagnosis\nhas shown impressive performance in silico, the lack of interpretability makes\nit difficult to incorporate the \"black box\" into clinicians' workflows. To make\nthe diagnostic patterns learned from data understandable by clinicians, we\ndevelop an interpretable model, knowledge-guided diagnosis model (KGDM), that\nprovides a visualized reasoning process containing AI-based biomarkers and\nretrieved cases that with the same diagnostic patterns. It embraces clinicians'\nprompts into the interpreted reasoning through human-AI interaction, leading to\npotentially enhanced safety and more accurate predictions. This study\ninvestigates the performance, interpretability, and clinical utility of KGDM in\nthe diagnosis of infectious keratitis (IK), which is the leading cause of\ncorneal blindness. The classification performance of KGDM is evaluated on a\nprospective validation dataset, an external testing dataset, and an publicly\navailable testing dataset. The diagnostic odds ratios (DOR) of the interpreted\nAI-based biomarkers are effective, ranging from 3.011 to 35.233 and exhibit\nconsistent diagnostic patterns with clinic experience. Moreover, a human-AI\ncollaborative diagnosis test is conducted and the participants with\ncollaboration achieved a performance exceeding that of both humans and AI. By\nsynergistically integrating interpretability and interaction, this study\nfacilitates the convergence of clinicians' expertise and data-driven\nintelligence. The promotion of inexperienced ophthalmologists with the aid of\nAI-based biomarkers, as well as increased AI prediction by intervention from\nexperienced ones, demonstrate a promising diagnostic paradigm for infectious\nkeratitis using KGDM, which holds the potential for extension to other diseases\nwhere experienced medical practitioners are limited and the safety of AI is\nconcerned.", "comment": "33 pages", "links": []}
-{"entry_id": "2401.04181", "title": "Language-Conditioned Robotic Manipulation with Fast and Slow Thinking", "authors": ["Minjie Zhu", "Yichen Zhu", "Jinming Li", "Junjie Wen", "Zhiyuan Xu", "Zhengping Che", "Chaomin Shen", "Yaxin Peng", "Dong Liu", "Feifei Feng", "Jian Tang"], "published": "2024-01-08 19:00:32", "updated": "2024-01-08 19:00:32", "summary": "The language-conditioned robotic manipulation aims to transfer natural\nlanguage instructions into executable actions, from simple pick-and-place to\ntasks requiring intent recognition and visual reasoning. Inspired by the dual\nprocess theory in cognitive science, which suggests two parallel systems of\nfast and slow thinking in human decision-making, we introduce Robotics with\nFast and Slow Thinking (RFST), a framework that mimics human cognitive\narchitecture to classify tasks and makes decisions on two systems based on\ninstruction types. Our RFST consists of two key components: 1) an instruction\ndiscriminator to determine which system should be activated based on the\ncurrent user instruction, and 2) a slow-thinking system that is comprised of a\nfine-tuned vision language model aligned with the policy networks, which allows\nthe robot to recognize user intention or perform reasoning tasks. To assess our\nmethodology, we built a dataset featuring real-world trajectories, capturing\nactions ranging from spontaneous impulses to tasks requiring deliberate\ncontemplation. Our results, both in simulation and real-world scenarios,\nconfirm that our approach adeptly manages intricate tasks that demand intent\nrecognition and reasoning. The project is available at\nhttps://jlm-z.github.io/RSFT/", "comment": "submitted to ICRA2024", "links": []}
{"entry_id": "2303.10428", "title": "A Region-Prompted Adapter Tuning for Visual Abductive Reasoning", "authors": ["Hao Zhang", "Yeo Keat Ee", "Basura Fernando"], "published": "2023-03-18 14:46:44", "updated": "2024-01-07 05:06:26", "summary": "Visual Abductive Reasoning is an emerging vision-language (VL) topic where\nthe model needs to retrieve/generate a likely textual hypothesis from a visual\ninput (image or its part) using backward reasoning based on commonsense. Unlike\nin conventional VL retrieval or captioning tasks, where entities of texts\nappear in the image, in abductive inferences, the relevant facts about\ninferences are not readily apparent in the input images. Besides, these\ninferences are causally linked to specific regional visual cues and would\nchange as cues change. Existing works highlight cues utilizing a specific\nprompt (e.g., colorful prompt). Then, a full fine-tuning of a VL foundation\nmodel is launched to tweak its function from perception to deduction. However,\nthe colorful prompt uniformly patchify ``regional hints'' and ``global\ncontext'' at the same granularity level and may lose fine-grained visual\ndetails crucial for VAR. Meanwhile, full fine-tuning of VLF on limited data\nwould easily be overfitted.\n To tackle this, we propose a simple yet effective Region-Prompted Adapter\n(RPA), a hybrid parameter-efficient fine-tuning method that leverages the\nstrengths of detailed cues and efficient training for the VAR task.\nRPA~consists of two novel modules: Regional Prompt Generator (RPG) and\nAdapter$^\\textbf{+}$. The prior encodes ``regional visual hints'' and ``global\ncontexts'' into visual prompts separately at fine and coarse-grained levels.\nThe latter extends the vanilla adapters with a new Map Adapter, which modifies\nthe attention map using a trainable low-dim query/key projection. Additionally,\nwe propose a new Dual-Contrastive Loss to regress the visual feature toward\nfeatures of factual description and plausible hypothesis. Experiments on the\nSherlock demonstrate that RPA outperforms previous SOTAs, achieving the 1st\nrank on leaderboards (Comparison to Human Accuracy: RPA~31.74 vs CPT-CLIP\n29.58).", "comment": "13 pages, 11 figures, Under Review of IEEE Transaction", "links": []}
{"entry_id": "2401.01974", "title": "Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers", "authors": ["Aleksandar Stanić", "Sergi Caelles", "Michael Tschannen"], "published": "2024-01-03 20:48:47", "updated": "2024-01-03 20:48:47", "summary": "Visual reasoning is dominated by end-to-end neural networks scaled to\nbillions of model parameters and training examples. However, even the largest\nmodels struggle with compositional reasoning, generalization, fine-grained\nspatial and temporal reasoning, and counting. Visual reasoning with large\nlanguage models (LLMs) as controllers can, in principle, address these\nlimitations by decomposing the task and solving subtasks by orchestrating a set\nof (visual) tools. Recently, these models achieved great performance on tasks\nsuch as compositional visual question answering, visual grounding, and video\ntemporal reasoning. Nevertheless, in their current form, these models heavily\nrely on human engineering of in-context examples in the prompt, which are often\ndataset- and task-specific and require significant labor by highly skilled\nprogrammers. In this work, we present a framework that mitigates these issues\nby introducing spatially and temporally abstract routines and by leveraging a\nsmall number of labeled examples to automatically generate in-context examples,\nthereby avoiding human-created in-context examples. On a number of visual\nreasoning tasks, we show that our framework leads to consistent gains in\nperformance, makes LLMs as controllers setup more robust, and removes the need\nfor human engineering of in-context examples.", "comment": null, "links": []}
{"entry_id": "2301.13335", "title": "Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning", "authors": ["Jian Zhu", "Hanli Wang", "Miaojing Shi"], "published": "2023-01-30 23:43:28", "updated": "2023-12-25 12:59:02", "summary": "The visual commonsense reasoning (VCR) task is to choose an answer and\nprovide a justifying rationale based on the given image and textural question.\nRepresentative works first recognize objects in images and then associate them\nwith key words in texts. However, existing approaches do not consider exact\npositions of objects in a human-like three-dimensional (3D) manner, making them\nincompetent to accurately distinguish objects and understand visual relation.\nRecently, multi-modal large language models (MLLMs) have been used as powerful\ntools for several multi-modal tasks but not for VCR yet, which requires\nelaborate reasoning on specific visual objects referred by texts. In light of\nthe above, an MLLM enhanced pseudo 3D perception framework is designed for VCR.\nSpecifically, we first demonstrate that the relation between objects is\nrelevant to object depths in images, and hence introduce object depth into VCR\nframeworks to infer 3D positions of objects in images. Then, a depth-aware\nTransformer is proposed to encode depth differences between objects into the\nattention mechanism of Transformer to discriminatively associate objects with\nvisual scenes guided by depth. To further associate the answer with the depth\nof visual scene, each word in the answer is tagged with a pseudo depth to\nrealize depth-aware association between answer words and objects. On the other\nhand, BLIP-2 as an MLLM is employed to process images and texts, and the\nreferring expressions in texts involving specific visual objects are modified\nwith linguistic object labels to serve as comprehensible MLLM inputs. Finally,\na parameter optimization technique is devised to fully consider the quality of\ndata batches based on multi-level reasoning confidence. Experiments on the VCR\ndataset demonstrate the superiority of the proposed framework over\nstate-of-the-art approaches.", "comment": null, "links": []}
@@ -58,7 +65,6 @@
{"entry_id": "2310.10207", "title": "Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World", "authors": ["Rujie Wu", "Xiaojian Ma", "Qing Li", "Wei Wang", "Zhenliang Zhang", "Song-Chun Zhu", "Yizhou Wang"], "published": "2023-10-16 09:19:18", "updated": "2023-10-16 09:19:18", "summary": "We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world\nfew-shot reasoning for machine vision. It originates from the classical Bongard\nProblems (BPs): Given two sets of images (positive and negative), the model\nneeds to identify the set that query images belong to by inducing the visual\nconcepts, which is exclusively depicted by images from the positive set. Our\nbenchmark inherits the few-shot concept induction of the original BPs while\nadding the two novel layers of challenge: 1) open-world free-form concepts, as\nthe visual concepts in Bongard-OpenWorld are unique compositions of terms from\nan open vocabulary, ranging from object categories to abstract visual\nattributes and commonsense factual knowledge; 2) real-world images, as opposed\nto the synthetic diagrams used by many counterparts. In our exploration,\nBongard-OpenWorld already imposes a significant challenge to current few-shot\nreasoning algorithms. We further investigate to which extent the recently\nintroduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can\nsolve our task, by directly probing VLMs, and combining VLMs and LLMs in an\ninteractive reasoning scheme. We even designed a neuro-symbolic reasoning\napproach that reconciles LLMs & VLMs with logical reasoning to emulate the\nhuman problem-solving process for Bongard Problems. However, none of these\napproaches manage to close the human-machine gap, as the best learner achieves\n64% accuracy while human participants easily reach 91%. We hope\nBongard-OpenWorld can help us better understand the limitations of current\nvisual intelligence and facilitate future research on visual agents with\nstronger few-shot visual reasoning capabilities.", "comment": "Project page: https://joyjayng.github.io/Bongard-OpenWorld.github.io", "links": []}
{"entry_id": "2309.16705", "title": "Multimodal Analysis Of Google Bard And GPT-Vision: Experiments In Visual Reasoning", "authors": ["David Noever", "Samantha Elizabeth Miller Noever"], "published": "2023-08-17 03:14:00", "updated": "2023-10-14 19:53:39", "summary": "Addressing the gap in understanding visual comprehension in Large Language\nModels (LLMs), we designed a challenge-response study, subjecting Google Bard\nand GPT-Vision to 64 visual tasks, spanning categories like \"Visual Situational\nReasoning\" and \"Next Scene Prediction.\" Previous models, such as GPT4, leaned\nheavily on optical character recognition tools like Tesseract, whereas Bard and\nGPT-Vision, akin to Google Lens and Visual API, employ deep learning techniques\nfor visual text recognition. However, our findings spotlight both\nvision-language model's limitations: while proficient in solving visual\nCAPTCHAs that stump ChatGPT alone, it falters in recreating visual elements\nlike ASCII art or analyzing Tic Tac Toe grids, suggesting an over-reliance on\neducated visual guesses. The prediction problem based on visual inputs appears\nparticularly challenging with no common-sense guesses for next-scene\nforecasting based on current \"next-token\" multimodal models. This study\nprovides experimental insights into the current capacities and areas for\nimprovement in multimodal LLMs.", "comment": null, "links": []}
{"entry_id": "2307.03601", "title": "GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest", "authors": ["Shilong Zhang", "Peize Sun", "Shoufa Chen", "Min Xiao", "Wenqi Shao", "Wenwei Zhang", "Yu Liu", "Kai Chen", "Ping Luo"], "published": "2023-07-07 13:43:44", "updated": "2023-10-13 03:25:34", "summary": "Visual instruction tuning large language model(LLM) on image-text pairs has\nachieved general-purpose vision-language abilities. However, the lack of\nregion-text pairs limits their advancements to fine-grained multimodal\nunderstanding. In this paper, we propose spatial instruction tuning, which\nintroduces the reference to the region-of-interest(RoI) in the instruction.\nBefore sending to LLM, the reference is replaced by RoI features and\ninterleaved with language embeddings as a sequence. Our model GPT4RoI, trained\non 7 region-text pair datasets, brings an unprecedented interactive and\nconversational experience compared to previous image-level models. (1)\nInteraction beyond language: Users can interact with our model by both language\nand drawing bounding boxes to flexibly adjust the referring granularity. (2)\nVersatile multimodal abilities: A variety of attribute information within each\nRoI can be mined by GPT4RoI, e.g., color, shape, material, action, etc.\nFurthermore, it can reason about multiple RoIs based on common sense. On the\nVisual Commonsense Reasoning(VCR) dataset, GPT4RoI achieves a remarkable\naccuracy of 81.6%, surpassing all existing models by a significant margin (the\nsecond place is 75.6%) and almost reaching human-level performance of 85.0%.\nThe code, dataset, and demo can be found at\nhttps://github.com/jshilong/GPT4RoI.", "comment": "Code has been released at https://github.com/jshilong/GPT4RoI", "links": []}
-{"entry_id": "2310.04671", "title": "Visual Abductive Reasoning Meets Driving Hazard Prediction: Problem Formulation and Dataset", "authors": ["Korawat Charoenpitaks", "Van-Quang Nguyen", "Masanori Suganuma", "Masahiro Takahashi", "Ryoma Niihara", "Takayuki Okatani"], "published": "2023-10-07 03:16:30", "updated": "2023-10-10 02:31:24", "summary": "This paper addresses the problem of predicting hazards that drivers may\nencounter while driving a car. We formulate it as a task of anticipating\nimpending accidents using a single input image captured by car dashcams. Unlike\nexisting approaches to driving hazard prediction that rely on computational\nsimulations or anomaly detection from videos, this study focuses on high-level\ninference from static images. The problem needs predicting and reasoning about\nfuture events based on uncertain observations, which falls under visual\nabductive reasoning. To enable research in this understudied area, a new\ndataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is\ncreated. The dataset consists of 15K dashcam images of street scenes, and each\nimage is associated with a tuple containing car speed, a hypothesized hazard\ndescription, and visual entities present in the scene. These are annotated by\nhuman annotators, who identify risky scenes and provide descriptions of\npotential accidents that could occur a few seconds later. We present several\nbaseline methods and evaluate their performance on our dataset, identifying\nremaining issues and discussing future directions. This study contributes to\nthe field by introducing a novel problem formulation and dataset, enabling\nresearchers to explore the potential of multi-modal AI for driving hazard\nprediction.", "comment": "Main Paper: 10 pages, Supplementary Materials: 25 pages", "links": []}
{"entry_id": "2310.05872", "title": "ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models", "authors": ["Kaiwen Zhou", "Kwonjoon Lee", "Teruhisa Misu", "Xin Eric Wang"], "published": "2023-10-09 17:10:35", "updated": "2023-10-09 17:10:35", "summary": "In our work, we explore the synergistic capabilities of pre-trained\nvision-and-language models (VLMs) and large language models (LLMs) for visual\ncommonsense reasoning (VCR). We categorize the problem of VCR into visual\ncommonsense understanding (VCU) and visual commonsense inference (VCI). For\nVCU, which involves perceiving the literal visual content, pre-trained VLMs\nexhibit strong cross-dataset generalization. On the other hand, in VCI, where\nthe goal is to infer conclusions beyond image content, VLMs face difficulties.\nWe find that a baseline where VLMs provide perception results (image captions)\nto LLMs leads to improved performance on VCI. However, we identify a challenge\nwith VLMs' passive perception, which often misses crucial context information,\nleading to incorrect or uncertain reasoning by LLMs. To mitigate this issue, we\nsuggest a collaborative approach where LLMs, when uncertain about their\nreasoning, actively direct VLMs to concentrate on and gather relevant visual\nelements to support potential commonsense inferences. In our method, named\nViCor, pre-trained LLMs serve as problem classifiers to analyze the problem\ncategory, VLM commanders to leverage VLMs differently based on the problem\nclassification, and visual commonsense reasoners to answer the question. VLMs\nwill perform visual recognition and understanding. We evaluate our framework on\ntwo VCR benchmark datasets and outperform all other methods that do not require\nin-domain supervised fine-tuning.", "comment": null, "links": []}
{"entry_id": "2309.06659", "title": "Beyond English: Centering Multilingualism in Data Visualization", "authors": ["Noëlle Rakotondravony", "Priya Dhawka", "Melanie Bancilhon"], "published": "2023-09-13 01:17:10", "updated": "2023-10-02 21:01:13", "summary": "Information visualization and natural language are intricately linked.\nHowever, the majority of research and relevant work in information and data\nvisualization (and human-computer interaction) involve English-speaking\npopulations as both researchers and participants, are published in English, and\nare presented predominantly at English-speaking venues. Although several\nsolutions can be proposed such as translating English texts in visualization to\nother languages, there is little research that looks at the intersection of\ndata visualization and different languages, and the implications that current\nvisualization practices have on non-English speaking communities. In this\nposition paper, we argue that linguistically diverse communities abound beyond\nthe English-speaking world and offer a richness of experiences for the\nvisualization research community to engage with. Through a case study of how\ntwo non-English languages interplay with data visualization reasoning in\nMadagascar, we describe how monolingualism in data visualization impacts the\nexperiences of underrepresented populations and emphasize potential harm to\nthese communities. Lastly, we raise several questions towards advocating for\nmore inclusive visualization practices that center the diverse experiences of\nlinguistically underrepresented populations.", "comment": "5 pages, 1 figure, Visualization for Social Good @VIS23", "links": []}
{"entry_id": "2308.16463", "title": "Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models", "authors": ["Yupan Huang", "Zaiqiao Meng", "Fangyu Liu", "Yixuan Su", "Nigel Collier", "Yutong Lu"], "published": "2023-08-31 05:15:27", "updated": "2023-10-02 03:31:17", "summary": "Large language models exhibit enhanced zero-shot performance on various tasks\nwhen fine-tuned with instruction-following data. Multimodal\ninstruction-following models extend these capabilities by integrating both text\nand images. However, existing models such as MiniGPT-4 face challenges in\nmaintaining dialogue coherence in scenarios involving multiple images. A\nprimary reason is the lack of a specialized dataset for this critical\napplication. To bridge these gaps, we present SparklesChat, a multimodal\ninstruction-following model for open-ended dialogues across multiple images. To\nsupport the training, we introduce SparklesDialogue, the first\nmachine-generated dialogue dataset tailored for word-level interleaved\nmulti-image and text interactions. Furthermore, we construct SparklesEval, a\nGPT-assisted benchmark for quantitatively assessing a model's conversational\ncompetence across multiple images and dialogue turns. Our experiments validate\nthe effectiveness of SparklesChat in understanding and reasoning across\nmultiple images and dialogue turns. Specifically, SparklesChat outperformed\nMiniGPT-4 on established vision-and-language benchmarks, including the BISON\nbinary image selection task and the NLVR2 visual reasoning task. Moreover,\nSparklesChat scored 8.56 out of 10 on SparklesEval, substantially exceeding\nMiniGPT-4's score of 3.91 and nearing GPT-4's score of 9.26. Qualitative\nevaluations further demonstrate SparklesChat's generality in handling\nreal-world applications. All resources are available at\nhttps://github.com/HYPJUDY/Sparkles.", "comment": "Reduced main content to 9 pages; typos corrected", "links": []}
diff --git a/arxiv_visual_reasoning.md b/arxiv_visual_reasoning.md
index d1df557..55e49bf 100644
--- a/arxiv_visual_reasoning.md
+++ b/arxiv_visual_reasoning.md
@@ -8,11 +8,279 @@ and is automatically generated by [update_arxiv.py](./tool/update_arxiv.py).
-Last update: 2024-02-01 08:02:27
+Last update: 2024-03-01 08:02:22
___
-## [Learning logic programs by discovering higher-order abstractions](https://arxiv.org/pdf/2308.08334) [New]
+## [Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories](https://arxiv.org/pdf/2308.10562) [New]
+
+*Delfina Sol Martinez Pandiani, Valentina Presutti*
+
+**Abstract:** The field of Computer Vision (CV) is increasingly shifting towards
+"high-level" visual sensemaking tasks, yet the exact nature of these tasks
+remains unclear and tacit. This survey paper addresses this ambiguity by
+systematically reviewing research on high-level visual understanding, focusing
+particularly on Abstract Concepts (ACs) in automatic image classification. Our
+survey contributes in three main ways: Firstly, it clarifies the tacit
+understanding of high-level semantics in CV through a multidisciplinary
+analysis and categorization into distinct clusters, including commonsense,
+emotional, aesthetic, and inductive interpretative semantics. Secondly, it
+identifies and categorizes computer vision tasks associated with high-level
+visual sensemaking, offering insights into the diverse research areas within
+this domain. Lastly, it examines how abstract concepts such as values and
+ideologies are handled in CV, revealing challenges and opportunities in
+AC-based image classification. Notably, our survey of AC image classification
+tasks highlights persistent challenges, such as the limited efficacy of massive
+datasets and the importance of integrating supplementary information and
+mid-level features. We emphasize the growing relevance of hybrid AI systems in
+addressing the multifaceted nature of AC image classification tasks. Overall,
+this survey enhances our understanding of high-level visual reasoning in CV and
+lays the groundwork for future research endeavors.
+
+**comment:** *Preprint*
+
+**published:** *2023-08-21 08:37:04*, **updated:** *2024-02-29 16:18:45*
+
+
+
+## [Visual Abductive Reasoning Meets Driving Hazard Prediction](https://arxiv.org/pdf/2310.04671) [New]
+
+*Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Masahiro Takahashi, Ryoma Niihara, Takayuki Okatani*
+
+**Abstract:** This paper addresses the problem of predicting hazards that drivers may
+encounter while driving a car. We formulate it as a task of anticipating
+impending accidents using a single input image captured by car dashcams. Unlike
+existing approaches to driving hazard prediction that rely on computational
+simulations or anomaly detection from videos, this study focuses on high-level
+inference from static images. The problem requires predicting and reasoning about
+future events based on uncertain observations, which falls under visual
+abductive reasoning. To enable research in this understudied area, a new
+dataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is
+created. The dataset consists of 15K dashcam images of street scenes, and each
+image is associated with a tuple containing car speed, a hypothesized hazard
+description, and visual entities present in the scene. These are annotated by
+human annotators, who identify risky scenes and provide descriptions of
+potential accidents that could occur a few seconds later. We present several
+baseline methods and evaluate their performance on our dataset, identifying
+remaining issues and discussing future directions. This study contributes to
+the field by introducing a novel problem formulation and dataset, enabling
+researchers to explore the potential of multi-modal AI for driving hazard
+prediction.
+
+**comment:** *Main Paper: 10 pages, Supplementary Materials: 28 pages*
+
+**published:** *2023-10-07 03:16:30*, **updated:** *2024-02-27 14:22:09*
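+
+The abstract describes each DHPR image as paired with a tuple of car speed, a hypothesized hazard description, and the visual entities in the scene. A minimal sketch of how such a record might be represented in code follows; the field names and values are illustrative assumptions, not the dataset's actual schema.
+
+```python
+from dataclasses import dataclass, field
+from typing import List
+
+@dataclass
+class DHPRRecord:
+    """Illustrative record for one DHPR-style example (field names are assumed)."""
+    image_path: str          # dashcam image of the street scene
+    car_speed_kmh: float     # ego-vehicle speed when the image was captured
+    hazard_description: str  # annotator's hypothesized hazard a few seconds later
+    visual_entities: List[str] = field(default_factory=list)  # entities referenced in the description
+
+# Toy usage with made-up values
+record = DHPRRecord(
+    image_path="scene_00001.jpg",
+    car_speed_kmh=42.0,
+    hazard_description="The pedestrian near the parked van may step into the lane.",
+    visual_entities=["pedestrian", "parked van"],
+)
+print(record.hazard_description)
+```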
+
+
+
+## [PALO: A Polyglot Large Multimodal Model for 5B People](https://arxiv.org/pdf/2402.14818) [New]
+
+*Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan*
+
+**Abstract:** In pursuit of more inclusive Vision-Language Models (VLMs), this study
+introduces a Large Multilingual Multimodal Model called PALO.
+PALO offers visual reasoning capabilities in 10 major languages,
+including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian,
+Urdu, and Japanese, that span a total of ~5B people (65% of the world
+population). Our approach uses semi-automated translation to
+adapt the multimodal instruction dataset from English to the target languages
+using a fine-tuned Large Language Model, thereby ensuring high linguistic
+fidelity while allowing scalability due to minimal manual effort. The
+incorporation of diverse instruction sets helps us boost overall performance
+across multiple languages, especially underrepresented ones such as
+Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three
+scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and
+scalability, and we observe substantial improvements over strong baselines. We also
+propose the first multilingual multimodal benchmark for the forthcoming
+approaches to evaluate their vision-language reasoning capabilities across
+languages. Code: https://github.com/mbzuai-oryx/PALO.
+
+**comment:** *Technical Report of PALO*
+
+**published:** *2024-02-22 18:59:58*, **updated:** *2024-02-22 18:59:58*
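+
+As a rough illustration of the semi-automated translation step described above (adapting an English multimodal instruction dataset into target languages with a fine-tuned LLM), here is a minimal sketch. The `translate_with_llm` callable and the dictionary keys are hypothetical stand-ins, not PALO's actual code; in practice a manual verification pass would follow, which is the "semi-automated" part.
+
+```python
+from typing import Callable, Dict, List
+
+TARGET_LANGUAGES = ["zh", "hi", "es", "fr", "ar", "bn", "ru", "ur", "ja"]
+
+def adapt_instruction_dataset(
+    english_samples: List[Dict[str, str]],
+    translate_with_llm: Callable[[str, str], str],
+) -> Dict[str, List[Dict[str, str]]]:
+    """Translate instruction/response pairs into each target language.
+
+    `translate_with_llm(text, lang)` is assumed to wrap a fine-tuned LLM call.
+    """
+    multilingual: Dict[str, List[Dict[str, str]]] = {}
+    for lang in TARGET_LANGUAGES:
+        multilingual[lang] = [
+            {
+                "instruction": translate_with_llm(s["instruction"], lang),
+                "response": translate_with_llm(s["response"], lang),
+                "image": s["image"],  # images are shared across languages
+            }
+            for s in english_samples
+        ]
+    return multilingual
+
+# Toy usage with a tagging "translator" as a placeholder
+demo = adapt_instruction_dataset(
+    [{"instruction": "Describe the image.", "response": "A cat on a mat.", "image": "cat.jpg"}],
+    translate_with_llm=lambda text, lang: f"[{lang}] {text}",
+)
+print(demo["hi"][0]["instruction"])
+```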
+
+
+
+## [Visual Reasoning in Object-Centric Deep Neural Networks: A Comparative Cognition Approach](https://arxiv.org/pdf/2402.12675) [New]
+
+*Guillermo Puebla, Jeffrey S. Bowers*
+
+**Abstract:** Achieving visual reasoning is a long-term goal of artificial intelligence. In
+the last decade, several studies have applied deep neural networks (DNNs) to
+the task of learning visual relations from images, with modest results in terms
+of generalization of the relations learned. However, in recent years,
+object-centric representation learning has been put forward as a way to achieve
+visual reasoning within the deep learning framework. Object-centric models
+attempt to model input scenes as compositions of objects and relations between
+them. To this end, these models use several kinds of attention mechanisms to
+segregate the individual objects in a scene from the background and from other
+objects. In this work we tested relation learning and generalization in several
+object-centric models, as well as a ResNet-50 baseline. In contrast to previous
+research, which has focused heavily on the same-different task in order to
+assess relational reasoning in DNNs, we use a set of tasks -- with varying
+degrees of difficulty -- derived from the comparative cognition literature. Our
+results show that object-centric models are able to segregate the different
+objects in a scene, even in many out-of-distribution cases. In our simpler
+tasks, this improves their capacity to learn and generalize visual relations in
+comparison to the ResNet-50 baseline. However, object-centric models still
+struggle in our more difficult tasks and conditions. We conclude that abstract
+visual reasoning remains an open challenge for DNNs, including object-centric
+models.
+
+**comment:** *16 pages, 14 figures*
+
+**published:** *2024-02-20 02:48:14*, **updated:** *2024-02-20 02:48:14*
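+
+The evaluation protocol described above (scoring object-centric models and a ResNet-50 baseline on relation tasks of varying difficulty, including out-of-distribution conditions) can be summarized by a loop of the following shape. The model and split names are placeholders for illustration, not the paper's code.
+
+```python
+from typing import Callable, Dict, List
+
+def evaluate_relation_generalization(
+    models: Dict[str, Callable[[str, str], float]],
+    tasks: Dict[str, List[str]],
+) -> Dict[str, Dict[str, float]]:
+    """Score each model on each task's evaluation splits.
+
+    `models[name](task_name, split)` is assumed to return accuracy on that
+    split; real experiments would train each model first.
+    """
+    results: Dict[str, Dict[str, float]] = {}
+    for model_name, run in models.items():
+        results[model_name] = {}
+        for task_name, splits in tasks.items():
+            for split in splits:
+                results[model_name][f"{task_name}/{split}"] = run(task_name, split)
+    return results
+
+# Toy usage with a dummy scorer standing in for real training and evaluation
+dummy = lambda task, split: 0.5
+scores = evaluate_relation_generalization(
+    models={"resnet50_baseline": dummy, "object_centric_model": dummy},
+    tasks={"same_different": ["in_distribution", "out_of_distribution"]},
+)
+print(scores["object_centric_model"]["same_different/out_of_distribution"])
+```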
+
+
+
+## [Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA](https://arxiv.org/pdf/2401.15847) [New]
+
+*Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Xinze Guan, Xin Eric Wang*
+
+**Abstract:** Multipanel images, commonly seen as web screenshots, posters, etc., pervade
+our daily lives. These images, characterized by their composition of multiple
+subfigures in distinct layouts, effectively convey information to people.
+Toward building advanced multimodal AI applications, such as agents that
+understand complex scenes and navigate through webpages, the skill of
+multipanel visual reasoning is essential, and a comprehensive evaluation of
+models in this regard is important. Therefore, we introduce Multipanel Visual
+Question Answering (MultipanelVQA), a novel benchmark comprising 6,600 triplets
+of questions, answers, and multipanel images that specifically challenge models
+in comprehending multipanel images. Our evaluation shows that questions in the
+MultipanelVQA benchmark pose significant challenges to the state-of-the-art
+Large Vision Language Models (LVLMs) tested, even though humans can attain
+approximately 99% accuracy on these questions. Distinctively, the
+MultipanelVQA benchmark features synthetically generated multipanel images
+specifically crafted to isolate and assess the impact of various factors, such
+as the layout, on LVLMs' multipanel image comprehension abilities. As a result,
+in addition to benchmarking the capabilities of LVLMs in understanding
+multipanel images, we analyze the potential causes for LVLMs' performance and
+offer insights for enhancement with the synthetic data. Code and data are
+released at https://sites.google.com/view/multipanelvqa/home.
+
+**published:** *2024-01-29 02:43:40*, **updated:** *2024-02-19 05:14:56*
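+
+The benchmark is described as 6,600 (question, answer, multipanel image) triplets used to probe LVLMs. A minimal sketch of an accuracy loop over such triplets follows; the `ask_model` callable and the field names are assumptions for illustration, not the released evaluation code.
+
+```python
+from typing import Callable, Dict, List
+
+def exact_match_accuracy(
+    triplets: List[Dict[str, str]],
+    ask_model: Callable[[str, str], str],
+) -> float:
+    """Fraction of triplets where the model's answer matches the reference.
+
+    Each triplet is assumed to carry 'image', 'question', and 'answer' keys;
+    `ask_model(image, question)` stands in for an LVLM call.
+    """
+    if not triplets:
+        return 0.0
+    correct = sum(
+        ask_model(t["image"], t["question"]).strip().lower() == t["answer"].strip().lower()
+        for t in triplets
+    )
+    return correct / len(triplets)
+
+# Toy usage with a trivial "model" that always answers "chihuahua"
+data = [{"image": "panel_grid.png", "question": "What is in the top-left panel?", "answer": "muffin"}]
+print(exact_match_accuracy(data, ask_model=lambda img, q: "chihuahua"))  # 0.0
+```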
+
+
+
+## [Visual In-Context Learning for Large Vision-Language Models](https://arxiv.org/pdf/2402.11574) [New]
+
+*Yucheng Zhou, Xiang Li, Qianning Wang, Jianbing Shen*
+
+**Abstract:** In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning
+(ICL) remains limited by challenges in cross-modal interactions and
+representation disparities. To overcome these challenges, we introduce a novel
+Visual In-Context Learning (VICL) method comprising Visual Demonstration
+Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented
+Demonstration Composition. Our approach retrieves images via a "Retrieval &
+Rerank" paradigm, summarises images with task intent and task-specific visual
+parsing, and composes language-based demonstrations that reduce token count and
+alleviate the cross-modal interaction problem. Experimental evaluations on five
+visual reasoning datasets demonstrate the effectiveness of our method.
+Moreover, our extensive experiments leverage information flow analysis to
+elucidate the effectiveness of our method, and investigate the impact of length
+and position of demonstrations for the LVLM. The use of in-context unlearning
+further shows promise in resetting specific model knowledge without retraining.
+
+**comment:** *13 pages, 7 figures*
+
+**published:** *2024-02-18 12:43:38*, **updated:** *2024-02-18 12:43:38*
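+
+The three-stage VICL pipeline described in the abstract (retrieve and rerank demonstration images, summarize them with task intent, compose language-based demonstrations) might be sketched roughly as follows. All function names are placeholders, not the authors' implementation.
+
+```python
+from typing import Callable, List
+
+def visual_in_context_prompt(
+    query_image: str,
+    candidate_images: List[str],
+    retrieve_and_rerank: Callable[[str, List[str], int], List[str]],
+    summarize_with_intent: Callable[[str], str],
+    k: int = 3,
+) -> str:
+    """Build a language-only prompt from retrieved demonstration images.
+
+    Summarizing demonstrations as text (rather than feeding extra images) is
+    how the abstract describes reducing token count and easing cross-modal
+    interaction; the callables here are assumed interfaces.
+    """
+    demos = retrieve_and_rerank(query_image, candidate_images, k)
+    demo_lines = [f"Demonstration {i + 1}: {summarize_with_intent(img)}" for i, img in enumerate(demos)]
+    return "\n".join(demo_lines + [f"Now answer the question about {query_image}."])
+
+# Toy usage with placeholder retrieval and summarization
+prompt = visual_in_context_prompt(
+    query_image="query.png",
+    candidate_images=["a.png", "b.png", "c.png", "d.png"],
+    retrieve_and_rerank=lambda q, cands, k: cands[:k],
+    summarize_with_intent=lambda img: f"summary of {img} focused on the task intent",
+)
+print(prompt)
+```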
+
+
+
+## [CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations](https://arxiv.org/pdf/2402.04236) [New]
+
+*Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang*
+
+**Abstract:** Vision-Language Models (VLMs) have demonstrated their widespread viability
+thanks to extensive training in aligning visual instructions to answers.
+However, this conclusive alignment leads models to ignore critical visual
+reasoning, and further results in failures on meticulous visual problems and
+unfaithful responses. In this paper, we propose Chain of Manipulations, a
+mechanism that enables VLMs to solve problems with a series of manipulations,
+where each manipulation refers to an operation on the visual input, either from
+intrinsic abilities (e.g., grounding) acquired through prior training or from
+imitating human-like behaviors (e.g., zoom in). This mechanism encourages VLMs
+to generate faithful responses with evidential visual reasoning, and permits
+users to trace error causes in the interpretable paths. We thus train CogCoM, a
+general 17B VLM with a memory-based compatible architecture endowed with this
+reasoning mechanism. Experiments show that our model achieves
+state-of-the-art performance across 8 benchmarks from 3 categories, and that
+a limited number of training steps on the data is enough to reach competitive
+performance. The code and data are publicly available at
+https://github.com/THUDM/CogCoM.
+
+**comment:** *17 pages, 7 figures*
+
+**published:** *2024-02-06 18:43:48*, **updated:** *2024-02-06 18:43:48*
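+
+A chain of manipulations, as described above, is a sequence of operations on the visual input (e.g., grounding a region, zooming in) interleaved with reasoning, with the trace kept for interpretability. A highly simplified sketch of such a loop follows; the operation set, stopping rule, and scripted decisions are assumptions, not CogCoM's actual implementation.
+
+```python
+from typing import Callable, Dict, List, Tuple
+
+def chain_of_manipulations(
+    image: str,
+    question: str,
+    propose_step: Callable[[str, str, List[str]], Tuple[str, str]],
+    manipulations: Dict[str, Callable[[str, str], str]],
+    max_steps: int = 4,
+) -> Tuple[str, List[str]]:
+    """Iteratively apply manipulations to the image until an answer is produced.
+
+    `propose_step` stands in for the VLM deciding the next manipulation (or
+    'answer'); `manipulations` maps names like 'zoom_in' / 'ground' to
+    functions returning a new view of the image. The trace is returned so the
+    reasoning path stays inspectable.
+    """
+    trace: List[str] = []
+    current_view = image
+    for _ in range(max_steps):
+        op, arg = propose_step(current_view, question, trace)
+        if op == "answer":
+            return arg, trace
+        current_view = manipulations[op](current_view, arg)
+        trace.append(f"{op}({arg}) -> {current_view}")
+    return "no answer within step budget", trace
+
+# Toy usage with scripted steps standing in for the model's decisions
+script = iter([("ground", "license plate"), ("zoom_in", "2x"), ("answer", "ABC-123")])
+answer, trace = chain_of_manipulations(
+    image="car.jpg",
+    question="What is on the license plate?",
+    propose_step=lambda view, q, t: next(script),
+    manipulations={
+        "ground": lambda view, region: f"{view}[crop:{region}]",
+        "zoom_in": lambda view, factor: f"{view}[zoom:{factor}]",
+    },
+)
+print(answer, trace)
+```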
+
+
+
+## [Neural networks for abstraction and reasoning: Towards broad generalization in machines](https://arxiv.org/pdf/2402.03507) [New]
+
+*Mikel Bober-Irizar, Soumya Banerjee*
+
+**Abstract:** For half a century, artificial intelligence research has attempted to
+reproduce the human qualities of abstraction and reasoning - creating computer
+systems that can learn new concepts from a minimal set of examples, in settings
+where humans find this easy. While specific neural networks are able to solve
+an impressive range of problems, broad generalisation to situations outside
+their training data has proved elusive. In this work, we look at several novel
+approaches for solving the Abstraction & Reasoning Corpus (ARC), a dataset of
+abstract visual reasoning tasks introduced to test algorithms on broad
+generalization. Despite three international competitions with $100,000 in
+prizes, the best algorithms still fail to solve a majority of ARC tasks and
+rely on complex hand-crafted rules, without using machine learning at all. We
+revisit whether recent advances in neural networks allow progress on this task.
+First, we adapt the DreamCoder neurosymbolic reasoning solver to ARC.
+DreamCoder automatically writes programs in a bespoke domain-specific language
+to perform reasoning, using a neural network to mimic human intuition. We
+present the Perceptual Abstraction and Reasoning Language (PeARL), which allows
+DreamCoder to solve ARC tasks, and propose a new recognition model that allows
+us to significantly improve on the previous best implementation. We
+also propose a new encoding and augmentation scheme that allows large language
+models (LLMs) to solve ARC tasks, and find that the largest models can solve
+some ARC tasks. LLMs are able to solve a different group of problems to
+state-of-the-art solvers, and provide an interesting way to complement other
+approaches. We perform an ensemble analysis, combining models to achieve better
+results than any system alone. Finally, we publish the arckit Python library to
+make future research on ARC easier.
+
+**comment:** *32 pages main text, 17 pages*
+
+**published:** *2024-02-05 20:48:57*, **updated:** *2024-02-05 20:48:57*
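+
+ARC tasks are small input/output grid pairs, and the abstract mentions a new encoding scheme that lets LLMs consume them. The idea can be illustrated with a minimal sketch; the row-per-line text encoding and the toy task below are generic assumptions and do not reproduce the paper's PeARL language or the arckit API.
+
+```python
+from typing import Dict, List
+
+Grid = List[List[int]]  # ARC grids are small matrices of color indices 0-9
+
+def encode_task_for_llm(task: Dict[str, List[Dict[str, Grid]]]) -> str:
+    """Serialize an ARC-style task as plain text, one grid row per line."""
+    def grid_to_text(grid: Grid) -> str:
+        return "\n".join(" ".join(str(cell) for cell in row) for row in grid)
+
+    parts = []
+    for i, pair in enumerate(task["train"]):
+        parts.append(f"Example {i + 1} input:\n{grid_to_text(pair['input'])}")
+        parts.append(f"Example {i + 1} output:\n{grid_to_text(pair['output'])}")
+    parts.append(f"Test input:\n{grid_to_text(task['test'][0]['input'])}")
+    return "\n\n".join(parts)
+
+# Toy task: the hidden transformation flips each grid horizontally
+toy_task = {
+    "train": [{"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]}],
+    "test": [{"input": [[0, 2], [0, 0]]}],
+}
+print(encode_task_for_llm(toy_task))
+```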
+
+
+
+## [Language-Conditioned Robotic Manipulation with Fast and Slow Thinking](https://arxiv.org/pdf/2401.04181) [New]
+
+*Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Zhengping Che, Chaomin Shen, Yaxin Peng, Dong Liu, Feifei Feng, Jian Tang*
+
+**Abstract:** Language-conditioned robotic manipulation aims to translate natural
+language instructions into executable actions, from simple pick-and-place to
+tasks requiring intent recognition and visual reasoning. Inspired by the dual
+process theory in cognitive science, which suggests two parallel systems of
+fast and slow thinking in human decision-making, we introduce Robotics with
+Fast and Slow Thinking (RFST), a framework that mimics human cognitive
+architecture to classify tasks and makes decisions on two systems based on
+instruction types. Our RFST consists of two key components: 1) an instruction
+discriminator to determine which system should be activated based on the
+current user instruction, and 2) a slow-thinking system composed of a
+fine-tuned vision language model aligned with the policy networks, which allows
+the robot to recognize user intention or perform reasoning tasks. To assess our
+methodology, we built a dataset featuring real-world trajectories, capturing
+actions ranging from spontaneous impulses to tasks requiring deliberate
+contemplation. Our results, both in simulation and real-world scenarios,
+confirm that our approach adeptly manages intricate tasks that demand intent
+recognition and reasoning. The project is available at
+https://jlm-z.github.io/RSFT/
+
+**comment:** *accepted to ICRA2024*
+
+**published:** *2024-01-08 19:00:32*, **updated:** *2024-02-01 08:32:33*
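+
+The two-component design described above (an instruction discriminator that routes each command to a fast policy or a slower VLM-backed reasoning system) could be sketched as a simple dispatcher. The keyword-based classifier and the two systems below are placeholders, not the RFST implementation.
+
+```python
+from typing import Callable
+
+def rfst_dispatch(
+    instruction: str,
+    needs_slow_thinking: Callable[[str], bool],
+    fast_policy: Callable[[str], str],
+    slow_reasoner: Callable[[str], str],
+) -> str:
+    """Route an instruction to the fast or slow system.
+
+    `needs_slow_thinking` plays the role of the instruction discriminator;
+    `slow_reasoner` stands in for the fine-tuned VLM aligned with the policy.
+    """
+    if needs_slow_thinking(instruction):
+        return slow_reasoner(instruction)
+    return fast_policy(instruction)
+
+# Toy usage: route anything mentioning "why" or "which" to the slow system
+keywords = ("why", "which", "reason")
+result = rfst_dispatch(
+    "Pick up the block that would make the tower most stable, and explain why.",
+    needs_slow_thinking=lambda s: any(k in s.lower() for k in keywords),
+    fast_policy=lambda s: "executing pick-and-place",
+    slow_reasoner=lambda s: "reasoning about intent before acting",
+)
+print(result)
+```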
+
+
+
+## [Learning logic programs by discovering higher-order abstractions](https://arxiv.org/pdf/2308.08334)
*Céline Hocquette, Sebastijan Dumančić, Andrew Cropper*
@@ -30,7 +298,7 @@ abstractions that transfer to multiple domains.
-## [Probabilistic Abduction for Visual Abstract Reasoning via Learning Rules in Vector-symbolic Architectures](https://arxiv.org/pdf/2401.16024) [New]
+## [Probabilistic Abduction for Visual Abstract Reasoning via Learning Rules in Vector-symbolic Architectures](https://arxiv.org/pdf/2401.16024)
*Michael Hersche, Francesco di Stefano, Thomas Hofmann, Abu Sebastian, Abbas Rahimi*
@@ -55,7 +323,7 @@ https://github.com/IBM/learn-vector-symbolic-architectures-rule-formulations.
-## [ChartBench: A Benchmark for Complex Visual Reasoning in Charts](https://arxiv.org/pdf/2312.15915) [New]
+## [ChartBench: A Benchmark for Complex Visual Reasoning in Charts](https://arxiv.org/pdf/2312.15915)
*Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, Jian Guo*
@@ -81,35 +349,7 @@ and providing valuable insights to encourage closer scrutiny of this aspect.
-## [Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA](https://arxiv.org/pdf/2401.15847) [New]
-
-*Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Xinze Guan, Xin Eric Wang*
-
-**Abstract:** Multipanel images, commonly seen as web screenshots, posters, etc., pervade
-our daily lives. These images, characterized by their composition of multiple
-subfigures in distinct layouts, effectively convey information to people.
-Toward building advanced multimodal AI applications, such as agents that
-understand complex scenes and navigate through webpages, the skill of
-multipanel visual reasoning is essential, and a comprehensive evaluation of
-models in this regard is important. Therefore, our paper introduces Multipanel
-Visual Question Answering (MultipanelVQA), a novel benchmark that specifically
-challenges models in comprehending multipanel images. The benchmark comprises
-6,600 questions and answers related to multipanel images. While these questions
-are straightforward for average humans, achieving nearly perfect correctness,
-they pose significant challenges to the state-of-the-art Large Vision Language
-Models (LVLMs) we tested. In our study, we utilized synthetically curated
-multipanel images specifically designed to isolate and evaluate the impact of
-diverse factors on model performance, revealing the sensitivity of LVLMs to
-various interferences in multipanel images, such as adjacent subfigures and
-layout complexity. As a result, MultipanelVQA highlights the need and direction
-for improving LVLMs' ability to understand complex visual-language contexts.
-Code and data are released at https://sites.google.com/view/multipanelvqa/home.
-
-**published:** *2024-01-29 02:43:40*, **updated:** *2024-01-29 02:43:40*
-
-
-
-## [ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models](https://arxiv.org/pdf/2401.13311) [New]
+## [ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models](https://arxiv.org/pdf/2401.13311)
*Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng*
@@ -137,7 +377,7 @@ https://con-textual.github.io/
-## [Look, Remember and Reason: Grounded reasoning in videos with language models](https://arxiv.org/pdf/2306.17778) [New]
+## [Look, Remember and Reason: Grounded reasoning in videos with language models](https://arxiv.org/pdf/2306.17778)
*Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, Roland Memisevic*
@@ -166,7 +406,7 @@ margin.
-## [Implicit Neural Image Stitching](https://arxiv.org/pdf/2309.01409) [New]
+## [Implicit Neural Image Stitching](https://arxiv.org/pdf/2309.01409)
*Minsu Kim, Jaewon Lee, Byeonghun Lee, Sunghoon Im, Kyong Hwan Jin*
@@ -189,7 +429,7 @@ https://github.com/minshu-kim/NIS.
-## [Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually](https://arxiv.org/pdf/2401.11035) [New]
+## [Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually](https://arxiv.org/pdf/2401.11035)
*Mazal Bethany, Brandon Wherry, Nishant Vishwamitra, Peyman Najafirad*
@@ -221,7 +461,7 @@ https://github.com/SecureAIAutonomyLab/ConditionalVLM
-## [Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift](https://arxiv.org/pdf/2212.08044) [New]
+## [Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift](https://arxiv.org/pdf/2212.08044)
*Jielin Qiu, Yi Zhu, Xingjian Shi, Florian Wenzel, Zhiqiang Tang, Ding Zhao, Bo Li, Mu Li*
@@ -250,7 +490,7 @@ project webpage: \url{https://MMRobustness.github.io}.
-## [Towards Generative Abstract Reasoning: Completing Raven's Progressive Matrix via Rule Abstraction and Selection](https://arxiv.org/pdf/2401.09966) [New]
+## [Towards Generative Abstract Reasoning: Completing Raven's Progressive Matrix via Rule Abstraction and Selection](https://arxiv.org/pdf/2401.09966)
*Fan Shi, Bin Li, Xiangyang Xue*
@@ -280,7 +520,7 @@ attributes.
-## [Enabling Collaborative Clinical Diagnosis of Infectious Keratitis by Integrating Expert Knowledge and Interpretable Data-driven Intelligence](https://arxiv.org/pdf/2401.08695) [New]
+## [Enabling Collaborative Clinical Diagnosis of Infectious Keratitis by Integrating Expert Knowledge and Interpretable Data-driven Intelligence](https://arxiv.org/pdf/2401.08695)
*Zhengqing Fang, Shuowen Zhou, Zhouhang Yuan, Yuxuan Si, Mengze Li, Jinxu Li, Yesheng Xu, Wenjia Xie, Kun Kuang, Yingming Li, Fei Wu, Yu-Feng Yao*
@@ -317,36 +557,7 @@ concerned.
-## [Language-Conditioned Robotic Manipulation with Fast and Slow Thinking](https://arxiv.org/pdf/2401.04181) [New]
-
-*Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Zhengping Che, Chaomin Shen, Yaxin Peng, Dong Liu, Feifei Feng, Jian Tang*
-
-**Abstract:** The language-conditioned robotic manipulation aims to transfer natural
-language instructions into executable actions, from simple pick-and-place to
-tasks requiring intent recognition and visual reasoning. Inspired by the dual
-process theory in cognitive science, which suggests two parallel systems of
-fast and slow thinking in human decision-making, we introduce Robotics with
-Fast and Slow Thinking (RFST), a framework that mimics human cognitive
-architecture to classify tasks and makes decisions on two systems based on
-instruction types. Our RFST consists of two key components: 1) an instruction
-discriminator to determine which system should be activated based on the
-current user instruction, and 2) a slow-thinking system that is comprised of a
-fine-tuned vision language model aligned with the policy networks, which allows
-the robot to recognize user intention or perform reasoning tasks. To assess our
-methodology, we built a dataset featuring real-world trajectories, capturing
-actions ranging from spontaneous impulses to tasks requiring deliberate
-contemplation. Our results, both in simulation and real-world scenarios,
-confirm that our approach adeptly manages intricate tasks that demand intent
-recognition and reasoning. The project is available at
-https://jlm-z.github.io/RSFT/
-
-**comment:** *submitted to ICRA2024*
-
-**published:** *2024-01-08 19:00:32*, **updated:** *2024-01-08 19:00:32*
-
-
-
-## [A Region-Prompted Adapter Tuning for Visual Abductive Reasoning](https://arxiv.org/pdf/2303.10428) [New]
+## [A Region-Prompted Adapter Tuning for Visual Abductive Reasoning](https://arxiv.org/pdf/2303.10428)
*Hao Zhang, Yeo Keat Ee, Basura Fernando*
@@ -384,7 +595,7 @@ rank on leaderboards (Comparison to Human Accuracy: RPA~31.74 vs CPT-CLIP
-## [Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers](https://arxiv.org/pdf/2401.01974) [New]
+## [Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers](https://arxiv.org/pdf/2401.01974)
*Aleksandar Stanić, Sergi Caelles, Michael Tschannen*
@@ -1739,36 +1950,6 @@ https://github.com/jshilong/GPT4RoI.
-## [Visual Abductive Reasoning Meets Driving Hazard Prediction: Problem Formulation and Dataset](https://arxiv.org/pdf/2310.04671)
-
-*Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Masahiro Takahashi, Ryoma Niihara, Takayuki Okatani*
-
-**Abstract:** This paper addresses the problem of predicting hazards that drivers may
-encounter while driving a car. We formulate it as a task of anticipating
-impending accidents using a single input image captured by car dashcams. Unlike
-existing approaches to driving hazard prediction that rely on computational
-simulations or anomaly detection from videos, this study focuses on high-level
-inference from static images. The problem needs predicting and reasoning about
-future events based on uncertain observations, which falls under visual
-abductive reasoning. To enable research in this understudied area, a new
-dataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is
-created. The dataset consists of 15K dashcam images of street scenes, and each
-image is associated with a tuple containing car speed, a hypothesized hazard
-description, and visual entities present in the scene. These are annotated by
-human annotators, who identify risky scenes and provide descriptions of
-potential accidents that could occur a few seconds later. We present several
-baseline methods and evaluate their performance on our dataset, identifying
-remaining issues and discussing future directions. This study contributes to
-the field by introducing a novel problem formulation and dataset, enabling
-researchers to explore the potential of multi-modal AI for driving hazard
-prediction.
-
-**comment:** *Main Paper: 10 pages, Supplementary Materials: 25 pages*
-
-**published:** *2023-10-07 03:16:30*, **updated:** *2023-10-10 02:31:24*
-
-
-
## [ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models](https://arxiv.org/pdf/2310.05872)
*Kaiwen Zhou, Kwonjoon Lee, Teruhisa Misu, Xin Eric Wang*