This repo includes the implementations described in the paper "What Is Missing in Multilingual Visual Reasoning and How to Fix It".
In this work, we explore various models, including GPT-4V, LLaVA, mBLIP, CCLM, and UNITERs, on NLVR2 and MaRVL. NLVR2 and MaRVL share the same task: reasoning about whether a natural language statement is true given a pair of images. An example is shown below.
Analyzing the failures of these models, we find that multilinguality, complex reasoning, and multimodality are three key aspects that make this task challenging for models. Based on our analysis, we explore and propose three interventions to address these challenges: translation, visual programming, and reasoning with captions. Our interventions achieve the best open-model performance on this task in a zero-shot setting, boosting LLaVA's performance by 13.4%, while also slightly improving GPT-4V's performance.
- Clone this repo.
- Download the data: all JSON files are already in `data/`, but you will need to download the images. For the NLVR2 images (around 15G), please fill out the Google Form provided by the NLVR2 group. The MaRVL images (around 2.5G) can be downloaded from the Dataverse portal.
- Run the experiments: we experiment with various models, including GPT-4V, LLaVA, mBLIP, CCLM, UNITERs, and ViLT. Each model requires a different environment, so we include a `README.md` file for each model in its directory, indicating how to reproduce the experiments.
Here, we briefly describe all experiments we run; more details can be found in our paper. We first evaluate models' performance on NLVR2 and MaRVL in zero-shot and finetuned settings, and then apply three interventions: translate-test, visual programming, and reasoning with captions.
In this setting, models are not specifically fine-tuned for the task of visual reasoning. Instead, we leverage the LMMs' zero-shot abilities by prompting them with instructions on how to solve the task. We evaluate the following models:
GPT-4V: Instructions on how to reproduce the results are provided in gpt4v/README.md.
LLaVA: Instructions on how to reproduce the results are provided in LLaVA/README.md.
mBLIP: Instructions on how to reproduce the results are provided in mBLIP/README.md.
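As a rough illustration of this zero-shot prompting setup, the sketch below queries GPT-4V on a single image pair through the OpenAI Python client. The prompt wording, model name, and helper functions are illustrative assumptions rather than our exact configuration; see gpt4v/README.md for the actual setup.

```python
# Minimal sketch of a zero-shot GPT-4V query on one NLVR2/MaRVL example.
# Prompt, model name, and paths are placeholders, not our exact configuration.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def encode_image(path: str) -> str:
    """Read an image file and return a base64 string for the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def zero_shot_predict(left_image: str, right_image: str, statement: str) -> str:
    """Ask the model whether the statement is true for the image pair."""
    prompt = (
        "You are given two images (left, then right) and a statement.\n"
        f"Statement: {statement}\n"
        "Answer with exactly one word: true or false."
    )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # any vision-capable GPT-4 model can be used
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(left_image)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(right_image)}"}},
            ],
        }],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip().lower()
```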
In this setting, models are finetuned on the English NLVR2 dataset and then tested on both NLVR2 and MaRVL. We evaluate the following models:
mBLIP: Instructions on how to reproduce the results are provided in mBLIP/README.md.
CCLM: Instructions on how to reproduce the results are provided in CCLM/README.md.
UNITERs: We evaluate xUNITER and mUNITER; instructions on how to reproduce the results are provided in uniters/README.md.
We perform three interventions. The pipelines for each can be found in the flow chart below.
Instead of evaluating models on multilingual text, we evaluate them on the MaRVL dataset with the multilingual text translated into English using the Google Translate API.
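As a sketch of this intervention, the snippet below translates MaRVL statements to English with the google-cloud-translate client before evaluation. The client choice, the JSONL file layout, and the `caption` field name are assumptions for illustration and may not match our actual scripts or the files in `data/`.

```python
# Translate-test sketch: translate each statement to English before evaluation.
# The "caption" field and JSONL layout are assumed here for illustration only.
import json
from google.cloud import translate_v2 as translate

client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set


def translate_statements(in_path: str, out_path: str) -> None:
    """Replace each example's statement with its English translation."""
    with open(in_path) as f:
        examples = [json.loads(line) for line in f]
    for ex in examples:
        result = client.translate(ex["caption"], target_language="en")
        ex["caption"] = result["translatedText"]
    with open(out_path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```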
We use the approach introduced in VisProg, which solves multimodal reasoning by using an LLM to generate visual programs that call off-the-shelf computer vision models for image processing at inference time.
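The toy sketch below conveys the idea but is not VisProg itself: a short program, which the LLM would normally generate from the statement, is interpreted step by step, dispatching each call to an off-the-shelf vision module. The program format, the ViLT VQA pipeline, and the parsing logic are all illustrative assumptions.

```python
# Toy visual-programming interpreter, for illustration only (not VisProg).
# In the real pipeline, PROGRAM is generated by an LLM from the statement.
import re
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

PROGRAM = """
ANSWER0=VQA(image=LEFT,question=How many dogs are in the image?)
ANSWER1=VQA(image=RIGHT,question=How many dogs are in the image?)
FINAL=EVAL(expr='{ANSWER0}' == '2' and '{ANSWER1}' == '2')
"""


def run_program(program: str, left_path: str, right_path: str) -> bool:
    """Execute each program line, storing intermediate results in env."""
    env = {"LEFT": Image.open(left_path), "RIGHT": Image.open(right_path)}
    for line in filter(None, (ln.strip() for ln in program.splitlines())):
        target, call = line.split("=", 1)
        if call.startswith("VQA"):
            img_name, question = re.match(r"VQA\(image=(\w+),question=(.+)\)", call).groups()
            env[target] = vqa(image=env[img_name], question=question)[0]["answer"]
        elif call.startswith("EVAL"):
            expr = re.match(r"EVAL\(expr=(.+)\)", call).group(1)
            env[target] = eval(expr.format(**env))  # plug earlier answers into the expression
    return env["FINAL"]
```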
We first caption both images and then use an LLM to reason about the statement given the two captions, rather than the two images themselves.
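A minimal sketch of this pipeline, assuming a BLIP captioner and an OpenAI chat model as stand-ins for the components we actually use:

```python
# Reasoning-with-captions sketch: caption each image, then let a text-only LLM
# judge the statement. The captioner and LLM choices are illustrative.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from openai import OpenAI

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
llm = OpenAI()


def caption(path: str) -> str:
    """Generate a short caption for one image."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)


def predict_from_captions(left_path: str, right_path: str, statement: str) -> str:
    """Judge the statement using only the two captions."""
    prompt = (
        f"Left image: {caption(left_path)}\n"
        f"Right image: {caption(right_path)}\n"
        f"Statement: {statement}\n"
        "Based only on the two image descriptions, is the statement true or false? "
        "Answer with one word."
    )
    response = llm.chat.completions.create(
        model="gpt-4",  # any instruction-following LLM can be substituted here
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip().lower()
```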
We upload all our experimental results to Zeno, the platform we use to visualize them. An example visualization is shown below.
Note: Our paper is under anonymous review. Since the Zeno links are not anonymized, we will post them after the review period.
Our paper is under anonymous review. We will post the citation post-review.