This repo includes the implementations described in the paper "What Is Missing in Multilingual Visual Reasoning and How to Fix It".
In this work, we explore various models, including GPT-4V, LLaVA, mBLIP, CCLM, and UNITERs, on NLVR2 and MaRVL. NLVR2 and MaRVL share the same task: reasoning about whether a natural language statement is true given a pair of images. An example is shown below.
Analyzing the failures of these models, we find that multilinguality, complex reasoning, and multimodality are three key aspects that make this task challenging for models. Based on our analysis, we explore and propose three interventions to address these challenges: translation, visual programming, and reasoning with captions. Our interventions achieve the best open-model performance on this task in a zero-shot setting, boosting LLaVA's performance by 13.4%, while also slightly improving GPT-4V's performance.
- Clone this repo.
- Download the data: all JSON files are already in `data/`, but you will need to download the images. For the NLVR2 images (around 15G), please fill out the Google Form provided by the NLVR2 group. The MaRVL images (around 2.5G) can be downloaded from the Dataverse portal.
- Run the experiments: we experiment with various models, including GPT-4V, LLaVA, mBLIP, CCLM, UNITERs, and ViLT. Each model requires a different environment, so we include a `README.md` file for each model in its directory, indicating how to reproduce the experiments.
Here, we briefly describe all experiments we run; more details can be found in our paper. We first evaluate models' performance on NLVR2 and MaRVL in zero-shot and finetuned settings, and then apply three interventions: translate-test, visual programming, and reasoning with captions.
In this setting, models are not specifically fine-tuned for the task of visual reasoning. Instead, we leverage the LMMs' zero-shot abilities by prompting them with instructions on how to solve the task. We evaluate the following models:
GPT-4V: Instructions on how to reproduce the results are provided in gpt4v/README.md.
LLaVA: Instructions on how to reproduce the results are provided in LLaVA/README.md.
mBLIP: Instructions on how to reproduce the results are provided in mBLIP/README.md.
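As a rough illustration of this zero-shot prompting setup, the sketch below queries GPT-4V on a single image pair through the OpenAI Python client. The prompt wording, model name, and helper functions are illustrative assumptions rather than our exact configuration; see gpt4v/README.md for the actual setup.

```python
# Minimal sketch of a zero-shot GPT-4V query on one NLVR2/MaRVL example.
# Prompt, model name, and paths are placeholders, not our exact configuration.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def encode_image(path: str) -> str:
    """Read an image file and return a base64 string for the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def zero_shot_predict(left_image: str, right_image: str, statement: str) -> str:
    """Ask the model whether the statement is true for the image pair."""
    prompt = (
        "You are given two images (left, then right) and a statement.\n"
        f"Statement: {statement}\n"
        "Answer with exactly one word: true or false."
    )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # any vision-capable GPT-4 model can be used
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(left_image)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(right_image)}"}},
            ],
        }],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip().lower()
```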
In this setting, models are finetuned on the English NLVR2 dataset and then tested on both NLVR2 and MaRVL. We evaluate the following models:
mBLIP: Instructions on how to reproduce the results are provided in mBLIP/README.md.
CCLM: Instructions on how to reproduce the results are provided in CCLM/README.md.
UNITERs: We evaluate xUNITER and mUNITER; instructions on how to reproduce the results are provided in uniters/README.md.
We perform three interventions. The pipelines for each can be found in the flow chart below.
Instead of evaluating models on multilingual text, we evaluate them on the MaRVL dataset with the multilingual text translated into English using the Google Translate API.
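As a sketch of this intervention, the snippet below translates MaRVL statements to English with the google-cloud-translate client before evaluation. The client choice, the JSONL file layout, and the `caption` field name are assumptions for illustration and may not match our actual scripts or the files in `data/`.

```python
# Translate-test sketch: translate each statement to English before evaluation.
# The "caption" field and JSONL layout are assumed here for illustration only.
import json
from google.cloud import translate_v2 as translate

client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set


def translate_statements(in_path: str, out_path: str) -> None:
    """Replace each example's statement with its English translation."""
    with open(in_path) as f:
        examples = [json.loads(line) for line in f]
    for ex in examples:
        result = client.translate(ex["caption"], target_language="en")
        ex["caption"] = result["translatedText"]
    with open(out_path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```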
We use the approach introduced in VisProg, which solves multimodal reasoning by using an LLM to generate visual programs that call off-the-shelf computer vision models for image processing at inference time.
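The toy sketch below conveys the idea but is not VisProg itself: a short program, which the LLM would normally generate from the statement, is interpreted step by step, dispatching each call to an off-the-shelf vision module. The program format, the ViLT VQA pipeline, and the parsing logic are all illustrative assumptions.

```python
# Toy visual-programming interpreter, for illustration only (not VisProg).
# In the real pipeline, PROGRAM is generated by an LLM from the statement.
import re
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

PROGRAM = """
ANSWER0=VQA(image=LEFT,question=How many dogs are in the image?)
ANSWER1=VQA(image=RIGHT,question=How many dogs are in the image?)
FINAL=EVAL(expr='{ANSWER0}' == '2' and '{ANSWER1}' == '2')
"""


def run_program(program: str, left_path: str, right_path: str) -> bool:
    """Execute each program line, storing intermediate results in env."""
    env = {"LEFT": Image.open(left_path), "RIGHT": Image.open(right_path)}
    for line in filter(None, (ln.strip() for ln in program.splitlines())):
        target, call = line.split("=", 1)
        if call.startswith("VQA"):
            img_name, question = re.match(r"VQA\(image=(\w+),question=(.+)\)", call).groups()
            env[target] = vqa(image=env[img_name], question=question)[0]["answer"]
        elif call.startswith("EVAL"):
            expr = re.match(r"EVAL\(expr=(.+)\)", call).group(1)
            env[target] = eval(expr.format(**env))  # plug earlier answers into the expression
    return env["FINAL"]
```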
We first caption both images and then use an LLM to reason about the statement given the two captions, rather than the two images themselves.
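A minimal sketch of this pipeline, assuming a BLIP captioner and an OpenAI chat model as stand-ins for the components we actually use:

```python
# Reasoning-with-captions sketch: caption each image, then let a text-only LLM
# judge the statement. The captioner and LLM choices are illustrative.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from openai import OpenAI

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
llm = OpenAI()


def caption(path: str) -> str:
    """Generate a short caption for one image."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)


def predict_from_captions(left_path: str, right_path: str, statement: str) -> str:
    """Judge the statement using only the two captions."""
    prompt = (
        f"Left image: {caption(left_path)}\n"
        f"Right image: {caption(right_path)}\n"
        f"Statement: {statement}\n"
        "Based only on the two image descriptions, is the statement true or false? "
        "Answer with one word."
    )
    response = llm.chat.completions.create(
        model="gpt-4",  # any instruction-following LLM can be substituted here
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip().lower()
```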
We upload all our experimental results to Zeno, the platform we use to visualize them. An example visualization is shown below.
Note: Our paper is under anonymous review. Since the Zeno links are not anonymized, we will post them after the review period.
Our paper is under anonymous review. We will post the citation post-review.