
VISCO

Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning

🌐 Project | 🤗 Dataset | 📖 Paper


Outline:

  • Introduction
  • Data
  • Environment
  • Run models
  • Run evaluation
  • Citation

Introduction

VISCO is a benchmark for evaluating the critique and correction capabilities of LVLMs. VISCO contains:

  1. 1645 pairs of questions and LVLM-generated answers. Each answer includes a chain-of-thought with multiple reasoning steps.
  2. 5604 step-wise annotations of critique, showing whether each step is correct or incorrect, along with a natural language explanation if a step is incorrect.

Overview

(Figure: overview of the VISCO critique and correction tasks)

Data

Example

(Figure: an annotated data example)

Please download our data from here. Because of the variable data types, the Hugging Face dataset viewer does not render the data properly; please use browse-data.ipynb to browse the data instead.

Each line in test.jsonl corresponds to one data point, with a question, an LVLM-generated answer, and critique annotations for each step. For the data example shown above, the JSON representation is:

{
  "id": "test1||Reasoning||Math||MathVista||5",
  "question": "What is the center of symmetry of this function?\nChoices:\n(A) (0, 0)\n(B) (-1, 0)\n(C) (2, 0)",
  "image": IMAGE_BASE64_REPR,
  "label": "(0, 0)",
  "response": {
    "reasoning": [
      "A function with a center of symmetry at a point (a, b) implies the graph of the function is symmetric when rotated 180 degrees about this point.",
      "Observing the graph, it appears that this symmetry exists around the point (2, 0).",
      "Points on opposite sides of (2, 0) have values equidistant from this center, confirming symmetry around this point."
    ],
    "answer": "(2, 0)"
  },
  "answer_correctness": false,
  "reasoning_correctness": [true, false, false],
  "reasoning_critic": [
    ["", "", ""],
    ["The symmetry exists around (0, 0), not (2, 0).", SECOND_REFERENCE, THIRD_REFERENCE],
    ["Points on opposite sides of (2, 0) do not have values equidistant from this center.", SECOND_REFERENCE, THIRD_REFERENCE],
  ],
  "meta_data": {
    "critic_superskill": "Reasoning",
    "critic_skill": "Math",
    "src_dataset": "MathVista",
    "src_model": "GPT-4o",
    ...META_DATA_FOR_ORIGINAL_DATASET
  }
}

Notes:

  • The field response is the answer generated by the LVLM. It includes a chain-of-thought (field reasoning) and the final answer (field answer).
  • The critique annotations include three parts: the binary critique of the final answer (answer_correctness), the binary critique of each step (reasoning_correctness), and the natural language critique of each step (reasoning_critic).
  • For each step, we provide three references produced by three different annotators. All references are considered in the final evaluation.
  • Natural language critiques are provided only for incorrect steps.
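
As a quick sanity check, the snippet below is a minimal sketch of reading test.jsonl and printing the step-wise annotations. The field names follow the example above; treating the image field as a base64-encoded string is an assumption, and browse-data.ipynb remains the recommended way to inspect the data.

import base64
import json

# Minimal sketch: iterate over test.jsonl and print the step-wise critiques.
# Field names follow the data example above; treating `image` as a base64
# string is an assumption (see browse-data.ipynb for the intended viewer).
with open("test.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        print(ex["id"], "| final answer correct:", ex["answer_correctness"])
        for i, step in enumerate(ex["response"]["reasoning"]):
            ok = ex["reasoning_correctness"][i]
            print(f"  step {i + 1}: {'correct' if ok else 'incorrect'} | {step}")
            if not ok:
                # Three reference critiques per step, one from each annotator.
                print("    references:", ex["reasoning_critic"][i])
        image_bytes = base64.b64decode(ex["image"])  # assumption about the encoding
        break  # drop this to walk the full file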

Environment

To install the minimal requirements:

pip install -r requirements.txt

Note, however, that these requirements do not cover fast serving frameworks such as vllm, lmdeploy, and sglang. To install these packages, first install PyTorch and then follow each framework's documentation to install its latest version. If you want to use more than one fast serving framework, we recommend maintaining a separate environment for each, since their dependencies may conflict.

Run models

Download the data from Hugging Face and put test.jsonl in this directory. Then, use the following scripts:

Critique

Run python infer_critique.py --input test.jsonl --output OUTPUT_FILE

  • If you're using proprietary LVLMs such as OpenAI, Anthropic, and Gemini models, use --model XXX to specify the model and --api_key to provide your API key. The proprietary models we test include gpt-4o-2024-08-06, claude-3-5-sonnet-20240620, and gemini-1.5-pro.
  • If you're using open LVLMs, you can launch an OpenAI-compatible server locally and then use the same script. In that case, set --base_url to your server URL, --model to your model name (or auto), and --api_key to your API key. An example of how to launch an OpenAI-compatible server with vllm is here.
  • Alternatively, you can specify --model XXX and add the --launch_locally BACKEND argument, so the script automatically launches a server and sends requests to it. Supported backends include lmdeploy, vllm, and sglang. Note that this requires the backend packages to be properly installed first. The framework we use to evaluate each model is as follows:
  Framework   Model(s)
  vllm        Qwen2-VL, Molmo, Llama-3.2, NVLM
  lmdeploy    InternVL2, DeepSeek-VL, LLaVA-v1.5, LLaVA-v1.6, Qwen-VL, Prometheus-Vision
  sglang      LLaVA-OV, LLaVA-Critic
  • If you want to use your custom inference code, please rewrite def infer in utils.py.
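
For reference, below is a hypothetical sketch of what such a custom infer function could look like when backed by an OpenAI-compatible endpoint; the signature expected by utils.py, the base URL, and the message format are assumptions and may differ from the actual code.

import base64
from openai import OpenAI

# Hypothetical sketch of a custom inference function; the signature expected by
# utils.py may differ. Assumes an OpenAI-compatible server (e.g. launched with
# vllm, lmdeploy, or sglang) is already running at the base URL below.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def infer(prompt: str, image_bytes: bytes, model: str) -> str:
    image_b64 = base64.b64encode(image_bytes).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        temperature=0.0,
    )
    return response.choices[0].message.content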

Critique with LookBack

Run python infer_critique_lookback.py --input test.jsonl --output OUTPUT_FILE. The other arguments are the same as for infer_critique.py. Note that this script is slower and makes more API calls, so remember to monitor your API usage.

Correction

Run python infer_correction.py --input test.jsonl --output OUTPUT_FILE

  • For correction with human critique, use argument --critique human. The script will use the critique annotations in test.jsonl.
  • For correction with model-generated critique, use argument --critique CRITIQUE_FILE, where CRITIQUE_FILE is the output file generated by infer_critique.py.
  • By default, the correction script uses the full critique, including the answer-level, step-level, and explanation-level critique. If you only want to use coarser-grained critique, set --critique_setting A to use only the answer-level binary critique, or --critique_setting AS to use only the answer-level and step-level binary critiques (see the sketch below).

The other arguments are the same as infer_critique.py.
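
To make the three granularities concrete, the helper below is a hypothetical illustration of how the annotation fields could be flattened into critique text for the A, AS, and full settings; the actual prompt construction lives in infer_correction.py and may differ.

# Hypothetical illustration of the three critique granularities; the actual
# prompt format is defined by infer_correction.py, not by this sketch.
def format_critique(example: dict, setting: str = "full") -> str:
    # Answer-level binary critique (setting "A" and above).
    lines = [f"Final answer correct: {example['answer_correctness']}"]
    if setting in ("AS", "full"):
        # Step-level binary critique (setting "AS" and above).
        for i, ok in enumerate(example["reasoning_correctness"]):
            lines.append(f"Step {i + 1}: {'correct' if ok else 'incorrect'}")
            if setting == "full" and not ok:
                # Explanation-level critique: the first reference annotation.
                lines.append(f"  Critique: {example['reasoning_critic'][i][0]}")
    return "\n".join(lines)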

Run evaluation

Critique

First, run LLM-assisted evaluation of explanation-level F1:

python gpt_evaluate.py YOUR_OUTPUT_FILE --input test.jsonl

Remember to set the environment variable OPENAI_API_KEY so the script can access the OpenAI API. The evaluation results will be saved to a cache file, YOUR_OUTPUT_FILE.gpt_evaluate_cache.

Then, run evaluate.py to calculate the full set of metrics, including VISCore:

python evaluate.py YOUR_OUTPUT_FILE --input test.jsonl --task critique
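
If you prefer to drive both evaluation steps from Python, a small wrapper like the one below works; it simply mirrors the two commands above, and the output path and API key are placeholders.

import os
import subprocess

# Convenience wrapper around the two critique evaluation commands above.
OUTPUT_FILE = "outputs/critique_predictions.jsonl"  # placeholder path

os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # needed by gpt_evaluate.py
subprocess.run(
    ["python", "gpt_evaluate.py", OUTPUT_FILE, "--input", "test.jsonl"],
    check=True,
)
subprocess.run(
    ["python", "evaluate.py", OUTPUT_FILE, "--input", "test.jsonl", "--task", "critique"],
    check=True,
)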

Correction

python evaluate.py YOUR_OUTPUT_FILE --input test.jsonl --task correction

Citation

Please cite our paper if this repository inspires your work!

@misc{wu2024viscobenchmarkingfinegrainedcritique,
      title={VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning}, 
      author={Xueqing Wu and Yuheng Ding and Bingxuan Li and Pan Lu and Da Yin and Kai-Wei Chang and Nanyun Peng},
      year={2024},
      eprint={2412.02172},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.02172}, 
}
