VDebugger: Harnessing Execution Feedback for Debugging Visual Programs, EMNLP Findings 2024
Paper, Website, Models and Data
- Environment Setup
- Dataset Setup
- Generation and Execution of Visual Programs
- Inference of VDebugger
- Error Injection
This code is partially adapted from ViperGPT. We sincerely thank the authors of ViperGPT for their great work!
To set up the environment:
- Clone recursively:
git clone --recurse-submodules [email protected]:shirley-wu/vdebugger.git
- Install PyTorch based on your own environment. We installed torch==2.1.2 with CUDA 12.1.
- Install dependencies:
pip install -r requirements.txt
- Set up the ViperGPT environment:
cd viper
bash download_models.sh
export PATH=/usr/local/cuda/bin:$PATH
cd GLIP
python setup.py clean --all build develop --user
- If you need to use the OpenAI APIs, write your API key into
viper/api.key
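As a convenience, the key file can also be created from a script. The helper below is only a sketch: the function name `write_api_key` is hypothetical, and it assumes the file should contain the bare key and nothing else.

```python
import stat
from pathlib import Path


def write_api_key(key: str, path: str = "viper/api.key") -> Path:
    """Write an API key to `path` and restrict the file to the current user.

    Assumption: the file holds only the key text (no trailing newline).
    """
    key = key.strip()
    if not key:
        raise ValueError("API key is empty")
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(key)
    # Owner read/write only; this has limited effect on Windows.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

For example, `write_api_key(os.environ["OPENAI_API_KEY"])` run from the repository root would create the file, assuming you keep the key in an environment variable of that name.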
Please follow the guidelines below to download each dataset:
- GQA: https://cs.stanford.edu/people/dorarad/gqa/download.html. The file structure should look as follows:
gqa/
├── questions/
│ ├── readme.txt
│ ├── {val, test, testdev, challenge}_{all, balanced}_questions.json
│ ├── submission_all_questions.json
│ ├── train_balanced_questions.json
│ ├── train_all_questions/
└── images/
└── *.jpg
- TallyQA: https://github.com/manoja328/TallyQA_dataset. The file structure should look as follows:
tallyqa/
├── {test, train}.json
└── {train2014, val2014, VG_100K, VG_100K_2}/
└── *.jpg
- NLVRv2: https://github.com/lil-lab/nlvr/tree/master/nlvr2. The file structure should look as follows:
nlvr2/
├── balanced_{dev, test1, test2, train}.jsonl
└── {dev, test1, test2, train}/
└── *.png
- RefCOCO*: https://github.com/lichengunc/refer. The file structure should look as follows:
refer/
├── refcoco/
│ ├── instances.json
│ ├── refs(google).p
│ └── refs(unc).p
├── refcoco+/
│ ├── instances.json
│ └── refs(unc).p
├── refcocog/
│ ├── instances.json
│ ├── refs(google).p
│ └── refs(umd).p
└── {train2014, train2017, val2014, val2017}/
└── *.jpg
- COVR: https://covr-dataset.github.io/. The file structure should look as follows:
covr/
├── {train, val, test}.jsonl
├── gqa_images/
│ └── *.jpg
└── imSitu_images/
└── {adjusting, ...}/
└── *.jpg
- RSVG: https://github.com/ZhanYang-nwpu/RSVG-pytorch. The file structure should look as follows:
rsvg/
├── {train, val, test}.txt
├── Annotations/
│ └── *.xml
└── JPEGImages/
└── *.jpg
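Before running anything, it can help to check that the downloaded datasets match the layouts above. The following is a minimal sketch, not part of the repository: it assumes all datasets sit under one parent directory using the top-level folder names shown above, and it checks only a few representative entries per dataset.

```python
from pathlib import Path
from typing import Dict, List

# A few entries each dataset root should contain, taken from the trees above.
EXPECTED: Dict[str, List[str]] = {
    "gqa": ["questions", "images"],
    "tallyqa": ["test.json", "train.json"],
    "nlvr2": ["balanced_dev.jsonl", "balanced_train.jsonl", "dev", "train"],
    "refer": [
        "refcoco/instances.json",
        "refcoco+/instances.json",
        "refcocog/instances.json",
    ],
    "covr": ["gqa_images", "imSitu_images"],
    "rsvg": ["Annotations", "JPEGImages"],
}


def missing_entries(data_dir: str) -> Dict[str, List[str]]:
    """Return, per dataset, the expected entries not found under data_dir."""
    root = Path(data_dir)
    report: Dict[str, List[str]] = {}
    for dataset, entries in EXPECTED.items():
        missing = [e for e in entries if not (root / dataset / e).exists()]
        if missing:
            report[dataset] = missing
    return report
```

An empty result from `missing_entries("YOUR_DATA_DIR")` means every checked entry exists. Datasets you do not plan to use will simply show up as missing.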
Go to viper/ for this step. We recommend generating and then executing the visual programs as two separate steps. Take the GQA dataset as an example:
- Generate programs:
CONFIG_NAMES=generate/gqa python main_batch_generate.py
This script loads the configuration in config/generate/gqa.yaml. Remember to replace YOUR_DATA_DIR with your data directory. The generated code is saved in the code field of a CSV file.
- Execute and evaluate programs:
CONFIG_NAMES=execute/gqa python main_batch_execute.py
This script loads the configuration in config/execute/gqa.yaml. Again, remember to replace YOUR_DATA_DIR, and set the cached_codex_path field to the CSV file produced in step 1. The accuracy / IoU will be computed.
- If you want to obtain execution feedback:
CONFIG_NAMES=execute/gqa python main_batch_trace.py A_RANDOM_STAMP
You can use the same configuration as in step 2. If you want to run multiple main_batch_trace.py processes at the same time, use a different A_RANDOM_STAMP for each process. The execution feedback is saved in the traced field of a CSV file.
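The three steps above can also be driven from Python. This is a rough sketch rather than a supported entry point: the helper names are hypothetical, and it assumes other datasets follow the same generate/<dataset> and execute/<dataset> config naming as GQA.

```python
import os
import subprocess
from typing import Dict, List, Optional, Tuple

# Script and config group for each step; tracing reuses the execute config.
STEPS = {
    "generate": ("main_batch_generate.py", "generate"),
    "execute": ("main_batch_execute.py", "execute"),
    "trace": ("main_batch_trace.py", "execute"),
}


def viper_command(
    step: str, dataset: str, stamp: Optional[str] = None
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for one step; meant to be run from inside viper/."""
    script, config_group = STEPS[step]
    argv = ["python", script]
    if step == "trace":
        if not stamp:
            raise ValueError("trace needs a stamp unique to this process")
        argv.append(stamp)
    # CONFIG_NAMES selects the YAML file under config/.
    env = dict(os.environ, CONFIG_NAMES=f"{config_group}/{dataset}")
    return argv, env


def run_step(step: str, dataset: str, stamp: Optional[str] = None,
             cwd: str = "viper") -> None:
    """Run one step and raise if the script exits with an error."""
    argv, env = viper_command(step, dataset, stamp)
    subprocess.run(argv, env=env, cwd=cwd, check=True)
```

Calling `run_step("generate", "gqa")` corresponds to step 1. The YAML configs still need YOUR_DATA_DIR and cached_codex_path edited by hand before steps 2 and 3.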
For inference with VDebugger, first generate and execute visual programs to obtain a CSV file containing the traced field. Then go to vdebugger/. Take the GQA dataset and VDebugger/VDebugger-{critic, refiner}-generalist-13B as an example:
# Step 1: infer critic
python infer_critic.py VDebugger/VDebugger-critic-generalist-13B --input YOUR_CSV_CONTAINING_TRACED_FIELD --dataset gqa # output file will be written to critic-infer.csv
# Step 2: infer refiner
python infer_refine.py critic-infer.csv VDebugger/VDebugger-refiner-generalist-13B # output file will be written to critic-refine-infer.csv
Then you can execute the programs in critic-refine-infer.csv as in step 2 of Generation and Execution of Visual Programs.
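Since the critic needs the traced field, a quick pre-flight check on the input CSV can save a failed run. The sketch below uses hypothetical helper names; the command lines mirror the two steps above, with the default output file names stated there.

```python
import csv
from typing import List

CRITIC = "VDebugger/VDebugger-critic-generalist-13B"
REFINER = "VDebugger/VDebugger-refiner-generalist-13B"


def has_field(csv_path: str, field: str = "traced") -> bool:
    """Return True if the CSV header contains `field`."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f), [])
    return field in header


def debugger_commands(traced_csv: str, dataset: str,
                      critic: str = CRITIC,
                      refiner: str = REFINER) -> List[List[str]]:
    """Build the critic and refiner commands; run them from vdebugger/."""
    return [
        ["python", "infer_critic.py", critic,
         "--input", traced_csv, "--dataset", dataset],
        # The critic writes critic-infer.csv, which the refiner consumes.
        ["python", "infer_refine.py", "critic-infer.csv", refiner],
    ]
```

Each returned command can be passed to `subprocess.run(cmd, check=True, cwd="vdebugger")` once `has_field` confirms the input is usable.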
To run VDebugger for T iterations (T > 1), first generate the initial programs and collect their execution feedback as in steps 1 and 3 of Generation and Execution of Visual Programs. Then repeat the steps below T times:
- Infer critic, as in Inference of VDebugger;
- Infer refiner, as in Inference of VDebugger;
- Collect execution feedback for the new programs generated by the refiner, as in step 3 of Generation and Execution of Visual Programs. The next iteration runs on top of the feedback collected in this step.
After T iterations, evaluate the final programs as in step 2 of Generation and Execution of Visual Programs.
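The control flow of the iterative procedure can be sketched as a small loop. The three callables are hypothetical wrappers that you would write around infer_critic.py, infer_refine.py, and main_batch_trace.py; each takes a CSV path and returns the path of the CSV it produced.

```python
from typing import Callable

CsvStep = Callable[[str], str]  # takes a CSV path, returns the output CSV path


def debug_loop(traced_csv: str, num_iters: int,
               critic: CsvStep, refiner: CsvStep, trace: CsvStep) -> str:
    """Run critic -> refiner -> trace for num_iters rounds.

    Returns the traced CSV of the final round, whose programs can then
    be evaluated as in step 2 of Generation and Execution of Visual Programs.
    """
    current = traced_csv
    for _ in range(num_iters):
        critic_csv = critic(current)       # step 1: infer critic
        refined_csv = refiner(critic_csv)  # step 2: infer refiner
        current = trace(refined_csv)       # step 3: collect execution feedback
    return current
```

Passing the steps in as callables keeps the loop independent of how each script names its output file, which this sketch does not assume.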
Since most of the computational overhead comes from program execution (i.e. step 3 in each iteration), you can use the helper scripts remove_dup.py and merge_csv.py in vdebugger/interative_helper to avoid redundant execution:
- Before step 3 in each iteration, remove the programs that are identical to those from the last iteration by executing:
python remove_dup.py PROGRAM_CSV_FROM_LAST_ITERATION PROGRAM_CSV_FOR_THIS_ITERATION
which will produce a deduplicated file PROGRAM_CSV_FOR_THIS_ITERATION_DEDUP
- Then collect execution feedback for the resulting PROGRAM_CSV_FOR_THIS_ITERATION_DEDUP.
- After collecting the feedback, merge the execution feedback from the last iteration and the current iteration:
python merge_csv.py EXECUTION_FEEDBACK_FROM_LAST_ITERATION EXECUTION_FEEDBACK_FOR_THIS_ITERATION_DEDUP
which will produce a file EXECUTION_FEEDBACK_FOR_THIS_ITERATION_MERGED
containing merged execution results. For the next iteration, use the merged execution results.
Some computation in steps 1 and 2 of each iteration will still be repeated, but the cost is tolerable. If this is a concern, you can modify the scripts yourself to avoid it.
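The idea behind the two helper scripts can be illustrated on plain dictionaries. This is a conceptual sketch only, not the implementation in remove_dup.py or merge_csv.py: it assumes each program is identified by some sample key, which is a hypothetical schema.

```python
from typing import Dict


def dedup(prev_programs: Dict[str, str],
          curr_programs: Dict[str, str]) -> Dict[str, str]:
    """Keep only programs that changed since the previous iteration."""
    return {
        key: code
        for key, code in curr_programs.items()
        if prev_programs.get(key) != code
    }


def merge(prev_feedback: Dict[str, str],
          new_feedback: Dict[str, str]) -> Dict[str, str]:
    """Reuse old feedback for unchanged programs, new feedback for the rest."""
    merged = dict(prev_feedback)
    merged.update(new_feedback)
    return merged
```

Only the programs returned by `dedup` need to be executed again; `merge` then restores a complete set of feedback for the next iteration.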
To reproduce our training of VDebugger, use vdebugger/training_scripts/train_{critic, refiner}.sh. You will need to install deepspeed==0.14.0.
To perform error injection and generate incorrect programs as described in Section 4 of our paper, you first need a .csv
file containing the visual programs generated for the training set and their execution results. Then, please go to vdebugger/
and run:
python error_injection.py YOUR_CSV_FILE --error_injection {greedy, mask-best}
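When scripting this step, validating the mode up front avoids a late failure. The helper below is a hypothetical wrapper that only builds the command shown above; it accepts the two modes listed there.

```python
from typing import List

ERROR_INJECTION_MODES = ("greedy", "mask-best")


def error_injection_command(csv_file: str, mode: str) -> List[str]:
    """Build the error injection command; run it from vdebugger/."""
    if mode not in ERROR_INJECTION_MODES:
        raise ValueError(
            f"mode must be one of {ERROR_INJECTION_MODES}, got {mode!r}"
        )
    return ["python", "error_injection.py", csv_file,
            "--error_injection", mode]
```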
Please cite our paper if this repository inspires your work.
@inproceedings{wu-etal-2024-vdebugger,
title = "{VD}ebugger: Harnessing Execution Feedback for Debugging Visual Programs",
author = "Wu, Xueqing and
Lin, Zongyu and
Zhao, Songyan and
Wu, Te-Lin and
Lu, Pan and
Peng, Nanyun and
Chang, Kai-Wei",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.575",
doi = "10.18653/v1/2024.findings-emnlp.575",
pages = "9845--9860"
}