[EMNLP2024 Findings] VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

shirley-wu/vdebugger

VDebugger

VDebugger: Harnessing Execution Feedback for Debugging Visual Programs, EMNLP Findings 2024

Paper, Website, Models and Data

Environment Setup

This code is partially adapted from ViperGPT. We sincerely thank the authors of ViperGPT for their great work!

To set up the environment:

  1. Clone the repository recursively:
git clone --recurse-submodules git@github.com:shirley-wu/vdebugger.git
  2. Install PyTorch according to your own environment. We used torch==2.1.2 with CUDA 12.1.
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up the ViperGPT environment:
cd viper
bash download_models.sh
export PATH=/usr/local/cuda/bin:$PATH
cd GLIP
python setup.py clean --all build develop --user
  5. If you need to use the OpenAI APIs, write your API key into viper/api.key.
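
Before moving on, it can help to confirm that the pieces produced by the setup steps are in place. The following is a minimal sketch and not part of the repository: the helper name `check_setup` is hypothetical, and the paths it inspects (`viper/`, `viper/GLIP/`, `viper/api.key`) are inferred from the commands above.

```python
from pathlib import Path


def check_setup(repo_root, need_openai=False):
    """Return a list of human-readable problems found under repo_root."""
    root = Path(repo_root)
    problems = []
    # Directories entered by the setup commands (`cd viper`, `cd GLIP`).
    for rel in ("viper", "viper/GLIP"):
        if not (root / rel).is_dir():
            problems.append(f"missing directory: {rel}")
    # An empty GLIP directory usually means the clone was not recursive.
    glip = root / "viper" / "GLIP"
    if glip.is_dir() and not any(glip.iterdir()):
        problems.append("viper/GLIP is empty; was the clone recursive?")
    # The API key is only needed when the OpenAI APIs are used.
    if need_openai:
        key = root / "viper" / "api.key"
        if not key.is_file() or not key.read_text().strip():
            problems.append("viper/api.key is missing or empty")
    return problems
```

Calling `check_setup(".", need_openai=True)` from the repository root returns an empty list when nothing is missing.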

Dataset Setup

Please follow the guidelines below to download each dataset:

  1. GQA: https://cs.stanford.edu/people/dorarad/gqa/download.html. The file structure should look as follows:
gqa/
├── questions/
│   ├── readme.txt
│   ├── {val, test, testdev, challenge}_{all, balanced}_questions.json
│   ├── submission_all_questions.json
│   ├── train_balanced_questions.json
│   └── train_all_questions/
└── images/
    └── *.jpg
  2. TallyQA: https://github.com/manoja328/TallyQA_dataset. The file structure should look as follows:
tallyqa/
├── {test, train}.json
└── {train2014, val2014, VG_100K, VG_100K_2}/
    └── *.jpg
  3. NLVRv2: https://github.com/lil-lab/nlvr/tree/master/nlvr2. The file structure should look as follows:
nlvr2/
├── balanced_{dev, test1, test2, train}.jsonl
└── {dev, test1, test2, train}/
    └── *.png
  4. RefCOCO*: https://github.com/lichengunc/refer. The file structure should look as follows:
refer/
├── refcoco/
│   ├── instances.json
│   ├── refs(google).p
│   └── refs(unc).p
├── refcoco+/
│   ├── instances.json
│   └── refs(unc).p
├── refcocog/
│   ├── instances.json
│   ├── refs(google).p
│   └── refs(umd).p
└── {train2014, train2017, val2014, val2017}/
    └── *.jpg
  5. COVR: https://covr-dataset.github.io/. The file structure should look as follows:
covr/
├── {train, val, test}.jsonl
├── gqa_images/
│   └── *.jpg
└── imSitu_images/
    └── {adjusting, ...}/
        └── *.jpg
  6. RSVG: https://github.com/ZhanYang-nwpu/RSVG-pytorch. The file structure should look as follows:
rsvg/
├── {train, val, test}.txt
├── Annotations/
│   └── *.xml
└── JPEGImages/
    └── *.jpg
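
A quick way to catch a misplaced download is to compare a dataset directory against the trees above. The sketch below is illustrative only: `EXPECTED` and `missing_entries` are hypothetical names, the table lists just a representative subset of the entries shown above, and it assumes each dataset folder sits directly under a single data directory.

```python
from pathlib import Path

# Representative entries per dataset root, transcribed from the trees above.
# This is a subset chosen for illustration; extend it as needed.
EXPECTED = {
    "gqa": ["questions", "images"],
    "tallyqa": ["test.json", "train.json"],
    "nlvr2": ["balanced_dev.jsonl", "dev"],
    "refer": [
        "refcoco/instances.json",
        "refcoco+/instances.json",
        "refcocog/instances.json",
    ],
    "covr": ["val.jsonl", "gqa_images", "imSitu_images"],
    "rsvg": ["Annotations", "JPEGImages"],
}


def missing_entries(data_dir, dataset):
    """List expected paths that do not exist under data_dir/dataset."""
    root = Path(data_dir) / dataset
    return [rel for rel in EXPECTED[dataset] if not (root / rel).exists()]
```

For example, `missing_entries("/path/to/data", "gqa")` returns an empty list once both `questions/` and `images/` are present.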

Generation and Execution of Visual Programs

Go to viper/ for this step. We recommend generating and executing the visual programs in two separate steps. Taking the GQA dataset as an example:

  1. Generate programs:
CONFIG_NAMES=generate/gqa python main_batch_generate.py

This script loads the configuration in config/generate/gqa.yaml. Remember to replace YOUR_DATA_DIR with your data directory. The generated code is saved to a CSV file under the code field.

  2. Execute and evaluate programs:
CONFIG_NAMES=execute/gqa python main_batch_execute.py

This script loads the configuration in config/execute/gqa.yaml. Again replace YOUR_DATA_DIR, and set the cached_codex_path field to the CSV file produced in step 1. The script then computes the accuracy or IoU.

  3. To obtain execution feedback:
CONFIG_NAMES=execute/gqa python main_batch_trace.py A_RANDOM_STAMP

You can use the same configuration as in step 2. To run multiple main_batch_trace.py processes at the same time, give each process a different A_RANDOM_STAMP. The execution feedback is saved to a CSV file under the traced field.
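
The three commands differ only in the configuration name, the script, and the optional stamp, so they can be driven from a small wrapper. This is a sketch under assumptions rather than a tool shipped with the repository: `SCRIPTS`, `build_step`, and `run_step` are hypothetical names, and the wrapper assumes it is launched from the repository root so that `viper/` is a valid working directory.

```python
import os
import subprocess

# step name -> (config group, script), taken from the commands above.
SCRIPTS = {
    "generate": ("generate", "main_batch_generate.py"),
    "execute": ("execute", "main_batch_execute.py"),
    "trace": ("execute", "main_batch_trace.py"),  # reuses the execute config
}


def build_step(step, dataset, stamp=None):
    """Return (env_overrides, argv) for one of the three steps."""
    config_group, script = SCRIPTS[step]
    env = {"CONFIG_NAMES": f"{config_group}/{dataset}"}
    argv = ["python", script]
    if step == "trace":
        if stamp is None:
            raise ValueError("main_batch_trace.py needs a stamp argument")
        argv.append(stamp)
    return env, argv


def run_step(step, dataset, stamp=None, cwd="viper"):
    """Run one step inside cwd and raise if the script exits with an error."""
    env, argv = build_step(step, dataset, stamp)
    subprocess.run(argv, cwd=cwd, env={**os.environ, **env}, check=True)
```

For instance, `run_step("trace", "gqa", stamp="run1")` corresponds to the step 3 command with `run1` as the stamp.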

Inference of VDebugger

Inference with VDebugger requires that you first generate and execute visual programs and obtain a CSV file containing the traced field. Then go to vdebugger/. Taking the GQA dataset and VDebugger/VDebugger-{critic, refiner}-generalist-13B as an example:

# Step 1: infer critic
python infer_critic.py VDebugger/VDebugger-critic-generalist-13B --input YOUR_CSV_CONTAINING_TRACED_FIELD --dataset gqa  # output file will be written to critic-infer.csv
# Step 2: infer refiner
python infer_refine.py critic-infer.csv VDebugger/VDebugger-refiner-generalist-13B  # output file will be written to critic-refine-infer.csv

You can then execute the programs in critic-refine-infer.csv as in step 2 of Generation and Execution of Visual Programs.
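
Because the critic expects a CSV file with a traced field, it can save a failed run to check the header first. The helper below is a minimal sketch with a hypothetical name, `missing_columns`; it only reads the header row and makes no assumption about the other columns in the file.

```python
import csv


def missing_columns(csv_path, required=("traced",)):
    """Return the required column names absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Only the first row is read, so large files are cheap to check.
        header = next(csv.reader(f), [])
    return [name for name in required if name not in header]
```

An empty result means the file has every required column; anything else names the columns to regenerate before running infer_critic.py.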

Run VDebugger for Multiple Iterations

To run VDebugger for T iterations (T > 1), first generate the initial programs and collect their execution feedback as in steps 1 and 3 of Generation and Execution of Visual Programs. Then repeat the following steps T times:

  1. Infer critic, as in Inference of VDebugger;
  2. Infer refiner, as in Inference of VDebugger;
  3. Collect execution feedback for the new programs generated by the refiner, as in step 3 of Generation and Execution of Visual Programs. The next iteration runs on top of the feedback collected in this step.

After T iterations, evaluate the final programs as in step 2 of Generation and Execution of Visual Programs.
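
The control flow of the iterative procedure can be written down as an ordered plan. The sketch below only enumerates the stages and the script each one uses; `iteration_plan` is a hypothetical helper, and how CSV files are passed from one stage to the next is left out because it depends on your configuration.

```python
def iteration_plan(num_iterations):
    """Return (stage, script) pairs for num_iterations rounds of debugging."""
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")
    # Initial programs and their execution feedback.
    plan = [
        ("generate initial programs", "main_batch_generate.py"),
        ("collect execution feedback", "main_batch_trace.py"),
    ]
    # Each round: critic, refiner, then feedback for the refined programs.
    for t in range(1, num_iterations + 1):
        plan += [
            (f"iteration {t}: infer critic", "infer_critic.py"),
            (f"iteration {t}: infer refiner", "infer_refine.py"),
            (f"iteration {t}: collect execution feedback", "main_batch_trace.py"),
        ]
    # Final evaluation of the last refined programs.
    plan.append(("evaluate final programs", "main_batch_execute.py"))
    return plan
```

Printing `iteration_plan(2)` lists nine stages: two for initialization, three per iteration, and one final evaluation.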

Since most of the computational overhead comes from program execution (i.e. step 3 in each iteration), you can use the helper scripts remove_dup.py and merge_csv.py in vdebugger/interative_helper to avoid redundant execution:

  • Before step 3 in each iteration, remove the programs that are identical to those from the last iteration by executing:
python remove_dup.py PROGRAM_CSV_FROM_LAST_ITERATION PROGRAM_CSV_FOR_THIS_ITERATION

which produces a deduplicated file PROGRAM_CSV_FOR_THIS_ITERATION_DEDUP.

  • Then collect execution feedback for the resulting PROGRAM_CSV_FOR_THIS_ITERATION_DEDUP.
  • After collecting the feedback, merge the execution feedback from the last iteration and the current iteration:
python merge_csv.py EXECUTION_FEEDBACK_FROM_LAST_ITERATION EXECUTION_FEEDBACK_FOR_THIS_ITERATION_DEDUP

which produces a file EXECUTION_FEEDBACK_FOR_THIS_ITERATION_MERGED containing the merged execution results. Use the merged results for the next iteration.

Steps 1 and 2 of each iteration will still repeat some computation, but the cost is tolerable. If it is a concern, you can modify the scripts yourself to avoid it.
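
The idea behind the two helper scripts can be illustrated on in-memory rows. This is a conceptual sketch and not the implementation of remove_dup.py or merge_csv.py: the function names are hypothetical, and the `id` column used to match samples across iterations is an assumption (only the code and traced fields are named in this README).

```python
def split_duplicates(last_rows, current_rows):
    """Separate rows whose program is unchanged since the last iteration."""
    last_code = {row["id"]: row["code"] for row in last_rows}
    unchanged, changed = [], []
    for row in current_rows:
        if last_code.get(row["id"]) == row["code"]:
            unchanged.append(row)  # feedback can be reused
        else:
            changed.append(row)  # must be executed again
    return unchanged, changed


def merge_feedback(last_feedback, new_feedback):
    """Prefer fresh feedback; keep the previous feedback for everything else."""
    merged = {row["id"]: row for row in last_feedback}
    merged.update({row["id"]: row for row in new_feedback})
    return [merged[key] for key in sorted(merged)]
```

Only the `changed` rows need to be executed again; `merge_feedback` then restores a full table in which unchanged programs keep the feedback from the previous iteration.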

Training of VDebugger

If you want to reproduce our training of VDebugger, please use vdebugger/training_scripts/train_{critic, refiner}.sh. You will need to install deepspeed==0.14.0.

Error Injection

To perform error injection and generate incorrect programs as described in Section 4 of our paper, you first need a .csv file containing the visual programs generated for the training set and their execution results. Then, please go to vdebugger/ and run:

python error_injection.py YOUR_CSV_FILE --error_injection {greedy, mask-best}

Citation

Please cite our paper if this repository inspires your work.

@inproceedings{wu-etal-2024-vdebugger,
    title = "{VD}ebugger: Harnessing Execution Feedback for Debugging Visual Programs",
    author = "Wu, Xueqing  and
      Lin, Zongyu  and
      Zhao, Songyan  and
      Wu, Te-Lin  and
      Lu, Pan  and
      Peng, Nanyun  and
      Chang, Kai-Wei",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.575",
    doi = "10.18653/v1/2024.findings-emnlp.575",
    pages = "9845--9860"
}
