JointXplore - Testing and Exploring Joint Visual-NLP Networks

This is the official repository of my report "JointXplore - Testing and Exploring Joint Visual-NLP Networks" by Leonard Schenk for the course Testing and Verification in Machine Learning.

Abstract

There is no software engineering without testing. This sentence has always been true and is even more important in light of probabilistic, potentially safety-critical neural networks. Little work has been conducted [1][2] to test multimodal networks on tasks such as Visual Question Answering (VQA). To this end, this work presents three different experiments that measure accuracy, coverage and robustness on two different multimodal neural network architectures. Additionally, it examines the effect of using only the textual input to perform VQA in each of these settings. The results reveal that both architectures achieve relatively high performance when using only text. Furthermore, different coverage metrics show that the text input alone discovers fewer internal states than the combined vision-language input. Finally, applying state-of-the-art adversarial attack methods exposes the vulnerability of multimodal neural networks.

Installation

  1. Install the requirements with:
pip install -r requirements.txt
  2. In the root folder, install LAVIS as described in the official repository.

Dataset

  1. Download the VQA 2.0 train and validation sets, including images, from the official webpage and save them under data/

  2. Run

python load_helper.py

to create smaller, pre-filtered datasets without greyscale images (a sketch of this kind of filtering step is shown below).
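
The sketch below illustrates the kind of pre-filtering load_helper.py performs. It is not the script itself: the file names, the COCO image naming scheme and the use of PIL's mode attribute to detect greyscale images are assumptions made for illustration.

import json
from pathlib import Path

from PIL import Image

DATA_DIR = Path("./data")            # illustrative layout, see step 1 above
IMAGE_DIR = DATA_DIR / "train2014"   # VQA 2.0 uses COCO train2014 images

def is_rgb(image_path: Path) -> bool:
    """True if the image has three colour channels, i.e. is not greyscale."""
    with Image.open(image_path) as img:
        return img.mode == "RGB"

def filter_questions(questions, max_size):
    """Keep questions whose image exists and is RGB, truncated to max_size entries."""
    kept = []
    for q in questions:
        image_path = IMAGE_DIR / f"COCO_train2014_{q['image_id']:012d}.jpg"
        if image_path.exists() and is_rgb(image_path):
            kept.append(q)
        if len(kept) >= max_size:
            break
    return kept

if __name__ == "__main__":
    with open(DATA_DIR / "v2_OpenEnded_mscoco_train2014_questions.json") as f:
        questions = json.load(f)["questions"]
    with open(DATA_DIR / "filtered_train_questions.json", "w") as f:
        json.dump({"questions": filter_questions(questions, max_size=5000)}, f)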

Usage

The code can be run with the following command:

python run.py --data_path <data_path, default "./data/">
  --task <"coverage_regions" | "coverage" | "adversarial_text">
  --model <"vilt" | "albef">
  --use_rnd
  --num_samples <2500 (coverage), 5000 (coverage_regions)>
  --activations_file <path to the file saved by the coverage_regions task>
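
The coverage tasks operate on recorded activations (the file passed via --activations_file). The following sketch shows, in generic PyTorch, how activations can be captured with forward hooks and reduced to a simple neuron-coverage ratio; it illustrates the general technique only and is not the repository's actual metric or file format.

import torch
from torch import nn

def record_activations(model: nn.Module, inputs: dict) -> dict:
    """Run one forward pass and capture the output of every linear layer."""
    activations, hooks = {}, []

    def make_hook(name):
        def hook(module, inp, out):
            activations[name] = out.detach()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(**inputs)
    for h in hooks:
        h.remove()
    return activations

def neuron_coverage(activations: dict, threshold: float = 0.0) -> float:
    """Fraction of neurons whose activation exceeds the threshold at least once."""
    covered, total = 0, 0
    for act in activations.values():
        flat = act.reshape(-1, act.shape[-1])    # (batch * positions, neurons)
        covered += (flat > threshold).any(dim=0).sum().item()
        total += flat.shape[-1]
    return covered / total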

Examples:

Coverage regions with ViLT and full images:

python run.py --task "coverage_regions" --model "vilt" --num_samples 5000

Coverage metrics with ALBEF and random images:

python run.py --task "coverage" --model "albef" --num_samples 2500 --use_rnd

Adversarial Attack with ViLT and random images:

python run.py --task "adversarial_text" --model "vilt" --num_attacks 80 --use_rnd

For more training options and explanations, please run scripts/train.py -h.
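
For reference, the two backbones can be loaded from their respective libraries roughly as sketched below. These are the standard loader calls, not necessarily the exact checkpoints or configuration used by run.py; the Hugging Face checkpoint name and the LAVIS registry name are assumptions and should be verified against the respective model zoos.

import torch
from PIL import Image

# ViLT via Hugging Face Transformers (public VQA checkpoint assumed)
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
vilt = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("data/example.jpg")   # illustrative path
question = "What colour is the car?"

inputs = processor(image, question, return_tensors="pt")
with torch.no_grad():
    logits = vilt(**inputs).logits
print(vilt.config.id2label[logits.argmax(-1).item()])

# ALBEF via LAVIS; list the registered models and pick the VQA variant
from lavis.models import load_model_and_preprocess, model_zoo

print(model_zoo)  # table of registered (name, model_type) pairs
albef, vis_processors, txt_processors = load_model_and_preprocess(
    name="albef_vqav2", model_type="vqav2", is_eval=True  # names assumed; verify in model_zoo
)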

Acknowledgements

I would like to thank Salesforce/LAVIS for the ALBEF model, dandelin for the ViLT model on Hugging Face, and visualqa for the dataset.

Footnotes

  1. Kim, Jaekyum, et al. "Robust deep multi-modal learning based on gated information fusion network." Asian Conference on Computer Vision. Springer, Cham, 2018.

  2. Wang, Xuezhi, Haohan Wang, and Diyi Yang. "Measure and Improve Robustness in NLP Models: A Survey." arXiv preprint arXiv:2112.08313 (2021).
