This is the official repository of my report JointXplore - Testing and Exploring Joint Visual-NLP Networks by Leonard Schenk for the course Testing and Verification in Machine Learning.
There is no software engineering without testing. This statement has always been true and is even more important in the light of probabilistic, potentially safety-critical neural networks. Little work has been conducted[^1][^2] to test multimodal networks on tasks such as Visual Question Answering (VQA). To this end, this work presents three different experiments that measure accuracy, coverage, and robustness on two different multimodal neural network architectures. Additionally, this work examines the effect of using only the textual input to perform VQA in each of these settings. The results reveal that both architectures achieve relatively high performance when using only text. Furthermore, different coverage metrics show that the text input alone discovers fewer internal states than the combined vision-language input. Finally, applying state-of-the-art adversarial attack methods points out the vulnerability of multimodal neural networks.
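As a rough illustration of the coverage side of these experiments, the sketch below computes plain neuron coverage over recorded activations. It is only a minimal sketch under assumed inputs (NumPy arrays of layer activations, an illustrative threshold, and made-up layer names), not the exact metrics used in the report.

```python
import numpy as np

def neuron_coverage(layer_activations, threshold=0.5):
    """Fraction of neurons whose activation exceeds `threshold` for at
    least one input. `layer_activations` maps layer names to arrays of
    shape (num_inputs, num_neurons); names and threshold are illustrative."""
    covered, total = 0, 0
    for acts in layer_activations.values():
        acts = np.asarray(acts)
        covered += int((acts > threshold).any(axis=0).sum())
        total += acts.shape[1]
    return covered / max(total, 1)

# Example with random stand-in activations for two (hypothetical) layers.
rng = np.random.default_rng(0)
activations = {
    "text_encoder.layer0": rng.normal(size=(100, 768)),
    "fusion.layer0": rng.normal(size=(100, 768)),
}
print(f"neuron coverage: {neuron_coverage(activations):.2%}")
```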
- Install the requirements with `pip install -r requirements.txt`
- In the root folder, install LAVIS as described in the official repository
- Download the VQA 2.0 train and validation sets, including images, from the official webpage and save them under `data/`
- Run `python load_helper.py` to create pre-filtered datasets of smaller size without greyscale images (a minimal sketch of such a filter follows this list)
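The sketch below shows what such a greyscale filter could look like. The directory layout, sample count, and the PIL-based mode check are assumptions for illustration; the actual logic lives in `load_helper.py`.

```python
from pathlib import Path
from PIL import Image

def filter_rgb_images(image_dir, max_images=5000):
    """Return up to `max_images` paths of images stored in RGB mode,
    skipping greyscale files (which PIL reports e.g. as mode "L")."""
    kept = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        with Image.open(path) as img:
            if img.mode == "RGB":
                kept.append(path)
        if len(kept) >= max_images:
            break
    return kept

# Example: pre-filter the validation images under data/ (illustrative path).
rgb_images = filter_rgb_images("data/val2014", max_images=2500)
print(f"kept {len(rgb_images)} RGB images")
```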
The code can be run with the following command:

`python run.py --data_path <data_path> --task <task> --model <model> [--use_rnd] [--num_samples <num_samples>] [--num_attacks <num_attacks>] [--activations_file <activations_file>]`

- `--data_path`: path to the data directory (default: `./data/`)
- `--task`: one of `coverage_regions`, `coverage`, `adversarial_text`
- `--model`: one of `vilt`, `albef`
- `--use_rnd`: use random images instead of the full images
- `--num_samples`: number of samples (2500 for `coverage`, 5000 for `coverage_regions`)
- `--num_attacks`: number of adversarial attacks (for `adversarial_text`, see the example below)
- `--activations_file`: path to the file that was saved after the `coverage_regions` run
Examples:

- Coverage regions with ViLT and full images: `python run.py --task "coverage_regions" --model "vilt" --num_samples 5000`
- Coverage metrics with ALBEF and random images: `python run.py --task "coverage" --model "albef" --num_samples 2500 --use_rnd`
- Adversarial attack with ViLT and random images: `python run.py --task "adversarial_text" --model "vilt" --num_attacks 80 --use_rnd`
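For reference, here is a minimal sketch of how the dandelin ViLT VQA checkpoint can be queried directly through Hugging Face `transformers`. The checkpoint name, image path, and the blank-image idea for text-only probing are assumptions for illustration, not necessarily what `run.py` does internally.

```python
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Load a ViLT checkpoint fine-tuned for VQA (assumed checkpoint name).
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("data/example.jpg").convert("RGB")  # illustrative path
question = "What color is the cat?"

# To probe text-only behaviour (an assumption about how one might do it),
# a blank image could be passed instead:
# image = Image.new("RGB", (384, 384), color=(128, 128, 128))

# Encode the image-question pair and pick the highest-scoring answer.
inputs = processor(image, question, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```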
For more training options and explanations, please run `python scripts/train.py -h`.
I would like to thank Salesforce/LAVIS for the ALBEF model, dandelin/ViLT for the ViLT model on Hugging Face, and visualqa for the dataset.
[^1]: Kim, Jaekyum, et al. "Robust deep multi-modal learning based on gated information fusion network." Asian Conference on Computer Vision. Springer, Cham, 2018.
[^2]: Wang, Xuezhi, Haohan Wang, and Diyi Yang. "Measure and Improve Robustness in NLP Models: A Survey." arXiv preprint arXiv:2112.08313 (2021).