This is the official repository of my report JointXplore - Testing and Exploring Joint Visual-NLP Networks by Leonard Schenk for the course Testing and Verification in Machine Learning.
There is no software engineering without testing. This statement has always been true and is even more important in the light of probabilistic, potentially safety-critical neural networks. Little work has been conducted[^1][^2] to test multimodal networks on tasks such as Visual Question Answering (VQA). To this end, this work presents three different experiments that measure accuracy, coverage, and robustness on two different multimodal neural network architectures. Additionally, this work examines the effect of using only the textual input to perform VQA in each of these settings. The results reveal that both architectures achieve relatively high performance when using only text. Furthermore, different coverage metrics show that the text input alone discovers fewer internal states than the combined vision-language input. Finally, applying state-of-the-art adversarial attack methods points out the vulnerability of multimodal neural networks.
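As a rough illustration of the coverage side of these experiments, the sketch below computes plain neuron coverage over recorded activations. It is only a minimal sketch under assumed inputs (NumPy arrays of layer activations, an illustrative threshold, and made-up layer names), not the exact metrics used in the report.

```python
import numpy as np

def neuron_coverage(layer_activations, threshold=0.5):
    """Fraction of neurons whose activation exceeds `threshold` for at
    least one input. `layer_activations` maps layer names to arrays of
    shape (num_inputs, num_neurons); names and threshold are illustrative."""
    covered, total = 0, 0
    for acts in layer_activations.values():
        acts = np.asarray(acts)
        covered += int((acts > threshold).any(axis=0).sum())
        total += acts.shape[1]
    return covered / max(total, 1)

# Example with random stand-in activations for two (hypothetical) layers.
rng = np.random.default_rng(0)
activations = {
    "text_encoder.layer0": rng.normal(size=(100, 768)),
    "fusion.layer0": rng.normal(size=(100, 768)),
}
print(f"neuron coverage: {neuron_coverage(activations):.2%}")
```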
- Install the requirements with `pip install -r requirements.txt`
- In the root folder, install LAVIS as described in the official repository
- Download the VQA 2.0 train and validation sets, including images, from the official webpage and save them under `data/`
- Run `python load_helper.py` to create pre-filtered datasets of smaller size without greyscale images (a minimal sketch of such a filter follows this list)
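The sketch below shows what such a greyscale filter could look like. The directory layout, sample count, and the PIL-based mode check are assumptions for illustration; the actual logic lives in `load_helper.py`.

```python
from pathlib import Path
from PIL import Image

def filter_rgb_images(image_dir, max_images=5000):
    """Return up to `max_images` paths of images stored in RGB mode,
    skipping greyscale files (which PIL reports e.g. as mode "L")."""
    kept = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        with Image.open(path) as img:
            if img.mode == "RGB":
                kept.append(path)
        if len(kept) >= max_images:
            break
    return kept

# Example: pre-filter the validation images under data/ (illustrative path).
rgb_images = filter_rgb_images("data/val2014", max_images=2500)
print(f"kept {len(rgb_images)} RGB images")
```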
The code can be run with the following command:

`python run.py --data_path <data_path> --task <task> --model <model> [--use_rnd] [--num_samples <num_samples>] [--num_attacks <num_attacks>] [--activations_file <activations_file>]`

- `--data_path`: path to the data directory (default: `./data/`)
- `--task`: one of `coverage_regions`, `coverage`, `adversarial_text`
- `--model`: one of `vilt`, `albef`
- `--use_rnd`: use random images instead of the full images
- `--num_samples`: number of samples (2500 for `coverage`, 5000 for `coverage_regions`)
- `--num_attacks`: number of adversarial attacks (for `adversarial_text`, see the example below)
- `--activations_file`: path to the file that was saved after the `coverage_regions` run
Examples:

- Coverage regions with ViLT and full images: `python run.py --task "coverage_regions" --model "vilt" --num_samples 5000`
- Coverage metrics with ALBEF and random images: `python run.py --task "coverage" --model "albef" --num_samples 2500 --use_rnd`
- Adversarial attack with ViLT and random images: `python run.py --task "adversarial_text" --model "vilt" --num_attacks 80 --use_rnd`
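For reference, here is a minimal sketch of how the dandelin ViLT VQA checkpoint can be queried directly through Hugging Face `transformers`. The checkpoint name, image path, and the blank-image idea for text-only probing are assumptions for illustration, not necessarily what `run.py` does internally.

```python
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Load a ViLT checkpoint fine-tuned for VQA (assumed checkpoint name).
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("data/example.jpg").convert("RGB")  # illustrative path
question = "What color is the cat?"

# To probe text-only behaviour (an assumption about how one might do it),
# a blank image could be passed instead:
# image = Image.new("RGB", (384, 384), color=(128, 128, 128))

# Encode the image-question pair and pick the highest-scoring answer.
inputs = processor(image, question, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```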
For more training options and explanations, please run `python scripts/train.py -h`.
I would like to thank Salesforce/LAVIS for the ALBEF model, dandelin/ViLT for the ViLT model on Hugging Face, and visualqa for the dataset.
[^1]: Kim, Jaekyum, et al. "Robust deep multi-modal learning based on gated information fusion network." Asian Conference on Computer Vision. Springer, Cham, 2018.
[^2]: Wang, Xuezhi, Haohan Wang, and Diyi Yang. "Measure and Improve Robustness in NLP Models: A Survey." arXiv preprint arXiv:2112.08313 (2021).