Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions
NAACL 2025
Moran Yanuka, Assaf Ben-Kish, Yonatan Bitton, Idan Szpektor, Raja Giryes
Small-scale vision-language models (VLMs) struggle to balance descriptiveness and hallucination when fine-tuned on long, detailed captions. We introduce Decomposed NLI (DNLI), a fine-grained evaluation framework that assesses caption quality by breaking down generated text into individual propositions. Our findings show that reducing caption complexity or using standard data curation alone is insufficient to mitigate hallucinations effectively. To address this, we propose Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that aligns training captions with the model's existing knowledge and visual understanding. KnowAda reduces hallucinations while preserving high descriptiveness, outperforming baselines on both automatic metrics and human evaluations.
To get started, clone the repository:
git clone https://github.com/moranyanuka/knowada.git
cd knowada
To set up our environment, please run:
conda env create -f environment.yml
conda activate knowada
Add your Gemini API key:
export API_KEY='<your-key>'
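The pipeline's Gemini calls read this key from the environment. As a quick sanity check, here is a minimal sketch assuming the google-generativeai Python client (the repository's exact usage may differ):

import os
import google.generativeai as genai

# Assumes the key was exported as API_KEY, as shown above.
genai.configure(api_key=os.environ["API_KEY"])

# Hypothetical smoke test: list the models visible to this key.
for model in genai.list_models():
    print(model.name)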
First, create a CSV file with the following columns:
- original_description: Contains the ground-truth image description from the evaluation dataset
- generated_description: Contains the generated description of the VLM to evaluate
See an example of such a file here.
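For illustration, one way to assemble such a file with pandas (the file name and example rows below are hypothetical):

import pandas as pd

# Hypothetical rows; the column names match what the evaluation script expects.
df = pd.DataFrame({
    "original_description": ["A red bicycle leans against a brick wall."],
    "generated_description": ["A blue bicycle is parked next to a wall."],
})
df.to_csv("model_generations.csv", index=False)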
Then, run the following command:
python eval/generate_propositions.py \
--df_path <path-to-model-generation> \
--output_dir <path-to-evaluation-output>
The script will write the propositions of the ground-truth descriptions, the propositions of the generated descriptions, and the final metrics to output_dir.
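Conceptually, the metrics compare the two sets of propositions using entailment checks. The sketch below is illustrative only, not the repository's implementation; the entails callback is a hypothetical stand-in for an NLI model:

from typing import Callable, List

def proposition_scores(gt_props: List[str],
                       gen_props: List[str],
                       entails: Callable[[str, str], bool]) -> dict:
    # A generated proposition counts as faithful if the ground-truth
    # description entails it; a ground-truth proposition counts as
    # covered if the generated description entails it.
    gt_text = " ".join(gt_props)
    gen_text = " ".join(gen_props)
    precision = sum(entails(gt_text, p) for p in gen_props) / max(len(gen_props), 1)
    recall = sum(entails(gen_text, p) for p in gt_props) / max(len(gt_props), 1)
    return {"precision": precision, "recall": recall}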
To rewrite the DOCCI captions according to the knowledge gaps of PaliGemma, run the following script:
python run.py \
--generate_questions True \
--generate_answers True \
--generate_judgments True \
--generate_rewritten_descriptions True \
--output_folder <path-to-output-directory>
This will generate the following files:
- questions.csv: Contains the generated questions based on the image descriptions
- answers.csv: The VLM's sampled answers to each question
- judgments.csv: The judgments determining whether a given answer is correct, on a scale of 1-3 (1 is completely incorrect, 3 is completely correct)
- difficult_questions_list.csv: Contains, for each description, all the questions that are considered unknown for a given threshold
- rewritten_captions.csv: The final rewritten captions based on the unknown questions
You can adjust the parameters of each pipeline stage through the config files (e.g., the train/test split, the difficulty_threshold that determines whether a question is unknown, etc.).
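For intuition, here is a minimal sketch of how such a threshold could be applied to the judgment scores. The column names are hypothetical and may not match the repository's actual schema:

import pandas as pd

# Hypothetical schema: one row per sampled answer, with the judge's 1-3 score.
judgments = pd.read_csv("judgments.csv")  # columns: description_id, question, judgment

difficulty_threshold = 2.0  # example value; set via the config files

# Average the judgments over the sampled answers for each question; questions
# whose mean score falls below the threshold are treated as unknown to the VLM.
mean_scores = judgments.groupby(["description_id", "question"])["judgment"].mean()
unknown = mean_scores[mean_scores < difficulty_threshold].reset_index()
unknown.to_csv("difficult_questions_list.csv", index=False)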
If you find this useful for your research, please cite the following:
@article{yanuka2024bridging,
title={Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions},
author={Yanuka, Moran and Kish, Assaf Ben and Bitton, Yonatan and Szpektor, Idan and Giryes, Raja},
journal={arXiv preprint arXiv:2411.09018},
year={2024}
}