Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions

NAACL 2025

Moran Yanuka, Assaf Ben-Kish, Yonatan Bitton, Idan Szpektor, Raja Giryes

Small-scale vision-language models (VLMs) struggle to balance descriptiveness and hallucination when fine-tuned on long, detailed captions. We introduce Decomposed NLI (DNLI), a fine-grained evaluation framework that assesses caption quality by breaking down generated text into individual propositions. Our findings show that reducing caption complexity or using standard data curation alone is insufficient to mitigate hallucinations effectively. To address this, we propose Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that aligns training captions with the model's existing knowledge and visual understanding. KnowAda reduces hallucinations while preserving high descriptiveness, outperforming baselines on both automatic metrics and human evaluations.
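For intuition, here is a toy illustration (made up for this README, not taken from the paper's data) of what decomposing a generated caption into atomic propositions looks like; each proposition can then be checked against the ground-truth description independently:

# Toy illustration of proposition-level decomposition (hypothetical example).
generated_caption = "A brown dog runs on a sandy beach while two children fly a red kite."

# Atomic propositions extracted from the caption; each is verified separately
# against the ground-truth description.
propositions = [
    "There is a dog.",
    "The dog is brown.",
    "The dog is running.",
    "The scene is a sandy beach.",
    "There are two children.",
    "The children are flying a kite.",
    "The kite is red.",
]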


Setup

Clone Project

git clone https://github.com/moranyanuka/knowada.git
cd knowada

Create the Environment

To set up our environment, please run:

conda env create -f environment.yml
conda activate knowada

Add your Gemini API key:

export API_KEY='<your-key>'
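The scripts read the key from this environment variable. As a minimal sketch of how such a key is typically consumed (assuming the google-generativeai Python client; the exact client setup and model name used in this repo may differ):

import os
import google.generativeai as genai

# Read the key exported above; raises KeyError if the variable is missing.
genai.configure(api_key=os.environ["API_KEY"])

# Model name here is only an example, not necessarily the repo's setting.
model = genai.GenerativeModel("gemini-1.5-flash")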

DNLI Dense Caption Evaluation

To run the DNLI evaluation on dense captions generated by your VLM:

First, create a CSV file with the following columns:

  • original_description: The ground-truth image description from the evaluation dataset
  • generated_description: The description generated by the VLM being evaluated

See an example of such a file here.
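If you do not have such a file yet, a minimal way to build one with pandas (column names as required above; the row contents below are placeholders):

import pandas as pd

# Each row pairs a ground-truth description with the VLM's generation for the same image.
df = pd.DataFrame({
    "original_description": ["A ground-truth description from the evaluation dataset."],
    "generated_description": ["The description generated by the VLM being evaluated."],
})
df.to_csv("model_generations.csv", index=False)  # pass this path as --df_path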

Then, run the following command:

python eval/generate_propositions.py \
       --df_path <path-to-model-generation> \
       --output_dir <path-to-evaluation-output>

The script will write the propositions of the ground truth descriptions, the propositions of the generated descriptions, and the final metrics to output_dir.
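As an illustrative sketch only (the exact file names written to output_dir may differ), the outputs can be inspected with pandas:

import glob
import os
import pandas as pd

output_dir = "<path-to-evaluation-output>"

# List the CSVs the script produced and preview each one.
for path in sorted(glob.glob(os.path.join(output_dir, "*.csv"))):
    print(path)
    print(pd.read_csv(path).head())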

KnowAda

To rewrite the DOCCI captions according to the knowledge gaps of PaliGemma, run the following script:

python run.py \
       --generate_questions True \
       --generate_answers True \
       --generate_judgments True \
       --generate_rewritten_descriptions True \
       --output_folder <path-to-output-directory>

This will generate the following files:

  • questions.csv: Contains the generated questions based on the image descriptions
  • answers.csv: The VLM's sampled answers to each question
  • judgments.csv: The judgments determining whether a given answer is correct on a scale of 1-3 (1 is completely incorrect, 3 is completely correct)
  • difficult_questions_list.csv: Contains, for each description, all the questions that are considered unknown for a given threshold
  • rewritten_captions.csv: The final rewritten captions based on the unknown questions

You can adjust the parameters of each pipeline stage through the config files (e.g., the train/test split and the difficulty_threshold that determines whether a question is unknown).
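As an illustrative sketch of how such a threshold could be applied (the actual logic lives in the pipeline and may differ, and the column names below are hypothetical), a question can be flagged as unknown when the model's average judgment score over its sampled answers falls below difficulty_threshold:

import pandas as pd

# Hypothetical column names for illustration; see judgments.csv for the actual schema.
judgments = pd.read_csv("judgments.csv")
difficulty_threshold = 2.0  # example value; set it in the config files

# Average the 1-3 correctness judgments over the sampled answers to each question,
# and flag questions the model mostly answers incorrectly as "unknown".
mean_scores = judgments.groupby("question_id")["judgment"].mean()
unknown_questions = mean_scores[mean_scores < difficulty_threshold].index.tolist()
print(f"{len(unknown_questions)} questions flagged as unknown")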

Citation

If you find this useful for your research, please cite the following:

@article{yanuka2024bridging,
  title={Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions},
  author={Yanuka, Moran and Ben-Kish, Assaf and Bitton, Yonatan and Szpektor, Idan and Giryes, Raja},
  journal={arXiv preprint arXiv:2411.09018},
  year={2024}
}
