Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions

NAACL 2025

Moran Yanuka, Assaf Ben-Kish, Yonatan Bitton, Idan Szpektor, Raja Giryes

Small-scale vision-language models (VLMs) struggle to balance descriptiveness and hallucination when fine-tuned on long, detailed captions. We introduce Decomposed NLI (DNLI), a fine-grained evaluation framework that assesses caption quality by breaking down generated text into individual propositions. Our findings show that reducing caption complexity or using standard data curation alone is insufficient to mitigate hallucinations effectively. To address this, we propose Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that aligns training captions with the model's existing knowledge and visual understanding. KnowAda reduces hallucinations while preserving high descriptiveness, outperforming baselines on both automatic metrics and human evaluations.
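For intuition, here is a toy illustration (made up for this README, not taken from the paper's data) of what decomposing a generated caption into atomic propositions looks like; each proposition can then be checked against the ground-truth description independently:

# Toy illustration of proposition-level decomposition (hypothetical example).
generated_caption = "A brown dog runs on a sandy beach while two children fly a red kite."

# Atomic propositions extracted from the caption; each is verified separately
# against the ground-truth description.
propositions = [
    "There is a dog.",
    "The dog is brown.",
    "The dog is running.",
    "The scene is a sandy beach.",
    "There are two children.",
    "The children are flying a kite.",
    "The kite is red.",
]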


Setup

Clone Project

git clone https://github.com/moranyanuka/knowada.git
cd knowada

Create the Environment

To set up our environment, please run:

conda env create -f environment.yml
conda activate knowada

Add your Gemini API key:

export API_KEY='<your-key>'
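The scripts read the key from this environment variable. As a minimal sketch of how such a key is typically consumed (assuming the google-generativeai Python client; the exact client setup and model name used in this repo may differ):

import os
import google.generativeai as genai

# Read the key exported above; raises KeyError if the variable is missing.
genai.configure(api_key=os.environ["API_KEY"])

# Model name here is only an example, not necessarily the repo's setting.
model = genai.GenerativeModel("gemini-1.5-flash")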

DNLI Dense Caption Evaluation

To run the DNLI evaluation on dense captions generated by your VLM:

First, create a CSV file with the following columns:

  • original_description: The ground-truth image description from the evaluation dataset
  • generated_description: The description generated by the VLM being evaluated

See an example of such a file here.
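If you do not have such a file yet, a minimal way to build one with pandas (column names as required above; the row contents below are placeholders):

import pandas as pd

# Each row pairs a ground-truth description with the VLM's generation for the same image.
df = pd.DataFrame({
    "original_description": ["A ground-truth description from the evaluation dataset."],
    "generated_description": ["The description generated by the VLM being evaluated."],
})
df.to_csv("model_generations.csv", index=False)  # pass this path as --df_path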

Then, run the following command:

python eval/generate_propositions.py \
       --df_path <path-to-model-generation> \
       --output_dir <path-to-evaluation-output>

The script will write the propositions of the ground truth descriptions, the propositions of the generated descriptions, and the final metrics to output_dir.
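As an illustrative sketch only (the exact file names written to output_dir may differ), the outputs can be inspected with pandas:

import glob
import os
import pandas as pd

output_dir = "<path-to-evaluation-output>"

# List the CSVs the script produced and preview each one.
for path in sorted(glob.glob(os.path.join(output_dir, "*.csv"))):
    print(path)
    print(pd.read_csv(path).head())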

KnowAda

To rewrite the DOCCI captions according to the knowledge gaps of PaliGemma, run the following script:

python run.py \
       --generate_questions True \
       --generate_answers True \
       --generate_judgments True \
       --generate_rewritten_descriptions True \
       --output_folder <path-to-output-directory>

This will generate the following files:

  • questions.csv: Contains the generated questions based on the image descriptions
  • answers.csv: The VLM's sampled answers to each question
  • judgments.csv: The judgments determining whether a given answer is correct on a scale of 1-3 (1 is completely incorrect, 3 is completely correct)
  • difficult_questions_list.csv: Contains, for each description, all the questions that are considered unknown for a given threshold
  • rewritten_captions.csv: The final rewritten captions based on the unknown questions

You can adjust the parameters of each pipeline stage through the config files (e.g., the train/test split and the difficulty_threshold that determines whether a question is unknown).
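As an illustrative sketch of how such a threshold could be applied (the actual logic lives in the pipeline and may differ, and the column names below are hypothetical), a question can be flagged as unknown when the model's average judgment score over its sampled answers falls below difficulty_threshold:

import pandas as pd

# Hypothetical column names for illustration; see judgments.csv for the actual schema.
judgments = pd.read_csv("judgments.csv")
difficulty_threshold = 2.0  # example value; set it in the config files

# Average the 1-3 correctness judgments over the sampled answers to each question,
# and flag questions the model mostly answers incorrectly as "unknown".
mean_scores = judgments.groupby("question_id")["judgment"].mean()
unknown_questions = mean_scores[mean_scores < difficulty_threshold].index.tolist()
print(f"{len(unknown_questions)} questions flagged as unknown")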

Citation

If you find this useful for your research, please cite the following:

@article{yanuka2024bridging,
  title={Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions},
  author={Yanuka, Moran and Ben-Kish, Assaf and Bitton, Yonatan and Szpektor, Idan and Giryes, Raja},
  journal={arXiv preprint arXiv:2411.09018},
  year={2024}
}
