LLaVA-UHD

A Large Multimodal Model Perceiving Any Aspect Ratio and High-Resolution Images

This repository hosts the code, data, and model weights of LLaVA-UHD, a novel framework that enables Large Multimodal Models (LMMs) to efficiently perceive images of any aspect ratio and at high resolution. Notably, our model, built on LLaVA-1.5 336×336, supports images at 6 times higher resolution (i.e., 672×1088) and achieves a 5.7-point accuracy improvement on TextVQA. Moreover, the model can be trained efficiently in academic settings, within ~1 day on 8 A100 GPUs. See our 📃 paper (arXiv:2403.11703).

Overview

LLaVA-UHD includes three key components to deal with native-resolution images:

An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding.
A novel compression module (spatially constrained resampler) that further condenses image tokens from visual encoders.
A spatial schema to organize slice tokens for LLMs.

Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 8 benchmarks.
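The image modularization and spatial schema described above can be illustrated with a minimal Python sketch. It is not the repository's implementation. The function names, the `max_slices` cap, the scoring rule (match the grid's aspect ratio to the image's so that slices stay close to the encoder's square input), and the `,` and newline separators are all assumptions made for illustration.

```python
import math


def choose_slice_grid(width, height, encoder_size=336, max_slices=6):
    """Pick a (cols, rows) grid whose slices best match a square encoder input.

    Hypothetical helper: the names, the max_slices cap, and the scoring rule
    are assumptions, not the repository's API.
    """
    # Ideal slice count: image area divided by the encoder's input area.
    ideal = math.ceil(width * height / (encoder_size ** 2))
    ideal = max(1, min(ideal, max_slices))

    # Consider factorizations of the ideal count and its neighbours, so that
    # a prime count (e.g. 5) does not force a single long strip.
    candidates = []
    for count in (ideal - 1, ideal, ideal + 1):
        if 1 <= count <= max_slices:
            for cols in range(1, count + 1):
                if count % cols == 0:
                    candidates.append((cols, count // cols))

    # Prefer the grid whose aspect ratio is closest to the image's, which
    # keeps each slice close to the encoder's square shape.
    target = math.log(width / height)
    return min(candidates, key=lambda g: abs(target - math.log(g[0] / g[1])))


def layout_slice_tokens(cols, rows, placeholder="<slice>"):
    """Arrange slice placeholders row by row (assumed separators: ',' and newline)."""
    row = ",".join([placeholder] * cols)
    return "\n".join([row] * rows)


cols, rows = choose_slice_grid(672, 1008)
print(cols, rows)
print(layout_slice_tokens(cols, rows))
```

For a 672×1008 image this sketch picks a grid of 2 columns and 3 rows, so each slice is 336×336, and the layout string tells the LLM where each slice sits in the original image.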

Release

-[2024/07/29] 🔥LLaVA-UHD improves over LLaVA-1.5 on 8 common benchmarks. Our novel projector, the spatially constrained resampler, achieves high feature compression and efficient convergence. Model checkpoints are available on Hugging Face.

-[2024/07/01] 📢LLaVA-UHD has been accepted to ECCV 2024.

Environment Preparing

To reproduce the results of the paper, please set up the Python environment using the following commands:

conda create -n llava-uhd python=3.10
conda activate llava-uhd
pip install -r requirements.txt
sh install.sh

Download the CLIP-ViT-L/14 and Vicuna-13B-v1.5 checkpoints and put them into ./pretrained_models.

If something goes wrong, please refer to the issues of LLaVA or submit an issue in our repository.

Data Preparing

Pretraining Data: Download the 558K subset of the LAION-CC-SBU dataset with BLIP captions that we use in the paper here, and put the data into ./playground/data. You can also refer to the documentation of LLaVA for detailed data organization.
Fine-tuning Data: Please download the annotation file of the final mixture of our instruction tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script, we save all files as .jpg
- TextCaps: train_val_images
- VisualGenome: part1, part2
Download the dataset images as in the fine-tuning process of LLaVA-1.5 and place them in ./playground/data.
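Before launching training, it can help to check that the image folders are in place. The sketch below is a hypothetical helper, not part of this repository. The sub-directory names are an assumption based on the layout used in the LLaVA-1.5 fine-tuning documentation, so adjust them if your folders are organized differently.

```python
from pathlib import Path

# Assumed layout, taken from the LLaVA-1.5 fine-tuning documentation;
# change these names if your data is organized differently.
EXPECTED_IMAGE_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]


def missing_image_dirs(root, expected=EXPECTED_IMAGE_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_image_dirs("./playground/data")
    if missing:
        print("Missing image folders:", ", ".join(missing))
    else:
        print("All expected image folders are present.")
```

Run it from the repository root; an empty result means every assumed folder was found under ./playground/data.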

Training Script

Please refer to train.sh for the pretraining and fine-tuning scripts (they are marked with comments in the file). To run pretraining, fine-tuning, and evaluation end to end, run the following command:

sh train.sh

Evaluation Code

The evaluation script is eval.sh. You can run it as follows:

sh eval.sh dir_name_in_checkpoints_new
# e.g. sh eval.sh llava-uhd-144-13b
# llava-uhd-144-13b is the dir_name stored in the path of ./checkpoints_new

For details on how to organize the evaluation data, please refer to here. We provide the same script to run the tests.

Citation

If you find LLaVA-UHD useful for your research and applications, please cite using this BibTeX:

@article{guo2024llava-uhd,
  title={{LLaVA-UHD}: an LMM Perceiving Any Aspect Ratio and High-Resolution Images},
  author={Guo, Zonghao and Xu, Ruyi and Yao, Yuan and Cui, Junbo and Ni, Zanlin and Ge, Chunjiang and Chua, Tat-Seng and Liu, Zhiyuan and Huang, Gao},
  journal={arXiv preprint arXiv:2403.11703},
  year={2024}
}

